Description
Hi, I noticed an issue with the must_include evaluation logic. While running the WebArena tasks through this repo, I found an inconsistency in how task 231 is evaluated.
For task 231, the attempted answer was stop [000000170], and the config for this task specifies "must_include": ["170"]. visualwebarena evaluated this answer as a fail, even though it contains "170" as the must_include criterion requires. The exact same answer was marked as a pass in webarena.
The two evaluators appear to apply must_include conditions differently, which can lead to incorrect pass/fail markings. It would be great to refine this logic so that answers are evaluated consistently against the specified requirements.
In the webarena evaluator:
{'raw_action': 'stop [000000170]', 'full response': 'The objective is to find the order number of the most recent cancelled order. Observing the "Recent Orders" section, the order with the status "Canceled" is listed first. The order number for this is "000000170". \n\nIn summary, the next action I will perform is ```stop [000000170]```.'}
2024-11-03 15:09:10,124 - INFO - [Result] (PASS) /tmp/tmpz5cvc9pf/231.json
In the visualwebarena evaluator:
{'raw_action': 'stop [000000170]', 'full response': 'Let's think through this step-by-step: 1. We're on the account page, which shows recent orders. 2. The objective is to find the order number of the most recent cancelled order. 3. Looking at the "Recent Orders" section, we can see several orders listed with their status. 4. The most recent cancelled order is the first one in the list, with Order # 000000170. 5. This order is dated 5/17/23 and has a status of "Canceled". Therefore, we have found the information we need to answer the objective. In summary, the next action I will perform is ```stop [000000170]```.'}
2024-11-03 15:09:10,124 - INFO - [Result] (FAIL) /tmp/tmpz5cvc9pf/231.json
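To illustrate how the same answer could pass under one evaluator and fail under the other, here is a minimal sketch of two plausible matching semantics for must_include. This is only an assumption about the kind of difference involved: the function names `must_include_substring` and `must_include_token` are hypothetical and are not taken from either repository's evaluator code.

```python
import re


def must_include_substring(ref: str, pred: str) -> bool:
    """Hypothetical check: pass if ref appears anywhere inside pred."""
    return ref.strip().lower() in pred.strip().lower()


def must_include_token(ref: str, pred: str) -> bool:
    """Hypothetical check: pass only if ref equals a whole token of pred."""
    tokens = re.findall(r"\w+", pred.lower())
    return ref.strip().lower() in tokens


answer = "000000170"  # the answer from task 231
required = "170"      # the must_include value from the task config

# Substring containment accepts the answer; whole-token matching rejects it.
print(must_include_substring(required, answer))
print(must_include_token(required, answer))
```

If the two evaluators differ along these lines, the answer 000000170 would satisfy a substring check but not a whole-token check, which would match the PASS/FAIL split shown in the logs above.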