Hi, I'd like to report a small bug in evaluation due to tokenization.
It occurs when the answer contains "0" followed by trailing punctuation (e.g. ...score): 0. Method...).
NLTK tokenizes "0." as one token, so "0" not in word_tokenize(pred) → failure, even though the prediction clearly states the answer.
The temporary fix I used is to strip trailing punctuation from each token with rstrip, i.e., tok_pred = [t.rstrip('.,;:!?') for t in tok_pred]
Below is a minimal code snippet to reproduce the issue.
from nltk.tokenize import word_tokenize


def clean_answer(answer: str) -> str:
    # Strip one pair of matching surrounding quotes, then lowercase.
    if answer.startswith("'") and answer.endswith("'"):
        answer = answer[1:-1]
    elif answer.startswith('"') and answer.endswith('"'):
        answer = answer[1:-1]
    return answer.lower()


def must_include(ref: str, pred: str) -> float:
    """Copy of StringEvaluator.must_include from evaluators.py (lines 166-176)."""
    clean_ref = clean_answer(ref)
    clean_pred = clean_answer(pred)
    if len(word_tokenize(clean_ref)) == 1:
        # Single-token reference: require an exact token match in the prediction.
        tok_pred = word_tokenize(clean_pred)
        ## fix: remove trailing punctuation from pred tokens
        # tok_pred = [t.rstrip('.,;:!?') for t in tok_pred]
        return float(clean_ref in tok_pred)
    return float(clean_ref in clean_pred)


PRED = (
    "Number of comments with more downvotes than upvotes (i.e. negative displayed score): 0. "
    "Method: opened WorcesterMA -> opened latest post -> opened author profile -> "
    "Comments tab -> loaded all comments -> extracted scores and counted negatives. "
    "Success: true"
)

if __name__ == "__main__":
    score = must_include("0", PRED)
    print("must_include('0', PRED) =", score)
    assert score == 1.0, "False negative: answer clearly states count is 0"
    print("PASSED")

Note: This happens rarely in my experiments, but I wanted to share it here in case anyone is considering future releases of Visual WebArena. WebArena-Verified recently resolved such problems by using type-aware normalization in evaluation.
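The rstrip workaround described above can also be illustrated without NLTK installed. This is only a minimal sketch under stated assumptions: the token list is hand-written to mimic the tokenization reported in this issue (with "0." kept as one token), and the helper name strip_trailing_punct is hypothetical, not part of evaluators.py.

```python
# Characters stripped from the right end of each token
# (the same set used in the workaround above).
TRAILING_PUNCT = ".,;:!?"


def strip_trailing_punct(tokens: list[str]) -> list[str]:
    """Hypothetical helper: drop trailing punctuation from every token."""
    stripped = (t.rstrip(TRAILING_PUNCT) for t in tokens)
    # Tokens that were pure punctuation become empty strings; discard them.
    return [t for t in stripped if t]


# Hand-written stand-in for word_tokenize output, ASSUMING "0." stays one token
# as reported in this issue. It is not produced by NLTK here.
tok_pred = ["negative", "displayed", "score", ")", ":", "0.", "method", ":", "opened"]

before = "0" in tok_pred                       # exact-token check fails on "0."
after = "0" in strip_trailing_punct(tok_pred)  # succeeds once "0." becomes "0"
print(before, after)  # prints: False True
```

One trade-off of this approach: a single-token reference that itself ends in punctuation would no longer match after stripping, so the same normalization would probably need to be applied to the reference as well.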