False negatives in evaluation due to tokenization

Hi, I'd like to report a small bug in evaluation due to tokenization.

When the answer contains "0" followed by punctuation (e.g. `...score): 0. Method...`), NLTK tokenizes "0." as a single token, so `"0" in word_tokenize(pred)` is `False` and the evaluator reports a failure.

Below is a minimal code snippet to reproduce the issue.
The temporary fix I used is to strip trailing punctuation from each token, i.e., `tok_pred = [t.rstrip('.,;:!?') for t in tok_pred]`.

```python
from nltk.tokenize import word_tokenize

def clean_answer(answer: str) -> str:
    if answer.startswith("'") and answer.endswith("'"):
        answer = answer[1:-1]
    elif answer.startswith('"') and answer.endswith('"'):
        answer = answer[1:-1]
    return answer.lower()


def must_include(ref: str, pred: str) -> float:
    """Copy of StringEvaluator.must_include from evaluators.py (lines 166-176)."""
    clean_ref = clean_answer(ref)
    clean_pred = clean_answer(pred)
    if len(word_tokenize(clean_ref)) == 1:
        tok_pred = word_tokenize(clean_pred)
        # Workaround: uncomment the next line to strip trailing punctuation from pred tokens.
        # tok_pred = [t.rstrip('.,;:!?') for t in tok_pred]
        return float(clean_ref in tok_pred)
    return float(clean_ref in clean_pred)


PRED = (
    "Number of comments with more downvotes than upvotes (i.e. negative displayed score): 0. "
    "Method: opened WorcesterMA -> opened latest post -> opened author profile -> "
    "Comments tab -> loaded all comments -> extracted scores and counted negatives. "
    "Success: true"
)

if __name__ == "__main__":
    score = must_include("0", PRED)
    print("must_include('0', PRED) =", score)
    assert score == 1.0, "False negative: answer clearly states count is 0"
    print("PASSED")
```
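For reference, here is a minimal sketch of the workaround on its own, without the NLTK dependency. The helper names (`strip_trailing_punct`, `single_token_match`) are hypothetical and not part of `evaluators.py`, and the token list is hand-written to mimic the tokenization described above rather than produced by `word_tokenize`. It also keeps tokens that consist only of punctuation unchanged, which is my own assumption so that they do not collapse into empty strings.

```python
from typing import List

# Same character set as in the temporary fix above.
TRAILING_PUNCT = ".,;:!?"


def strip_trailing_punct(tokens: List[str]) -> List[str]:
    """Strip trailing punctuation from each token.

    Tokens made up only of punctuation (e.g. ":") are kept as-is
    instead of becoming empty strings (an assumption of this sketch).
    """
    return [t.rstrip(TRAILING_PUNCT) or t for t in tokens]


def single_token_match(ref: str, tokens: List[str]) -> float:
    """Membership check for a single-token reference after stripping."""
    return float(ref in strip_trailing_punct(tokens))


# Hand-written tokens imitating the reported behaviour: "0." stays one token.
tokens = ["displayed", "score", ")", ":", "0.", "method", ":", "opened"]

print("without workaround:", float("0" in tokens))
print("with workaround:", single_token_match("0", tokens))
```

Note that this only handles trailing punctuation. A token with leading punctuation, or a reference that itself ends in punctuation, would need separate handling.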

**Note:** This happens rarely in my experiments, but I wanted to share it here in case anyone is considering future releases of Visual WebArena. [WebArena-Verified](https://github.com/ServiceNow/webarena-verified) recently resolved such problems by using type-aware normalization in evaluation.
