False negatives in evaluation due to tokenization #84

@cherry979988

Hi, I'd like to report a small bug in evaluation due to tokenization.

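The failure occurs when the answer contains "0" followed by trailing punctuation (e.g. `...score): 0. Method...`).
NLTK tokenizes "0." as a single token, so `"0" in word_tokenize(pred)` is False and the evaluator reports a failure even though the prediction states the correct answer.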

Below is a minimal code snippet to reproduce the issue.
The temporary fix I used is to strip trailing punctuation from each token with `rstrip`, i.e., `tok_pred = [t.rstrip('.,;:!?') for t in tok_pred]`.

from nltk.tokenize import word_tokenize

def clean_answer(answer: str) -> str:
    if answer.startswith("'") and answer.endswith("'"):
        answer = answer[1:-1]
    elif answer.startswith('"') and answer.endswith('"'):
        answer = answer[1:-1]
    return answer.lower()


def must_include(ref: str, pred: str) -> float:
    """Copy of StringEvaluator.must_include from evaluators.py (lines 166-176)."""
    clean_ref = clean_answer(ref)
    clean_pred = clean_answer(pred)
    if len(word_tokenize(clean_ref)) == 1:
        tok_pred = word_tokenize(clean_pred)
        # Workaround: strip trailing punctuation from pred tokens (uncomment to apply).
        # tok_pred = [t.rstrip('.,;:!?') for t in tok_pred]
        return float(clean_ref in tok_pred)
    return float(clean_ref in clean_pred)


PRED = (
    "Number of comments with more downvotes than upvotes (i.e. negative displayed score): 0. "
    "Method: opened WorcesterMA -> opened latest post -> opened author profile -> "
    "Comments tab -> loaded all comments -> extracted scores and counted negatives. "
    "Success: true"
)

if __name__ == "__main__":
    score = must_include("0", PRED)
    print("must_include('0', PRED) =", score)
    assert score == 1.0, "False negative: answer clearly states count is 0"
    print("PASSED")
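
To show the effect of the workaround in isolation, here is a minimal sketch that does not need NLTK installed. The token list is hand-written and only assumed to mirror the tokenization described above (with "0." kept as one token); `strip_trailing_punct` and `TRAILING_PUNCT` are hypothetical names, not part of evaluators.py. Unlike the one-liner above, it also drops tokens that become empty after stripping.

```python
# Punctuation characters stripped from the right end of each token.
TRAILING_PUNCT = ".,;:!?"


def strip_trailing_punct(tokens):
    """Strip trailing punctuation from each token and drop tokens left empty."""
    stripped = (t.rstrip(TRAILING_PUNCT) for t in tokens)
    return [t for t in stripped if t]


# Hand-written tokens, assumed to mirror the reported behaviour where
# "0." is kept as a single token. This is NOT real word_tokenize output.
tokens = ["displayed", "score", ")", ":", "0.", "method", ":", "opened"]

print("0" in tokens)                        # False: "0." != "0"
print("0" in strip_trailing_punct(tokens))  # True: "0." was reduced to "0"
```

Note that `rstrip` only handles trailing characters; a token with leading punctuation would still be missed.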

Note: This happens rarely in my experiments, but I wanted to share it here in case it is useful for future releases of Visual WebArena. WebArena-Verified recently resolved this class of problems by using type-aware normalization in evaluation.
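
For illustration only, a type-aware check could look roughly like the sketch below. This is not WebArena-Verified's actual implementation, which I have not reproduced here; `numeric_must_include` and `_NUMBER` are hypothetical names, and the regular expression is a simplifying assumption about what counts as a number. The idea is that a numeric reference is compared against the numbers found in the prediction rather than against tokens.

```python
import re

# A standalone integer or decimal, not glued to letters or other digits.
# Simplifying assumption: no thousands separators, no scientific notation.
_NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?!\w)")


def numeric_must_include(ref: str, pred: str) -> float:
    """Hypothetical type-aware variant of must_include.

    Numeric references are compared by value against every number found
    in the prediction; other references fall back to a substring check.
    """
    ref = ref.strip()
    if _NUMBER.fullmatch(ref) is None:
        return float(ref.lower() in pred.lower())
    target = float(ref)
    return float(any(float(m) == target for m in _NUMBER.findall(pred)))


pred = "negative displayed score): 0. Method: opened WorcesterMA. Success: true"
print(numeric_must_include("0", pred))
print(numeric_must_include("0", "There are 10 comments."))
```

Comparing by value also avoids counting "10" as containing "0", which a plain substring check would do.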
