Results of eval_n_turn do not match the paper #20

@hansjohn

I ran eval_n_turn.py to reproduce the single-turn SQL results with the handicap option:

python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo \
    --handicap \
    --verbose 

I used this script to compute the success rate:

import json

result_file_path = './logs/experiments/ic_sql_multiturn_gpt-3.5-turbo_1_turns.json'

# Per-difficulty counters, plus an 'all' bucket for the overall rate.
result = {key: {'success': 0, 'total': 0} for key in ['easy', 'medium', 'hard', 'extra', 'all']}

with open(result_file_path, 'r') as f:
    data = json.load(f)

# Count a task as a success only when summary.max_reward is exactly 1.0.
for entry in data.values():
    hardness = entry['hardness']
    if entry['summary']['max_reward'] == 1.0:
        result[hardness]['success'] += 1
        result['all']['success'] += 1
    result[hardness]['total'] += 1
    result['all']['total'] += 1

for key, counts in result.items():
    success, total = counts['success'], counts['total']
    # Guard against an empty bucket to avoid a ZeroDivisionError.
    rate = success / total if total else 0.0
    print(f"{key} Success rate: {success}/{total} ({rate:.2%})")

It gives this result:

easy Success rate: 202/248 (81.45%)
medium Success rate: 281/446 (63.00%)
hard Success rate: 75/174 (43.10%)
extra Success rate: 37/166 (22.29%)
all Success rate: 595/1034 (57.54%)
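
As a quick sanity check on the numbers above, here is a minimal sketch that re-adds the per-difficulty counts and confirms they match the "all" row. The counts are copied by hand from the output; the names `reported`, `reported_all`, and `aggregate` are only illustrative and are not part of the repository.

```python
# Counts copied from the output above: (successes, total) per difficulty.
reported = {
    "easy": (202, 248),
    "medium": (281, 446),
    "hard": (75, 174),
    "extra": (37, 166),
}
reported_all = (595, 1034)


def aggregate(counts):
    """Sum (success, total) pairs across difficulty levels."""
    success = sum(s for s, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return success, total


success, total = aggregate(reported)
# The per-difficulty rows should add up to the 'all' row.
assert (success, total) == reported_all
print(f"all: {success}/{total} ({success / total:.2%})")
```

The rows are internally consistent, so the counting script itself does not seem to be dropping or double-counting any examples.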

This is lower than the result reported in the paper.
Did I do something wrong?

I also ran eval_n_turn.py to reproduce the single-turn SQL results without the handicap option:

python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo

Here is the result:

easy Success rate: 41/248 (16.53%)
medium Success rate: 28/446 (6.28%)
hard Success rate: 3/174 (1.72%)
extra Success rate: 2/166 (1.20%)
all Success rate: 74/1034 (7.16%)
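
To make the difference between the two runs easier to see, here is a small sketch that computes the gap in percentage points per difficulty level. The counts are copied by hand from the two outputs in this issue; the names `with_handicap`, `without_handicap`, and `rate` are only illustrative.

```python
# (successes, total) per difficulty, copied from the two outputs in this issue.
with_handicap = {
    "easy": (202, 248), "medium": (281, 446), "hard": (75, 174),
    "extra": (37, 166), "all": (595, 1034),
}
without_handicap = {
    "easy": (41, 248), "medium": (28, 446), "hard": (3, 174),
    "extra": (2, 166), "all": (74, 1034),
}


def rate(pair):
    """Return the success rate for a (successes, total) pair."""
    success, total = pair
    return success / total


# Gap in percentage points: handicap run minus the run without handicap.
gaps = {
    key: 100 * (rate(with_handicap[key]) - rate(without_handicap[key]))
    for key in with_handicap
}

for key, gap in gaps.items():
    print(f"{key}: {rate(with_handicap[key]):.2%} vs "
          f"{rate(without_handicap[key]):.2%} (gap {gap:.2f} points)")
```

Both runs cover the same 1034 examples with the same split per difficulty, so the gap comes from the success counts rather than from a different dataset size.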

Did I do something wrong?
