I ran `eval_n_turn.py` to reproduce the single-turn handicap SQL results:
```shell
python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo \
    --handicap \
    --verbose
```
I use this script to compute the success rate:
```python
import json

result_file_path = './logs/experiments/ic_sql_multiturn_gpt-3.5-turbo_1_turns.json'
with open(result_file_path, 'r') as f:
    data = json.load(f)

result = {key: {'success': 0, 'total': 0} for key in ['easy', 'medium', 'hard', 'extra', 'all']}
for index in data.keys():
    if data[index]['summary']['max_reward'] == 1.0:
        result[data[index]['hardness']]['success'] += 1
        result['all']['success'] += 1
    result[data[index]['hardness']]['total'] += 1
    result['all']['total'] += 1

for key in result.keys():
    success = result[key]['success']
    total = result[key]['total']
    print(f"{key} Success rate: {success}/{total} ({success/total:.2%})")
```
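For reference, here is a minimal sanity check of the aggregation logic above on a mocked log entry. The `summary`/`max_reward`/`hardness` fields mirror what the script reads; the actual values are made up, not taken from a real run:

```python
# Mocked log in the shape the aggregation script expects (hypothetical data).
mock_log = {
    "0": {"summary": {"max_reward": 1.0}, "hardness": "easy"},   # counted as a success
    "1": {"summary": {"max_reward": 0.5}, "hardness": "hard"},   # partial reward: not a success
}

# Same bucketing logic as the script: per-hardness counts plus an 'all' bucket.
result = {key: {'success': 0, 'total': 0} for key in ['easy', 'medium', 'hard', 'extra', 'all']}
for index in mock_log:
    if mock_log[index]['summary']['max_reward'] == 1.0:
        result[mock_log[index]['hardness']]['success'] += 1
        result['all']['success'] += 1
    result[mock_log[index]['hardness']]['total'] += 1
    result['all']['total'] += 1

print(result['easy'])  # {'success': 1, 'total': 1}
print(result['all'])   # {'success': 1, 'total': 2}
```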
I get this result:

```
easy Success rate: 202/248 (81.45%)
medium Success rate: 281/446 (63.00%)
hard Success rate: 75/174 (43.10%)
extra Success rate: 37/166 (22.29%)
all Success rate: 595/1034 (57.54%)
```
This is lower than the result reported in the paper. Did I do something wrong?
I also ran `eval_n_turn.py` to reproduce the single-turn SQL results (same command, without `--handicap`):
```shell
python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo
```
The result is:

```
easy Success rate: 41/248 (16.53%)
medium Success rate: 28/446 (6.28%)
hard Success rate: 3/174 (1.72%)
extra Success rate: 2/166 (1.20%)
all Success rate: 74/1034 (7.16%)
```
Did I do something wrong?