LCB official scorer instead of skythoughts #130

slimfrkha · 2025-06-13T10:30:44Z

Issue

The current code cannot reproduce the official LCB (LiveCodeBench) scores for open-source models.
Example: qwen3-14b in thinking mode on LCB v5.

Reported in paper (temperature=0.6, top_p=0.95, top_k=20, max_new_tokens=32768): 63.5
Current code (temperature=0.6, top_p=0.95, max_new_tokens=32768, n_repeat=2): 56.15

The score trails the reported one by about 7 points (63.5 vs. 56.15). A gap this large cannot be explained by seeding or randomness of generations.

The prompt is not formatted the same way as in the LCB paper / official git repo.
The current evaluator (originally from novaSky/Skythoughts) differs from the one in the official LCB GitHub repo.

Apply the changes from the official LCB GitHub repo for prompt formatting and the evaluator.

Results after fix (temperature=0.6, top_p=0.95, max_new_tokens=32768, n_repeat=2): 61.00

This is more in line with the official results. The remaining difference (2.5 points) is small and is probably due to seeding or a different n_repeat value.
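
To make the size of the gaps concrete, here is a minimal Python sketch that computes the difference from the reported score before and after the fix, plus a rough standard error for an averaged pass@1 score. The helper names (`gap`, `pass1_std_error`) and the `n_problems` argument are hypothetical and not part of this repo. The standard error treats every generation as an independent Bernoulli trial, which is a simplifying assumption, so it is only a rough yardstick for judging whether a gap could come from sampling noise.

```python
import math


def gap(reported: float, measured: float) -> float:
    """Difference in points between a reported and a measured score."""
    return round(reported - measured, 2)


def pass1_std_error(score_pct: float, n_problems: int, n_repeat: int) -> float:
    """Rough binomial standard error (in points) of an averaged pass@1 score.

    Assumption: all n_problems * n_repeat generations are independent
    Bernoulli trials with the same success probability.
    """
    p = score_pct / 100.0
    n_samples = n_problems * n_repeat
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


# Scores quoted in this PR description.
gap_before = gap(63.5, 56.15)  # 7.35 points before the fix
gap_after = gap(63.5, 61.00)   # 2.5 points after the fix
```

To use it, call `pass1_std_error(61.00, n_problems, 2)` with the actual number of problems in the evaluated LCB v5 split and compare the result with `gap_after`.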

fix(livecodebench): replace nova sky skythoughts scorer with official LCB scorer

4d81d91