Skip to content

Commit e581153

Browse files
authored
Enhance math evaluation scoring by introducing weighted contributions… (#10)
* Enhance math evaluation scoring by introducing weighted contributions for accuracy (80%) and format compliance (20%) in the test_math_dataset function. * fix test
1 parent 368c44b commit e581153

File tree

2 files changed

+5
-9
lines changed

2 files changed

+5
-9
lines changed

tests/pytest/test_markdown_highlighting.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,7 @@ def markdown_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[Evalu
2626
dataset_adapter=markdown_dataset_to_evaluation_row,
2727
model=["accounts/fireworks/models/llama-v3p1-8b-instruct"],
2828
rollout_input_params=[{"temperature": 0.0, "max_tokens": 4096}],
29-
threshold_of_success=1.0,
29+
threshold_of_success=0.5,
3030
rollout_processor=default_single_turn_rollout_processor,
3131
num_runs=1,
3232
mode="pointwise",

tests/pytest/test_pytest_math_example.py

Lines changed: 4 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -23,8 +23,8 @@ def test_math_dataset(row: EvaluationRow, **kwargs) -> EvaluationRow:
2323
Evaluate math problem solving considering both accuracy and format.
2424
2525
This function demonstrates how to combine multiple evaluation criteria:
26-
- Numerical accuracy using built-in math evaluation
27-
- Format compliance checking for <think>...</think><answer>...</answer> structure
26+
- Numerical accuracy using built-in math evaluation (80% weight)
27+
- Format compliance checking for <think>...</think><answer>...</answer> structure (20% weight)
2828
2929
Args:
3030
row: EvaluationRow containing the conversation messages and ground truth
@@ -47,12 +47,8 @@ def test_math_dataset(row: EvaluationRow, **kwargs) -> EvaluationRow:
4747
format_correct = check_think_answer_format(assistant_response)
4848
format_score = 1.0 if format_correct else 0.0
4949

50-
# For math_example, accuracy takes priority - if accuracy is 0, overall score is 0
51-
# If accuracy is 1, then format can contribute to the score
52-
if accuracy_result.score == 0.0:
53-
combined_score = 0.0
54-
else:
55-
combined_score = accuracy_result.score # Only accuracy matters for math_example
50+
# Calculate combined score with 80% accuracy and 20% formatting weight
51+
combined_score = (0.8 * accuracy_result.score) + (0.2 * format_score)
5652

5753
# Create metrics structure expected by tests
5854
metrics = {

0 commit comments

Comments
 (0)