
@shailja-thakur

Summary

This PR adds the ability to test Mellea m-program robustness by integrating with BenchDrift's semantic variation generation and evaluation pipeline. Users can now systematically evaluate how consistently their m-programs answer semantically equivalent variations of a problem.

What This Enables

  • Generate semantic variations of a problem (different phrasings, same meaning)
  • Execute m-programs on all variations to measure consistency
  • Measure pass rates and drift patterns, and identify failure modes
  • Understand where m-programs break and where they perform well

Key Components

  • run_benchdrift_pipeline(): Orchestrates BenchDrift's 3-stage pipeline (generate variations → execute m-program → evaluate)
  • MelleaModelClientAdapter: Bridges Mellea m-programs to BenchDrift's test framework
  • analyze_robustness_from_probes(): Computes robustness metrics from test results
  • Configurable variation strategies (generic, cluster-based, persona-based, long-context); see the end-to-end sketch below
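
A rough end-to-end sketch of how these pieces could be wired together. The class and function names are the ones listed above; the import path, keyword argument names, and the example problem are assumptions for illustration only:

```python
# Hypothetical usage sketch. The import path and all keyword argument
# names below are assumptions; only the component names come from this PR.
from mellea.robustness import (  # assumed module path
    MelleaModelClientAdapter,
    run_benchdrift_pipeline,
    analyze_robustness_from_probes,
)

from my_project import my_m_program  # hypothetical m-program under test

# Bridge the Mellea m-program to BenchDrift's test framework.
adapter = MelleaModelClientAdapter(my_m_program)

# Orchestrate BenchDrift's 3-stage pipeline:
# generate variations -> execute m-program -> evaluate.
probes = run_benchdrift_pipeline(
    problem="A train covers 60 km in 45 minutes. What is its average speed in km/h?",
    model_client=adapter,
)

# Compute robustness metrics (pass rate, drift patterns, failure modes)
# from the per-variation test results.
report = analyze_robustness_from_probes(probes)
print(report)
```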

…ting

- Add variation_types parameter to run_benchdrift_pipeline() so users can customize which semantic variation types to generate (generic, cluster_variations, persona, long_context); see the usage sketch below
- Update test/1_test_robustness_testing.py to demonstrate variation_types usage
- Add docs/ROBUSTNESS_TESTING.md with comprehensive documentation for robustness testing workflow
- Enable fine-grained control over robustness testing configurations
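
A minimal sketch of passing the new parameter, assuming run_benchdrift_pipeline accepts a list of variation type names; the type strings mirror the ones listed above, and everything else is illustrative:

```python
# Hypothetical call restricting generation to two variation types.
# The variation_types keyword and its accepted strings follow this PR's
# description; the remaining argument names are assumptions.
probes = run_benchdrift_pipeline(
    problem="Summarize the key obligations in the attached contract.",
    model_client=adapter,
    variation_types=["persona", "long_context"],
)
```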

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <[email protected]>
