Hi LatentMAS authors,
I'm currently unable to replicate the values reported in the arXiv paper for Qwen3-4B. I've tried gsm8k and medqa with varying parameters, but in all my runs the baseline somehow outperforms LatentMAS in accuracy. Could you share the specific hyperparameters you used for Qwen3-4B on one example benchmark (e.g. number of latent steps, latent realignment enabled/disabled, maximum token generation)? A full config for a single benchmark would help greatly with reproducibility: right now I'm running medqa with the single-agent baseline, which the paper reports at 47% accuracy, yet I get ~66% locally — even higher than my LatentMAS run with 40 latent steps.
Sharing the full commands you ran to reproduce Qwen3-4B would be awesome.
Edit: I noticed that this is related to #29