Hi LatentMAS authors,
I'm currently unable to replicate the values reported in the arXiv paper for Qwen3-4B. I've tried gsm8k and medqa with varying parameters, but in all my runs the baseline somehow outperforms LatentMAS in accuracy. Could you share the specific hyperparameters you used for Qwen3-4B on one example benchmark (e.g. number of latent steps, latent realignment enabled/disabled, maximum token generation)? A full config for a single benchmark would help greatly with reproducibility: right now I'm running medqa with the single-agent baseline, which the paper reports at 47% accuracy, yet I get ~66% locally — even higher than my LatentMAS run with 40 latent steps.
Sharing the full commands you ran to reproduce Qwen3-4B would be awesome.
Edit: I noticed that this is related to #29