This repository contains the official code, data, and benchmarks for the paper Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration. We introduce a framework to train large language models to proactively ask clarifying questions when faced with flawed or underspecified prompts.
- 🚀 Installation
- 💾 Datasets
- 📊 Evaluation
- ⚙️ Training
- ✍️ Citation
## 🚀 Installation

```bash
# Install the package in editable mode
cd verl
pip install -e .
```
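After installing, a quick import check can confirm that `verl` is visible to your Python environment. This is only a minimal sanity-check sketch, not part of the repository's scripts:

```python
# Minimal sanity check: verify the editable install of verl is importable.
import verl

print("verl imported from:", verl.__file__)
```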
## 💾 Datasets

- **Training Data:** The datasets used for Supervised Fine-Tuning and Reinforcement Learning are available in the `train/data/` directory.
- **Evaluation Benchmarks:** Our new benchmarks, GSM-MC and GSM-MCE, can be found in the `eval/data/` directory.
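For a quick look at the training files, the Parquet datasets can be inspected with pandas from the repository root. This is only an illustrative sketch; nothing beyond the file paths above is assumed, and the actual column layout is whatever ships in the released files:

```python
# Inspect the SFT and RL training files in train/data/.
# Requires: pip install pandas pyarrow
import pandas as pd

sft = pd.read_parquet("train/data/sft_think.parquet")
rl = pd.read_parquet("train/data/rl.parquet")

print("SFT columns:", list(sft.columns), "| rows:", len(sft))
print("RL columns:", list(rl.columns), "| rows:", len(rl))
print(sft.iloc[0])  # peek at one example
```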
## 📊 Evaluation

To evaluate a model's performance on the GSM-MC and GSM-MCE benchmarks, follow these steps:
- **Navigate to the evaluation directory:**

  ```bash
  cd eval
  ```
- **Set Environment Variables:**
  Configure the user agent model, which will be used to answer the policy model's questions (a sketch of how these variables are used is shown after these steps).

  ```bash
  # set your own user agent
  export USER_AGENT_URL=
  export USER_AGENT_MODEL=
  export USER_AGENT_API_KEY=
  ```
- **Run the Evaluation Script:**
  Execute the `test.py` script with your desired arguments.

  ```bash
  # --think: optional, enables 'think' mode in the first round
  # --noise: optional, evaluates on the enhanced GSM-MCE benchmark
  python test.py \
      --model <path_to_your_model> \
      --think \
      --noise
  ```
**Note:** The current evaluation script is optimized for models compatible with the `vllm` backend. To use other backends, you may need to modify `test.py`.
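During evaluation, the user agent answers the policy model's clarifying questions through an OpenAI-compatible API configured via the `USER_AGENT_*` variables above. The sketch below is only an illustration of how such a call could look; it is not taken from `test.py`, and the system prompt and question text are hypothetical:

```python
# Illustrative only: query the user agent through an OpenAI-compatible endpoint
# using the USER_AGENT_* environment variables set above.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["USER_AGENT_URL"],
    api_key=os.environ.get("USER_AGENT_API_KEY", "EMPTY"),
)

# Hypothetical clarifying question raised by the policy model.
question = "The problem does not say how many apples Tom started with. How many were there?"

response = client.chat.completions.create(
    model=os.environ["USER_AGENT_MODEL"],
    messages=[
        {"role": "system", "content": "You are a user who knows the missing details of the problem."},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```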
## ⚙️ Training

We also provide our scripts for both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The SFT script is shown first, followed by the RL (GRPO) script.
```bash
cd train
name=qwen3-1.7b-sft-think
save_path=trained_models/$name
model=Qwen/Qwen3-1.7B
train_file=data/sft_think.parquet
val_file=$train_file
epoch=1
export CUDA_VISIBLE_DEVICES=0,1
ulysses_sequence_parallel_size=2
nproc_per_node=2
torchrun --master-port=29600 --nnodes=1 --nproc_per_node=$nproc_per_node \
-m verl.trainer.fsdp_sft_trainer \
data.train_files=$train_file \
data.val_files=$val_file \
data.multiturn.enable=true \
data.multiturn.messages_key=messages \
data.micro_batch_size_per_gpu=4 \
data.train_batch_size=64 \
data.max_length=4096 \
optim.lr=5e-6 \
optim.lr_scheduler=cosine \
model.partial_pretrain=$model \
trainer.default_local_dir=$save_path \
trainer.project_name=sft-l2a \
trainer.experiment_name=$name \
trainer.logger=['console'] \
trainer.total_epochs=$epoch \
ulysses_sequence_parallel_size=$ulysses_sequence_parallel_size \
use_remove_padding=true
```
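The SFT trainer runs in multi-turn mode (`data.multiturn.enable=true`) and reads conversations from the column named by `data.multiturn.messages_key=messages`. The sketch below shows what such a row could look like; the message contents are hypothetical and the released `sft_think.parquet` may contain additional fields:

```python
# Illustrative multi-turn SFT row with a "messages" column, matching
# data.multiturn.messages_key=messages. Conversation content is hypothetical.
import pandas as pd

example = {
    "messages": [
        {"role": "user", "content": "Tom bought some apples and ate 3. How many are left?"},
        {"role": "assistant", "content": "How many apples did Tom buy in the first place?"},
        {"role": "user", "content": "He bought 10 apples."},
        {"role": "assistant", "content": "Then 10 - 3 = 7 apples are left. The answer is 7."},
    ]
}

df = pd.DataFrame([example])
df.to_parquet("sft_example.parquet")
print(df.iloc[0]["messages"])
```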
For Reinforcement Learning (GRPO):

```bash
cd train
export HYDRA_FULL_ERROR=1
export VLLM_ATTENTION_BACKEND=XFORMERS
# OpenAI-compatible API, such as a vLLM server
# Here, we use Qwen3-14B as the user agent
export USER_AGENT_NAME=
export USER_AGENT_URL=
# set to True if you want to enable `think` mode [Policy model, not user agent]
export ENABLE_THINKING=False
math_train_path=data/rl.parquet
math_test_path=data/rl.parquet
model_path=Qwen/Qwen3-1.7B
name=qwen3-1.7b-rl-nothink
train_files="['$math_train_path']"
test_files="['$math_test_path']"
reward_fn_path=reward_qwen.py  # for Llama models, use reward_llama.py
n_gpus_per_node=2
CUDA_VISIBLE_DEVICES=1,2 python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$train_files \
data.val_files=$test_files \
data.train_batch_size=256 \
data.max_prompt_length=1024 \
data.max_response_length=3072 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=$model_path \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.rollout.learn2ask=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
custom_reward_function.path=$reward_fn_path \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='learn2ask' \
trainer.experiment_name=$name \
trainer.n_gpus_per_node=$n_gpus_per_node \
trainer.nnodes=1 \
trainer.rollout_data_dir=./outputs/$name/rollout_data \
trainer.save_freq=1000000 \
trainer.test_freq=1000000 \
trainer.total_epochs=20
```
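The reward is supplied through `custom_reward_function.path` (`reward_qwen.py`, or `reward_llama.py` for Llama models). The sketch below only illustrates the general shape of a verl custom reward function; the function name and signature assumed here (`compute_score(data_source, solution_str, ground_truth, extra_info)`) follow verl's default convention and may not match the released reward files, whose actual scoring logic is model-specific:

```python
# Hypothetical sketch of a verl-style custom reward function.
# verl loads `compute_score` (by default) from the file given by
# custom_reward_function.path; the real reward_qwen.py may differ.
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return a scalar reward for one rollout."""
    # Naive example: reward 1.0 if the last number in the response
    # matches the ground-truth answer, else 0.0.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_str)
    if numbers and numbers[-1] == str(ground_truth).strip():
        return 1.0
    return 0.0
```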
## ✍️ Citation

```bibtex
@misc{wang2025passivecriticalthinkingfostering,
      title={Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration},
author={Ante Wang and Yujie Lin and Jingyao Liu and Suhang Wu and Hao Liu and Xinyan Xiao and Jinsong Su},
year={2025},
eprint={2507.23407},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.23407},
}
```