Describe the bug
Step 1: export the data to a cached_dataset.
Step 2: train with the cached_dataset from step 1, selecting qwen3-reranker-4b as the model.
Your hardware and system info
ms-swift==3.11.0
Additional context
Scripts used:
Step 1:
swift export \
--model /mnt/modelhub/Qwen3-Reranker-4B \
--dataset /mnt/reranker_dev_seg4_fix.jsonl \
--val_dataset /mnt/reranker_dev_seg4_fix.jsonl \
--dataset_num_proc 12 \
--to_cached_dataset true \
--output_dir /mnt/cached_dataset/v1
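To sanity-check the export output before training, the cached dataset can be inspected with a short snippet like the one below. This is a minimal sketch: it assumes the cached dataset is stored in Hugging Face datasets on-disk format (which is what load_from_disk in the traceback further down suggests) and reuses the path produced in step 1.

# Minimal sketch: inspect the 'length' column of the exported cached dataset.
# Assumes Hugging Face `datasets` on-disk format; swift reads it back with
# datasets.load_from_disk, per the traceback below.
from datasets import load_from_disk

ds = load_from_disk('/mnt/cached_dataset/v1/train')  # path written by step 1
print(ds.column_names)
print(type(ds[0]['length']), ds[0]['length'])
# The TypeError below indicates that for this generative reranker export the
# 'length' entries come back as lists rather than single ints.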
Step 2:
node_rank=0
nnodes=1
nproc_per_node=2
export LISTWISE_RERANKER_MIN_GROUP_SIZE=2
export LISTWISE_RERANKER_TEMPERATURE=0.1
export MAX_NEGATIVE_SAMPLES=3
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
OUTPUT_DIR="/mnt/cached_dataset_demo"
CUDA_VISIBLE_DEVICES=0,1 \
NNODES=$nnodes \
NODE_RANK=$node_rank \
MASTER_ADDR=127.0.0.1 \
MASTER_PORT=29520 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
--model /mnt/modelhub/Qwen3-Reranker-4B \
--task_type generative_reranker \
--loss_type listwise_generative_reranker \
--train_type full \
--padding_side left \
--padding_free true \
--attn_impl flash_attention_2 \
--cached_dataset '/mnt/cached_dataset/v1/train' \
--cached_val_dataset '/mnt/cached_dataset/v1/val' \
--eval_strategy steps \
--output_dir $OUTPUT_DIR \
--eval_steps 10 \
--num_train_epochs 1 \
--save_steps 500 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 5e-6 \
--max_grad_norm 0.5 \
--label_names labels \
--max_length 24000 \
--truncation_strategy delete \
--dataloader_drop_last true \
--report_to tensorboard \
--deepspeed zero2_offload \
--warmup_ratio 0.001 \
--logging_steps 1 \
--eval_on_start true \
--logging_dir $OUTPUT_DIR/tensorboard
Error message:
[INFO:swift] max_length: 24000
[INFO:swift] response_prefix: '\n\n\n\n'
[INFO:swift] agent_template: hermes
[INFO:swift] Start time of running main: 2025-12-10 11:08:22.320906
[INFO:swift] swift.version: 3.11.0
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:24<00:00, 42.05s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py", line 20, in
[rank0]: sft_main()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 364, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 183, in run
[rank0]: train_dataset, val_dataset = self._prepare_dataset()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 125, in _prepare_dataset
[rank0]: train_datasets, val_datasets = get_cached_dataset(self.args)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 170, in get_cached_dataset
[rank0]: train_datasets.append(_select_dataset(load_from_disk(train_path), args.max_length))
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in _select_dataset
[rank0]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in
[rank0]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank0]: TypeError: '<=' not supported between instances of 'list' and 'int'
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py", line 20, in
[rank1]: sft_main()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 364, in sft_main
[rank1]: return SwiftSft(args).main()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank1]: result = self.run()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank1]: return func(self, *args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 183, in run
[rank1]: train_dataset, val_dataset = self._prepare_dataset()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank1]: return func(self, *args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 125, in _prepare_dataset
[rank1]: train_datasets, val_datasets = get_cached_dataset(self.args)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 170, in get_cached_dataset
[rank1]: train_datasets.append(_select_dataset(load_from_disk(train_path), args.max_length))
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in _select_dataset
[rank1]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in
[rank1]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank1]: TypeError: '<=' not supported between instances of 'list' and 'int'
[rank0]:[W1210 11:08:23.398760675 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1210 11:08:24.923000 72257 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 72330 closing signal SIGTERM
E1210 11:08:25.187000 72257 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 72329) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 896, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
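The failure comes from the length filter in swift/llm/infer/utils.py (_select_dataset), which compares every entry of the cached dataset's 'length' column against max_length with the expression length <= max_length. For this generative-reranker cached dataset each 'length' entry is a list rather than an int, so the comparison raises the TypeError above. Below is a minimal sketch that reproduces the comparison and shows one possible filtering rule (keeping a sample only if its largest grouped length fits within max_length); both the rule and the example values are assumptions for illustration, not the library's fix.

# Minimal sketch of the failing filter in _select_dataset and one possible rule.
# The values in `lengths` are hypothetical; in the real cached dataset each
# entry of dataset['length'] is a list for this task (per the traceback).
max_length = 24000
lengths = [[1200, 900, 3100], [500, 450], [26000, 800]]

# What _select_dataset does today: comparing a list to an int -> TypeError.
try:
    idxs = [i for i, length in enumerate(lengths) if length <= max_length]
except TypeError as e:
    print('reproduces the error:', e)

# One possible rule (an assumption, not the official fix): keep a sample only
# if every grouped length fits within max_length.
idxs = [
    i for i, length in enumerate(lengths)
    if (max(length) if isinstance(length, list) else length) <= max_length
]
print(idxs)  # -> [0, 1]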