
Reading a cached_dataset throws TypeError: '<=' not supported between instances of 'list' and 'int' #6981

@phoenixbai

Description


Describe the bug
Step 1: export the data to a cached_dataset.
Step 2: train with the cached_dataset from the previous step, using the qwen3-reranker-4b model.

Your hardware and system info
ms-swift==3.11.0

Additional context
Commands executed:
Step 1:

swift export \
    --model /mnt/modelhub/Qwen3-Reranker-4B \
    --dataset /mnt/reranker_dev_seg4_fix.jsonl \
    --val_dataset /mnt/reranker_dev_seg4_fix.jsonl \
    --dataset_num_proc 12 \
    --to_cached_dataset true \
    --output_dir /mnt/cached_dataset/v1
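For reference, loading the exported cached dataset directly shows what the traceback below trips over: the `length` column appears to hold a list per row rather than a single int. A minimal inspection sketch (plain `datasets` API, path taken from the export command above):

from datasets import load_from_disk

ds = load_from_disk('/mnt/cached_dataset/v1/train')
print(ds.column_names)        # the cached dataset includes a 'length' column
print(type(ds[0]['length']))  # here: <class 'list'>, not int
print(ds[0]['length'])        # per-row lengths stored as a list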

Step 2:

node_rank=0
nnodes=1
nproc_per_node=2
export LISTWISE_RERANKER_MIN_GROUP_SIZE=2
export LISTWISE_RERANKER_TEMPERATURE=0.1
export MAX_NEGATIVE_SAMPLES=3
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
OUTPUT_DIR="/mnt/cached_dataset_demo"
CUDA_VISIBLE_DEVICES=0,1 \
NNODES=$nnodes \
NODE_RANK=$node_rank \
MASTER_ADDR=127.0.0.1 \
MASTER_PORT=29520 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model /mnt/modelhub/Qwen3-Reranker-4B \
    --task_type generative_reranker \
    --loss_type listwise_generative_reranker \
    --train_type full \
    --padding_side left \
    --padding_free true \
    --attn_impl flash_attention_2 \
    --cached_dataset '/mnt/cached_dataset/v1/train' \
    --cached_val_dataset '/mnt/cached_dataset/v1/val' \
    --eval_strategy steps \
    --output_dir $OUTPUT_DIR \
    --eval_steps 10 \
    --num_train_epochs 1 \
    --save_steps 500 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-6 \
    --max_grad_norm 0.5 \
    --label_names labels \
    --max_length 24000 \
    --truncation_strategy delete \
    --dataloader_drop_last true \
    --report_to tensorboard \
    --deepspeed zero2_offload \
    --warmup_ratio 0.001 \
    --logging_steps 1 \
    --eval_on_start true \
    --logging_dir $OUTPUT_DIR/tensorboard 

Error message:

[INFO:swift] max_length: 24000
[INFO:swift] response_prefix: '<think>\n\n</think>\n\n'
[INFO:swift] agent_template: hermes
[INFO:swift] Start time of running main: 2025-12-10 11:08:22.320906
[INFO:swift] swift.version: 3.11.0
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:24<00:00, 42.05s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py", line 20, in
[rank0]: sft_main()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 364, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 183, in run
[rank0]: train_dataset, val_dataset = self._prepare_dataset()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 125, in _prepare_dataset
[rank0]: train_datasets, val_datasets = get_cached_dataset(self.args)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 170, in get_cached_dataset
[rank0]: train_datasets.append(_select_dataset(load_from_disk(train_path), args.max_length))
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in _select_dataset
[rank0]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank0]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in
[rank0]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank0]: TypeError: '<=' not supported between instances of 'list' and 'int'
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py", line 20, in
[rank1]: sft_main()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 364, in sft_main
[rank1]: return SwiftSft(args).main()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank1]: result = self.run()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank1]: return func(self, *args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 183, in run
[rank1]: train_dataset, val_dataset = self._prepare_dataset()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank1]: return func(self, *args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/train/sft.py", line 125, in _prepare_dataset
[rank1]: train_datasets, val_datasets = get_cached_dataset(self.args)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 170, in get_cached_dataset
[rank1]: train_datasets.append(_select_dataset(load_from_disk(train_path), args.max_length))
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in _select_dataset
[rank1]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank1]: File "/opt/conda/lib/python3.10/site-packages/swift/llm/infer/utils.py", line 160, in
[rank1]: idxs = [i for i, length in enumerate(dataset['length']) if length <= max_length]
[rank1]: TypeError: '<=' not supported between instances of 'list' and 'int'
[rank0]:[W1210 11:08:23.398760675 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1210 11:08:24.923000 72257 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 72330 closing signal SIGTERM
E1210 11:08:25.187000 72257 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 72329) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 896, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
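
Possible cause and a local workaround (a sketch only, not an official fix): `_select_dataset` in swift/llm/infer/utils.py compares each entry of `dataset['length']` against `max_length` as an int, but in this cached dataset every entry is a list, so the `<=` comparison raises the TypeError above. Assuming the list holds per-sample token lengths for a row, a defensive filter could reduce the list to its maximum before comparing:

from datasets import load_from_disk

def select_by_length(dataset, max_length):
    # hypothetical helper mirroring _select_dataset, not part of ms-swift
    def row_length(length):
        # the cached dataset here stores a list of lengths per row
        if isinstance(length, (list, tuple)):
            return max(length) if length else 0
        return length
    idxs = [i for i, length in enumerate(dataset['length']) if row_length(length) <= max_length]
    return dataset.select(idxs)

train_ds = select_by_length(load_from_disk('/mnt/cached_dataset/v1/train'), 24000)

Whether the proper fix is to write a scalar length column at export time or to handle list-valued lengths in get_cached_dataset is up to the maintainers.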
