Description
Hi, I want to train a LoRA on 4×A5500 (24 GB) GPUs but hit a CUDA out-of-memory error during accelerator.prepare, when DeepSpeed moves the model onto each GPU.
Command
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"
accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
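If I read the traceback correctly, the OOM is raised inside accelerator.prepare while DeepSpeed moves the full model onto each GPU (self.module.to(self.device)), before any training batch is processed. With ZeRO stage 2 the parameters are not sharded, and flux-dev is roughly 12B parameters, so the bf16 weights alone are about 12e9 × 2 bytes ≈ 24 GB per card, which already matches the 23.67 GiB reported for each A5500 before gradients, optimizer state, or activations are allocated. A variant of the accelerate config I am considering (untested; only the deepspeed_config block changes) is ZeRO stage 3 with CPU offload, so the parameters are sharded across the four GPUs instead of replicated:

deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3

(all other keys unchanged)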
train_configs/test_lora-slot.yaml
model_name: "flux-dev"
data_config:
  train_batch_size: 1
  num_workers: 4
  img_size: 512
  img_dir: images-slot/
  random_ratio: true # support multi crop preprocessing
report_to: wandb
train_batch_size: 1
output_dir: lora/
max_train_steps: 100000
learning_rate: 1e-5
lr_scheduler: constant
lr_warmup_steps: 10
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1e-8
max_grad_norm: 1.0
logging_dir: logs
mixed_precision: "bf16"
checkpointing_steps: 2500
checkpoints_total_limit: 10
tracker_project_name: lora_test
resume_from_checkpoint: latest
gradient_accumulation_steps: 2
rank: 16
single_blocks: "1,2,3,4"
double_blocks: null
#disable_sampling: false
#sample_every: 250 # sample every this many steps
#sample_width: 1024
#sample_height: 1024
#sample_steps: 20
Part of the error message
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 355, in <module>
[rank2]: main()
[rank2]: File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 178, in main
[rank2]: dit, optimizer, _, lr_scheduler = accelerator.prepare(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1284, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1103, in _configure_distributed_model
[rank2]: self.module.to(self.device)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank2]: return self._apply(convert)
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank2]: param_applied = fn(param)
[rank2]: ^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank2]: return t.to(
[rank2]: ^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 126.00 MiB. GPU 2 has a total capacity of 23.67 GiB of which 17.25 MiB is free. Including non-PyTorch memory, this process has 23.65 GiB memory in use. Of the allocated memory 23.39 GiB is allocated by PyTorch, and 13.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0120 14:21:48.179000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1575915 closing signal SIGTERM
E0120 14:21:49.497000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1575912) of binary: /home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/python
Traceback (most recent call last):
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
deepspeed_launcher(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
distrib_run.run(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_flux_lora_deepspeed.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1575913)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1575914)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1575912)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
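For completeness: the allocator message suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, e.g.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"

but since only 13.26 MiB is reserved-but-unallocated, fragmentation does not look like the real problem here, so I would not expect this alone to avoid the OOM.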