Description
Hi, I want to train a LoRA on 4×A5500 (24 GB) GPUs but hit a CUDA out-of-memory error during accelerator.prepare, when DeepSpeed moves the model onto each GPU.
Command
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"
accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
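If I read the traceback correctly, the OOM is raised inside accelerator.prepare while DeepSpeed moves the full model onto each GPU (self.module.to(self.device)), before any training batch is processed. With ZeRO stage 2 the parameters are not sharded, and flux-dev is roughly 12B parameters, so the bf16 weights alone are about 12e9 × 2 bytes ≈ 24 GB per card, which already matches the 23.67 GiB reported for each A5500 before gradients, optimizer state, or activations are allocated. A variant of the accelerate config I am considering (untested; only the deepspeed_config block changes) is ZeRO stage 3 with CPU offload, so the parameters are sharded across the four GPUs instead of replicated:

deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3

(all other keys unchanged)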
train_configs/test_lora-slot.yaml
model_name: "flux-dev"
data_config:
  train_batch_size: 1
  num_workers: 4
  img_size: 512
  img_dir: images-slot/
  random_ratio: true # support multi crop preprocessing
report_to: wandb
train_batch_size: 1
output_dir: lora/
max_train_steps: 100000
learning_rate: 1e-5
lr_scheduler: constant
lr_warmup_steps: 10
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1e-8
max_grad_norm: 1.0
logging_dir: logs
mixed_precision: "bf16"
checkpointing_steps: 2500
checkpoints_total_limit: 10
tracker_project_name: lora_test
resume_from_checkpoint: latest
gradient_accumulation_steps: 2
rank: 16
single_blocks: "1,2,3,4"
double_blocks: null
#disable_sampling: false
#sample_every: 250 # sample every this many steps
#sample_width: 1024
#sample_height: 1024
#sample_steps: 20
Part of the error message
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 355, in <module>
[rank2]: main()
[rank2]: File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 178, in main
[rank2]: dit, optimizer, _, lr_scheduler = accelerator.prepare(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1284, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1103, in _configure_distributed_model
[rank2]: self.module.to(self.device)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank2]: return self._apply(convert)
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank2]: param_applied = fn(param)
[rank2]: ^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank2]: return t.to(
[rank2]: ^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 126.00 MiB. GPU 2 has a total capacity of 23.67 GiB of which 17.25 MiB is free. Including non-PyTorch memory, this process has 23.65 GiB memory in use. Of the allocated memory 23.39 GiB is allocated by PyTorch, and 13.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0120 14:21:48.179000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1575915 closing signal SIGTERM
E0120 14:21:49.497000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1575912) of binary: /home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/python
Traceback (most recent call last):
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
deepspeed_launcher(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
distrib_run.run(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_flux_lora_deepspeed.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1575913)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1575914)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1575912)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
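For completeness: the allocator message suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, e.g.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"

but since only 13.26 MiB is reserved-but-unallocated, fragmentation does not look like the real problem here, so I would not expect this alone to avoid the OOM.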