@Qubitium Qubitium commented Sep 26, 2025

What does this PR do?

Fix the disk_offload() API causing torch.cuda.empty_cache() to be called when the module's origin device is cpu and it is offloaded to meta (disk).

Secondarily, this also resolves a performance issue: torch.cuda.empty_cache() is slow, and calling it to no effect in a forwarding env where modules are dynamically (manually) offloaded is suboptimal.
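
For context, the guard this PR is after looks roughly like the following. This is a minimal sketch, not the actual accelerate diff; `maybe_clear_device_cache` and `old_device` are illustrative names: the idea is simply to skip the accelerator cache clear when the tensor being moved never lived on an accelerator.

    import torch

    def maybe_clear_device_cache(old_device: torch.device) -> None:
        # Illustrative helper, not the real accelerate code path:
        # moving a cpu (or meta) tensor to "meta" frees nothing on the GPU,
        # so calling torch.cuda.empty_cache() is pure overhead and needlessly
        # touches CUDA state the caller may never have initialized.
        if old_device.type in ("cpu", "meta"):
            return
        if torch.cuda.is_available():
            torch.cuda.empty_cache()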

# nn.Module (Linear)
model: TritonV2QuantLinear  (P=0 B=2.25M) [cpu | mixed[int32, float16] | ~8.32MB]
      buffer: g_idx  shape=(2048,) dtype=int32 device=cpu ~8.00KB
      buffer: scales  shape=(16, 8192) dtype=float16 device=cpu ~256.00KB
      buffer: qweight  shape=(256, 8192) dtype=int32 device=cpu ~8.00MB
      buffer: qzeros  shape=(16, 1024) dtype=int32 device=cpu ~64.00KB

Given the above nn.Module (linear), which is on cpu, and the following call, I did not expect accelerate code paths to call anything cuda related. This (indirectly) triggered a cuda assert error in my GIL=0 env with multiple gpus and threads. I probably have a thread-context bug somewhere above this code, but the main point is that in this scenario, torch.cuda.empty_cache() should never be called by the disk_offload paths.

    _ = disk_offload(
        module, # <--- see above ascii print of module
        offload_dir=f"{disk_path}/{name}",
        offload_buffers=True,  # needed for buffers
        execution_device=torch.device("cpu"),
    )
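
A rough way to verify the cpu-only case is to stub out torch.cuda.empty_cache and check it is never reached. This is a sketch under assumptions: a plain nn.Linear stands in for the quantized module above, and offload_dir is a throwaway path.

    from unittest.mock import patch

    import torch
    from accelerate import disk_offload

    # Stand-in for the cpu-resident quantized module in the report above.
    module = torch.nn.Linear(8, 8)

    # Replace torch.cuda.empty_cache with a mock so we can observe whether
    # disk_offload() touches the CUDA allocator for a cpu-only module.
    with patch("torch.cuda.empty_cache") as spy:
        disk_offload(
            module,
            offload_dir="/tmp/offload_check",  # throwaway path
            offload_buffers=True,
            execution_device=torch.device("cpu"),
        )
    assert not spy.called, "disk_offload() cleared the CUDA cache for a cpu-only module"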

Stacktrace: note that the crash itself is not caused by accelerate; the stack only shows the path that triggered the invalid torch.cuda.empty_cache() call.

Traceback (most recent call last):
 File "/root/GPTQModel/gptqmodel/utils/threads.py", line 33, in _runner
   return fn()
 File "/root/GPTQModel/gptqmodel/looper/module_looper.py", line 553, in finalize_module
   offload_to_disk(
   ~~~~~~~~~~~~~~~^
       model=self.gptq_model.model,
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       module=self.gptq_model.model.get_submodule(module.full_name),
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       disk_path=self.gptq_model.quantize_config.offload_to_disk_path,
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   )
   ^
 File "/root/GPTQModel/gptqmodel/utils/offload.py", line 80, in offload_to_disk
   _offload_disk(module=module, name=full_name, disk_path=disk_path)
   ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/root/GPTQModel/gptqmodel/utils/offload.py", line 108, in _offload_disk
   _ = disk_offload(
       module,
   ...<3 lines>...
       execution_device=m_device,
   )
 File "/root/accelerate/src/accelerate/big_modeling.py", line 297, in disk_offload
   attach_align_device_hook(
   ~~~~~~~~~~~~~~~~~~~~~~~~^
       model,
       ^^^^^^
   ...<4 lines>...
       preload_module_classes=preload_module_classes,
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   )
   ^
 File "/root/accelerate/src/accelerate/hooks.py", line 521, in attach_align_device_hook
   add_hook_to_module(module, hook, append=True)
   ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/root/accelerate/src/accelerate/hooks.py", line 166, in add_hook_to_module
   module = hook.init_hook(module)
 File "/root/accelerate/src/accelerate/hooks.py", line 111, in init_hook
   module = hook.init_hook(module)
 File "/root/accelerate/src/accelerate/hooks.py", line 313, in init_hook
   set_module_tensor_to_device(module, name, "meta")
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
 File "/root/accelerate/src/accelerate/utils/modeling.py", line 408, in set_module_tensor_to_device
   clear_device_cache()
   ~~~~~~~~~~~~~~~~~~^^
 File "/root/accelerate/src/accelerate/utils/memory.py", line 65, in clear_device_cache
   torch.cuda.empty_cache()
   ~~~~~~~~~~~~~~~~~~~~~~^^
 File "/root/vm313t/lib/python3.13t/site-packages/torch/cuda/memory.py", line 224, in empty_cache
   torch._C._cuda_emptyCache()
   ~~~~~~~~~~~~~~~~~~~~~~~~~^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Who can review?

@SunMarc @zach-huggingface @BenjaminBossan

@SunMarc SunMarc left a comment


Thanks a lot and nice report btw!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
