Is it possible to run this conversion on an RTX 3090 (24 GB)? I'm getting a CUDA out-of-memory (OOM) error.
Command line: convert_to_quant -i gemma_3_12B_it_heretic_fp8_e4m3fn.safetensors --int8 --scaling_mode tensor --block_size 128 --comfy_quant
Target format: INT8 (block-wise quantization)
INT8 Range: [-127, 127]
Comfy quantization mode enabled: Using comfy_quant layer names and settings.
Loading 629 tensors from source file...
Loading tensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 629/629 [00:00<00:00, 54414.18it/s]
[MINIMAL] Scanning model and generating simulated calibration data...
Simulated calibration data generated.
Found 337 weight tensors to potentially process.
(1/337) Processing (INT8): model.embed_tokens.weight
[pinned_transfer] Pinned memory active - faster GPU transfers enabled
- Tensor shape: [262208, 3840], Max rank: 3840. Using k=768 components.
- Trying svd_lowrank
Traceback (most recent call last):
  File "/home/comfy-env/bin/convert_to_quant", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/cli/main.py", line 1204, in main
    convert_to_fp8_scaled(
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/utils/logging.py", line 82, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/formats/fp8_conversion.py", line 490, in convert_to_fp8_scaled
    q_tensor, dequant_s, dequant_w, extra_tensors = converter.convert(original_tensor, key=key, depth=depth)
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/converters/learned_rounding.py", line 759, in convert
    qdata, scale, dequantized = self._convert_int8_tensorwise(W_float32)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/converters/learned_rounding.py", line 854, in _convert_int8_tensorwise
    qdata, scale = self._optimize_int8_tensorwise_learned_rounding(W_float32, qdata, scale)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/converters/learned_rounding.py", line 898, in _optimize_int8_tensorwise_learned_rounding
    final_tensor_scaled = self._optimize_original(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/comfy-env/lib/python3.12/site-packages/convert_to_quant/converters/learned_rounding.py", line 570, in _optimize_original
    error = current_dq - W_float32
            ~~~~~~~~~~~^~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.75 GiB. GPU 0 has a total capacity of 23.55 GiB of which 1.83 GiB is free. Including non-PyTorch memory, this process has 21.51 GiB memory in use. Of the allocated memory 21.00 GiB is allocated by PyTorch, and 171.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
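For reference, the 3.75 GiB that failed to allocate matches the size of one float32 copy of model.embed_tokens.weight. The check below uses only numbers printed in the log; the assumption that the failing buffer is a dense float32 tensor of that shape comes from the W_float32 name in the traceback, not from the tool's documentation.

```python
# Sanity-check the OOM numbers from the log for model.embed_tokens.weight.
# Assumption: the working copy (W_float32) is a dense float32 tensor of this shape.

ROWS, COLS = 262208, 3840   # "Tensor shape: [262208, 3840]"
GIB = 1024 ** 3

requested_gib = 3.75        # "Tried to allocate 3.75 GiB"
free_gib = 1.83             # "... of which 1.83 GiB is free"
allocated_gib = 21.00       # "21.00 GiB is allocated by PyTorch"

elements = ROWS * COLS
fp32_gib = elements * 4 / GIB   # float32 uses 4 bytes per element

print(f"one float32 copy:      {fp32_gib:.2f} GiB")
print(f"shortfall vs. free:    {requested_gib - free_gib:.2f} GiB")
print(f"allocated / fp32 copy: {allocated_gib / fp32_gib:.1f}")
```

The last ratio only says that the 21.00 GiB already allocated is roughly as much as five or six tensors of this size; the log does not show which buffers actually make up that total.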