From 3698f8391d57fbdb34470d1bbadb6f5b27c1e69c Mon Sep 17 00:00:00 2001
From: DerekLiu35
Date: Tue, 10 Jun 2025 04:07:56 +0200
Subject: [PATCH 1/4] add bnb + torch.compile

---
 diffusers-quantization.md | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/diffusers-quantization.md b/diffusers-quantization.md
index 44be461d19..ebd7cfbe07 100644
--- a/diffusers-quantization.md
+++ b/diffusers-quantization.md
@@ -450,7 +450,7 @@ For more information check out the [Layerwise casting docs](https://huggingface.
 Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. Let's explore CPU offloading, group offloading, and `torch.compile`. You can learn more about these techniques in the [Diffusers documentation](https://huggingface.co/docs/diffusers/main/en/optimization/memory).
 
-> **Note:** At the time of writing, bnb + `torch.compile` also works if bnb is installed from source and using pytorch nightly or with fullgraph=False.
+> **Note:** At the time of writing, bnb + `torch.compile` works if bnb is installed from source and a PyTorch nightly is used, or with `fullgraph=False`.
 
 Example (Flux-dev with BnB 4-bit + enable_model_cpu_offload):
@@ -556,6 +556,33 @@ pipe = FluxPipeline.from_pretrained(
 
 > **Note:** `torch.compile` can introduce subtle numerical differences, leading to changes in image output
+
+Example (Flux-dev with BnB 4-bit + torch.compile):
+
+```diff
+ import torch
+ from diffusers import FluxPipeline
+ from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers.quantizers import PipelineQuantizationConfig
+ from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+
+ model_id = "black-forest-labs/FLUX.1-dev"
+ pipeline_quant_config = PipelineQuantizationConfig(
+     quant_mapping={
+         "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+         "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+     }
+ )
+ pipe = FluxPipeline.from_pretrained(
+     model_id,
+     quantization_config=pipeline_quant_config,
+     torch_dtype=torch.bfloat16
+ ).to("cuda")
++
++pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
+```
+
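+The first call after `torch.compile` pays the compilation cost (tracing and, with `mode="max-autotune"`, kernel autotuning), so measure a later call. A minimal usage sketch for the pipeline above (the prompt, step count, and guidance scale here are illustrative, not the benchmark settings):
+
+```python
+import time
+
+prompt = "A cat holding a sign that says hello world"  # illustrative prompt
+
+# Warmup: the first call triggers compilation and is much slower.
+_ = pipe(prompt, num_inference_steps=28, guidance_scale=3.5)
+
+# Time a subsequent call, which reuses the compiled graph.
+torch.cuda.synchronize()
+start = time.perf_counter()
+image = pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
+torch.cuda.synchronize()
+print(f"Inference time: {time.perf_counter() - start:.3f} s")
+image.save("flux_bnb_4bit_compiled.png")
+```
+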
+
 
 **torch.compile**: Another complementary approach is to accelerate the execution of your model with PyTorch 2.x’s torch.compile() feature. Compiling the model doesn’t directly lower memory, but it can significantly speed up inference. PyTorch 2.0’s compile (Torch Dynamo) works by tracing and optimizing the model graph ahead-of-time.
 
 **torchao + `torch.compile`**:
@@ -565,6 +592,13 @@ pipe = FluxPipeline.from_pretrained(
 | int8_weight_only | 17.020 GB | 22.473 GB | 8 seconds | ~851 seconds |
 | float8_weight_only | 17.016 GB | 22.115 GB | 8 seconds | ~545 seconds |
 
+**bitsandbytes + `torch.compile`**:
+
+> **Note:** To enable compatibility with `torch.compile`, make sure you're using the latest version of bitsandbytes and a PyTorch nightly (2.8).
+
+| `bitsandbytes` 4-bit | Peak Memory | Inference Time |
+|-----|-----|------|
+| **Without `torch.compile`** | 14.968 GB | 12.695 seconds |
+| **With `torch.compile`** | 14.968 GB | 8.674 seconds |
+
 Explore some benchmarking results here:
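+
+As a rough reference for how peak memory figures like those in the tables above can be collected, here is a sketch using PyTorch's CUDA memory statistics (it assumes the `pipe` from the compile example; the prompt and settings are illustrative, and this is not the exact script behind the reported numbers):
+
+```python
+import torch
+
+torch.cuda.reset_peak_memory_stats()
+_ = pipe("A cat holding a sign that says hello world", num_inference_steps=28, guidance_scale=3.5)
+
+peak_memory_gb = torch.cuda.max_memory_allocated() / 1024**3
+print(f"Peak memory: {peak_memory_gb:.3f} GB")
+```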