fix mask with flash attn #2


Merged: 1 commit into stduhpf:chroma-support on Jun 3, 2025

Conversation

@Green-Sky commented Jun 3, 2025

The pad requirement was 2816 for the chroma/flux mask.
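For context: ggml's ggml_flash_attn_ext asserts that the mask's row count is padded, something like mask->ne[1] >= GGML_PAD(n_queries, GGML_KQ_MASK_PAD). A minimal sketch of the idea, not the actual diff (GGML_PAD and GGML_KQ_MASK_PAD come from ggml.h; the helper name is made up):

    #include "ggml.h"

    // Hypothetical helper: allocate an attention mask that satisfies the
    // flash-attention padding assert. The caller must fill the padding rows
    // with 0.0f so they do not mask anything.
    static struct ggml_tensor * build_fa_mask(struct ggml_context * ctx,
                                              int64_t n_kv, int64_t n_queries) {
        // GGML_PAD rounds up to the next multiple of the pad constant
        const int64_t n_queries_pad = GGML_PAD(n_queries, GGML_KQ_MASK_PAD);
        // most backends expect the fa mask in f16
        return ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_kv, n_queries_pad);
    }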

@stduhpf (Owner) commented Jun 3, 2025

@Green-Sky I think something is wrong somewhere. I get black images on both Vulkan and CPU (using updated ggml for both) when using masking.

@Green-Sky (Author)

With flash attention or without?

@stduhpf (Owner) commented Jun 3, 2025

> With flash attention or without?

With flash attention. Without masking, fa works fine, and without fa, masking doesn't cause issues.

@Green-Sky (Author) commented Jun 3, 2025

Oh, updated ggml, hmm. Time to look at all the changes?
Edit: ah

@Green-Sky (Author)

cuda:

mask: [image: mask_output]
mask fa: [image: fa_mask_output]
fa: [image: fa_output]

@stduhpf (Owner) commented Jun 3, 2025

On outdated (current) GGML, fa doesn't work at all on Vulkan, but CPU results are still black. Can you reproduce it (with low resolution, 1 step, and cfg_scale 1)?

@Green-Sky (Author)

> On outdated (current) GGML, fa doesn't work at all on Vulkan, but CPU results are still black. Can you reproduce it (with low resolution, 1 step, and cfg_scale 1)?

How small? Because 64x64 just crashes.

@stduhpf (Owner) commented Jun 3, 2025

> How small? Because 64x64 just crashes.

I meant the standard 512x512.
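For reference, a repro along these lines (flag names assumed from stable-diffusion.cpp's CLI, model paths from the log further down; adjust to your build):

    ./sd --diffusion-model models/chroma-unlocked-v33-q4_k.gguf \
         --t5xxl models/flux-extra/t5xxl_q8_0.gguf \
         --vae models/flux-extra/ae-f16.gguf \
         -p "Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus." \
         -W 512 -H 512 --steps 1 --cfg-scale 1 -s 42 --diffusion-fa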

@Green-Sky (Author) commented Jun 3, 2025

1 step, cfg 1, 512x512 runs:

cuda:

nomask nofa: [image: cuda_nomask_nofa]
nomask fa: [image: cuda_nomask_fa]
mask nofa: [image: cuda_mask]
mask fa: [image: cuda_mask_fa]

cpu:

nomask nofa: [image: cpu_nomask_nofa]
nomask fa: [image: cpu_nomask_fa]
mask nofa: [image: cpu_mask_nofa]
mask fa: [image: cpu_mask_fa]

This certainly does not look good for fa on CPU.

@stduhpf (Owner) commented Jun 3, 2025

Interesting, cpu mask fa is pitch black for me.

@stduhpf (Owner) commented Jun 3, 2025

I'll suppose it's an upstream GGML problem; let's merge it for CUDA users.

@stduhpf merged commit 7951daa into stduhpf:chroma-support on Jun 3, 2025
@Green-Sky (Author)

ASan is clean too (besides the memory leak, which might be worth fixing, btw).

Option:
    n_threads:         8
    mode:              txt2img
    model_path:
    wtype:             unspecified
    clip_l_path:
    clip_g_path:
    t5xxl_path:        models/flux-extra/t5xxl_q8_0.gguf
    diffusion_model_path:   /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf
    vae_path:          models/flux-extra/ae-f16.gguf
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio:       20.00
    normalize input image :  false
    output_path:       output.png
    init_img:
    mask_img:
    control_image:
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    diffusion flash attention:true
    strength(control): 0.90
    prompt:            Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus.
    negative_prompt:   low quality, ugly, unfinished, out of focus, blury, text, sketch, cartoony, bad anatomy, amateurish
    min_cfg:           1.00
    cfg_scale:         1.00
    slg_scale:         0.00
    guidance:          0.00
    eta:               0.00
    clip_skip:         -1
    width:             512
    height:            512
    sample_method:     euler
    schedule:          default
    sample_steps:      1
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        true
    upscale_repeats:   1
System Info:
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:188  - Using CPU backend
[INFO ] stable-diffusion.cpp:218  - loading t5xxl from 'models/flux-extra/t5xxl_q8_0.gguf'
[INFO ] model.cpp:905  - load models/flux-extra/t5xxl_q8_0.gguf using gguf format
[DEBUG] model.cpp:922  - init from 'models/flux-extra/t5xxl_q8_0.gguf'
[INFO ] stable-diffusion.cpp:225  - loading diffusion model from '/run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf'
[INFO ] model.cpp:905  - load /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf using gguf format
[DEBUG] model.cpp:922  - init from '/run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf'
[INFO ] stable-diffusion.cpp:232  - loading vae from 'models/flux-extra/ae-f16.gguf'
[INFO ] model.cpp:905  - load models/flux-extra/ae-f16.gguf using gguf format
[DEBUG] model.cpp:922  - init from 'models/flux-extra/ae-f16.gguf'
[INFO ] stable-diffusion.cpp:244  - Version: Flux
[INFO ] stable-diffusion.cpp:277  - Weight type:                 q8_0
[INFO ] stable-diffusion.cpp:278  - Conditioner weight type:     q8_0
[INFO ] stable-diffusion.cpp:279  - Diffusion model weight type: q4_K
[INFO ] stable-diffusion.cpp:280  - VAE weight type:             f16
[DEBUG] stable-diffusion.cpp:282  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:328  - Using flash attention in the diffusion model
[INFO ] flux.hpp:1050 - Flux blocks: 19 double, 38 single
[INFO ] flux.hpp:1052 - Using pruned modulation (Chroma)
[DEBUG] ggml_extend.hpp:1186 - t5 params backend buffer size =  4826.11 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1186 - flux params backend buffer size =  4824.80 MB(RAM) (643 tensors)
[DEBUG] ggml_extend.hpp:1186 - vae params backend buffer size =  94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:430  - loading weights
[DEBUG] model.cpp:1727 - loading tensors from models/flux-extra/t5xxl_q8_0.gguf
  |=========>                                        | 217/1107 - 0.00it/s[INFO ] model.cpp:1897 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | q8_0 | 2 [4096, 32128, 1, 1, 1]' in model file
  |=========>                                        | 220/1107 - 9.52it/s[DEBUG] model.cpp:1727 - loading tensors from /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf
  |======================================>           | 863/1107 - 26.32it/s[DEBUG] model.cpp:1727 - loading tensors from models/flux-extra/ae-f16.gguf
  |=============================================>    | 1001/1107 - 200.00it/s[INFO ] stable-diffusion.cpp:514  - total params memory size = 9745.49MB (VRAM 0.00MB, RAM 9745.49MB): clip 4826.11MB(RAM), unet 4824.80MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:533  - loading model from '' completed, taking 9.19s
[INFO ] stable-diffusion.cpp:554  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:611  - finished loaded file
[DEBUG] stable-diffusion.cpp:1559 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1252 - prompt after extract and remove lora: "Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus."
[INFO ] stable-diffusion.cpp:701  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1257 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1267 - parse 'Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus.' to [['Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus.', 1], ]
[DEBUG] t5.hpp:398  - token length: 512
[DEBUG] ggml_extend.hpp:1138 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1354 - computing condition graph completed, taking 20733 ms
[INFO ] stable-diffusion.cpp:1390 - get_learned_condition completed, taking 20745 ms
[INFO ] stable-diffusion.cpp:1413 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1450 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp:819  - Sample
[DEBUG] flux.hpp:1099 - Forcing guidance to 0 for chroma model (SD_CHROMA_ENABLE_GUIDANCE env variable to "ON" to enable)
[DEBUG] ggml_extend.hpp:1138 - flux compute buffer size: 324.34 MB(RAM)
[DEBUG] flux.hpp:1099 - Forcing guidance to 0 for chroma model (SD_CHROMA_ENABLE_GUIDANCE env variable to "ON" to enable)
  |==================================================| 1/1 - 135.67s/it
[INFO ] stable-diffusion.cpp:1489 - sampling completed, taking 136.02s
[INFO ] stable-diffusion.cpp:1497 - generating 1 latent images completed, taking 137.70s
[INFO ] stable-diffusion.cpp:1500 - decoding 1 latents
[DEBUG] ggml_extend.hpp:616  - tile work buffer size: 0.81 MB
[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
[INFO ] ggml_extend.hpp:630  - processing 16 tiles
  |===>                                              | 1/16 - 0.00it/s[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |===>                                              | 1/16 - 7.34s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |======>                                           | 2/16 - 7.40s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |=========>                                        | 3/16 - 7.31s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |============>                                     | 4/16 - 7.42s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |===============>                                  | 5/16 - 7.34s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |==================>                               | 6/16 - 7.28s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |=====================>                            | 7/16 - 7.37s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |=========================>                        | 8/16 - 7.33s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |==================================================| 16/16 - 7.27s/it
[DEBUG] stable-diffusion.cpp:1101 - computing vae [mode: DECODE] graph completed, taking 73.30s
[INFO ] stable-diffusion.cpp:1510 - latent 1 decoded, taking 73.30s
[INFO ] stable-diffusion.cpp:1514 - decode_first_stage completed, taking 73.30s
[INFO ] stable-diffusion.cpp:1639 - txt2img completed in 231.75s
save result PNG image to 'output.png'

=================================================================
==595720==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f27920dc04f in __interceptor_malloc (/nix/store/mhd0rk497xm0xnip7262xdw9bylvzh99-gcc-13.3.0-lib/lib/libasan.so.8+0xdc04f)
    #1 0x534d8e in ggml_malloc ggml/src/ggml.c:321
    #2 0x6020076c4c97  (<unknown module>)

Indirect leak of 789504 byte(s) in 1 object(s) allocated from:
    #0 0x7f27920db66d in posix_memalign (/nix/store/mhd0rk497xm0xnip7262xdw9bylvzh99-gcc-13.3.0-lib/lib/libasan.so.8+0xdb66d)
    #1 0x534fe6 in ggml_aligned_malloc ggml/src/ggml.c:278

SUMMARY: AddressSanitizer: 789544 byte(s) leaked in 2 allocation(s).
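That trace has the classic shape of a ggml_context that never gets ggml_free'd: the small direct leak would be the context header from ggml_malloc, and the big indirect one its backing buffer from ggml_aligned_malloc. A hypothetical sketch of the pattern (not the actual leaking call site):

    #include "ggml.h"

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16 * 1024 * 1024, // backing buffer -> ggml_aligned_malloc
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);
        // ... build tensors / graphs ...
        ggml_free(ctx); // dropping this line reproduces both leak entries
        return 0;
    }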

@Green-Sky (Author)

It looks like upstream does not test the values in the mask for flash attention, only the fact that a mask exists (with nothing masked):
https://github.com/ggml-org/ggml/blob/988abe2ab374544af09d42aa7491dceaf6be04a1/tests/test-backend-ops.cpp#L3347-L3351
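A sketch of what a value-bearing mask for such a test might look like: set some positions to -INFINITY instead of leaving the whole mask at zero, so a backend that mishandles mask values actually fails. ggml_fp32_to_fp16 is the real ggml conversion helper; the sizes and which positions get masked are made up for illustration:

    #include "ggml.h"
    #include <math.h>

    // Fill an f16 mask tensor of shape [n_kv, n_pad] with real mask values:
    // the second half of the kv positions gets -inf, the rest stays 0.0f.
    static void fill_mask(struct ggml_tensor * mask) {
        ggml_fp16_t * data = (ggml_fp16_t *) mask->data;
        const ggml_fp16_t neg_inf = ggml_fp32_to_fp16(-INFINITY);
        const ggml_fp16_t zero    = ggml_fp32_to_fp16(0.0f);
        for (int64_t i1 = 0; i1 < mask->ne[1]; i1++) {
            for (int64_t i0 = 0; i0 < mask->ne[0]; i0++) {
                data[i1*mask->ne[0] + i0] = (i0 >= mask->ne[0]/2) ? neg_inf : zero;
            }
        }
    }

With an all-zero mask, a backend that ignores the mask values entirely still passes the test.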
