fix mask with flash attn #2


Merged: 1 commit into stduhpf:chroma-support on Jun 3, 2025

Conversation

@Green-Sky commented Jun 3, 2025

The pad requirement was 2816 for the chroma/flux mask.
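For context: ggml's ggml_flash_attn_ext asserts that the mask's row count is padded, something like mask->ne[1] >= GGML_PAD(n_queries, GGML_KQ_MASK_PAD). A minimal sketch of the idea, not the actual diff (GGML_PAD and GGML_KQ_MASK_PAD come from ggml.h; the helper name is made up):

    #include "ggml.h"

    // Hypothetical helper: allocate an attention mask that satisfies the
    // flash-attention padding assert. The caller must fill the padding rows
    // with 0.0f so they do not mask anything.
    static struct ggml_tensor * build_fa_mask(struct ggml_context * ctx,
                                              int64_t n_kv, int64_t n_queries) {
        // GGML_PAD rounds up to the next multiple of the pad constant
        const int64_t n_queries_pad = GGML_PAD(n_queries, GGML_KQ_MASK_PAD);
        // most backends expect the fa mask in f16
        return ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_kv, n_queries_pad);
    }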

@stduhpf (Owner) commented Jun 3, 2025

@Green-Sky I think something is wrong somewhere. I get black images on both Vulkan and CPU (using updated ggml for both) when using masking.

@Green-Sky (Author)

With flash attention or without?

@stduhpf (Owner) commented Jun 3, 2025

> With flash attention or without?

With flash attention. Without masking, fa works fine, and without fa, masking doesn't cause issues.

@Green-Sky (Author) commented Jun 3, 2025

Oh, updated ggml, hmm. Time to look at all the changes?
Edit: ah

@Green-Sky (Author)

cuda:

mask: [image: mask_output]
mask fa: [image: fa_mask_output]
fa: [image: fa_output]

@stduhpf (Owner) commented Jun 3, 2025

On outdated (current) GGML, fa doesn't work at all on Vulkan, but CPU results are still black. Can you reproduce it (with low resolution, 1 step, and cfg_scale 1)?

@Green-Sky (Author)

> On outdated (current) GGML, fa doesn't work at all on Vulkan, but CPU results are still black. Can you reproduce it (with low resolution, 1 step, and cfg_scale 1)?

How small? Because 64x64 just crashes.

@stduhpf (Owner) commented Jun 3, 2025

> How small? Because 64x64 just crashes.

I meant the standard 512x512.
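For reference, a repro along these lines (flag names assumed from stable-diffusion.cpp's CLI, model paths from the log further down; adjust to your build):

    ./sd --diffusion-model models/chroma-unlocked-v33-q4_k.gguf \
         --t5xxl models/flux-extra/t5xxl_q8_0.gguf \
         --vae models/flux-extra/ae-f16.gguf \
         -p "Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus." \
         -W 512 -H 512 --steps 1 --cfg-scale 1 -s 42 --diffusion-fa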

@Green-Sky (Author) commented Jun 3, 2025

1 step, cfg 1, 512x512 runs:

cuda:

nomask nofa: [image: cuda_nomask_nofa]
nomask fa: [image: cuda_nomask_fa]
mask nofa: [image: cuda_mask]
mask fa: [image: cuda_mask_fa]

cpu:

nomask nofa: [image: cpu_nomask_nofa]
nomask fa: [image: cpu_nomask_fa]
mask nofa: [image: cpu_mask_nofa]
mask fa: [image: cpu_mask_fa]

This certainly does not look good for fa on CPU.

@stduhpf (Owner) commented Jun 3, 2025

Interesting, cpu mask fa is pitch black for me.

@stduhpf (Owner) commented Jun 3, 2025

I'll suppose it's an upstream GGML problem; let's merge it for CUDA users.

@stduhpf merged commit 7951daa into stduhpf:chroma-support on Jun 3, 2025
@Green-Sky (Author)

ASan is clean too (besides the memory leak, which might be worth fixing, btw).

Option:
    n_threads:         8
    mode:              txt2img
    model_path:
    wtype:             unspecified
    clip_l_path:
    clip_g_path:
    t5xxl_path:        models/flux-extra/t5xxl_q8_0.gguf
    diffusion_model_path:   /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf
    vae_path:          models/flux-extra/ae-f16.gguf
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio:       20.00
    normalize input image :  false
    output_path:       output.png
    init_img:
    mask_img:
    control_image:
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    diffusion flash attention:true
    strength(control): 0.90
    prompt:            Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus.
    negative_prompt:   low quality, ugly, unfinished, out of focus, blury, text, sketch, cartoony, bad anatomy, amateurish
    min_cfg:           1.00
    cfg_scale:         1.00
    slg_scale:         0.00
    guidance:          0.00
    eta:               0.00
    clip_skip:         -1
    width:             512
    height:            512
    sample_method:     euler
    schedule:          default
    sample_steps:      1
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        true
    upscale_repeats:   1
System Info:
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:188  - Using CPU backend
[INFO ] stable-diffusion.cpp:218  - loading t5xxl from 'models/flux-extra/t5xxl_q8_0.gguf'
[INFO ] model.cpp:905  - load models/flux-extra/t5xxl_q8_0.gguf using gguf format
[DEBUG] model.cpp:922  - init from 'models/flux-extra/t5xxl_q8_0.gguf'
[INFO ] stable-diffusion.cpp:225  - loading diffusion model from '/run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf'
[INFO ] model.cpp:905  - load /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf using gguf format
[DEBUG] model.cpp:922  - init from '/run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf'
[INFO ] stable-diffusion.cpp:232  - loading vae from 'models/flux-extra/ae-f16.gguf'
[INFO ] model.cpp:905  - load models/flux-extra/ae-f16.gguf using gguf format
[DEBUG] model.cpp:922  - init from 'models/flux-extra/ae-f16.gguf'
[INFO ] stable-diffusion.cpp:244  - Version: Flux
[INFO ] stable-diffusion.cpp:277  - Weight type:                 q8_0
[INFO ] stable-diffusion.cpp:278  - Conditioner weight type:     q8_0
[INFO ] stable-diffusion.cpp:279  - Diffusion model weight type: q4_K
[INFO ] stable-diffusion.cpp:280  - VAE weight type:             f16
[DEBUG] stable-diffusion.cpp:282  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:328  - Using flash attention in the diffusion model
[INFO ] flux.hpp:1050 - Flux blocks: 19 double, 38 single
[INFO ] flux.hpp:1052 - Using pruned modulation (Chroma)
[DEBUG] ggml_extend.hpp:1186 - t5 params backend buffer size =  4826.11 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1186 - flux params backend buffer size =  4824.80 MB(RAM) (643 tensors)
[DEBUG] ggml_extend.hpp:1186 - vae params backend buffer size =  94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:430  - loading weights
[DEBUG] model.cpp:1727 - loading tensors from models/flux-extra/t5xxl_q8_0.gguf
  |=========>                                        | 217/1107 - 0.00it/s[INFO ] model.cpp:1897 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | q8_0 | 2 [4096, 32128, 1, 1, 1]' in model file
  |=========>                                        | 220/1107 - 9.52it/s[DEBUG] model.cpp:1727 - loading tensors from /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/chroma-unlocked-v33-q4_k.gguf
  |======================================>           | 863/1107 - 26.32it/s[DEBUG] model.cpp:1727 - loading tensors from models/flux-extra/ae-f16.gguf
  |=============================================>    | 1001/1107 - 200.00it/s[INFO ] stable-diffusion.cpp:514  - total params memory size = 9745.49MB (VRAM 0.00MB, RAM 9745.49MB): clip 4826.11MB(RAM), unet 4824.80MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:533  - loading model from '' completed, taking 9.19s
[INFO ] stable-diffusion.cpp:554  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:611  - finished loaded file
[DEBUG] stable-diffusion.cpp:1559 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1252 - prompt after extract and remove lora: "Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus."
[INFO ] stable-diffusion.cpp:701  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1257 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1267 - parse 'Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus.' to [['Photograph of the Alps. Vivid summer afternoon. Everything is perfectly in focus.', 1], ]
[DEBUG] t5.hpp:398  - token length: 512
[DEBUG] ggml_extend.hpp:1138 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1354 - computing condition graph completed, taking 20733 ms
[INFO ] stable-diffusion.cpp:1390 - get_learned_condition completed, taking 20745 ms
[INFO ] stable-diffusion.cpp:1413 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1450 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp:819  - Sample
[DEBUG] flux.hpp:1099 - Forcing guidance to 0 for chroma model (SD_CHROMA_ENABLE_GUIDANCE env variable to "ON" to enable)
[DEBUG] ggml_extend.hpp:1138 - flux compute buffer size: 324.34 MB(RAM)
[DEBUG] flux.hpp:1099 - Forcing guidance to 0 for chroma model (SD_CHROMA_ENABLE_GUIDANCE env variable to "ON" to enable)
  |==================================================| 1/1 - 135.67s/it
[INFO ] stable-diffusion.cpp:1489 - sampling completed, taking 136.02s
[INFO ] stable-diffusion.cpp:1497 - generating 1 latent images completed, taking 137.70s
[INFO ] stable-diffusion.cpp:1500 - decoding 1 latents
[DEBUG] ggml_extend.hpp:616  - tile work buffer size: 0.81 MB
[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
[INFO ] ggml_extend.hpp:630  - processing 16 tiles
  |===>                                              | 1/16 - 0.00it/s[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |===>                                              | 1/16 - 7.34s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |======>                                           | 2/16 - 7.40s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |=========>                                        | 3/16 - 7.31s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |============>                                     | 4/16 - 7.42s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |===============>                                  | 5/16 - 7.34s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |==================>                               | 6/16 - 7.28s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |=====================>                            | 7/16 - 7.37s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |=========================>                        | 8/16 - 7.33s/it[DEBUG] ggml_extend.hpp:1138 - vae compute buffer size: 416.00 MB(RAM)
  |==================================================| 16/16 - 7.27s/it
[DEBUG] stable-diffusion.cpp:1101 - computing vae [mode: DECODE] graph completed, taking 73.30s
[INFO ] stable-diffusion.cpp:1510 - latent 1 decoded, taking 73.30s
[INFO ] stable-diffusion.cpp:1514 - decode_first_stage completed, taking 73.30s
[INFO ] stable-diffusion.cpp:1639 - txt2img completed in 231.75s
save result PNG image to 'output.png'

=================================================================
==595720==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f27920dc04f in __interceptor_malloc (/nix/store/mhd0rk497xm0xnip7262xdw9bylvzh99-gcc-13.3.0-lib/lib/libasan.so.8+0xdc04f)
    #1 0x534d8e in ggml_malloc ggml/src/ggml.c:321
    #2 0x6020076c4c97  (<unknown module>)

Indirect leak of 789504 byte(s) in 1 object(s) allocated from:
    #0 0x7f27920db66d in posix_memalign (/nix/store/mhd0rk497xm0xnip7262xdw9bylvzh99-gcc-13.3.0-lib/lib/libasan.so.8+0xdb66d)
    #1 0x534fe6 in ggml_aligned_malloc ggml/src/ggml.c:278

SUMMARY: AddressSanitizer: 789544 byte(s) leaked in 2 allocation(s).
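That trace has the classic shape of a ggml_context that never gets ggml_free'd: the small direct leak would be the context header from ggml_malloc, and the big indirect one its backing buffer from ggml_aligned_malloc. A hypothetical sketch of the pattern (not the actual leaking call site):

    #include "ggml.h"

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16 * 1024 * 1024, // backing buffer -> ggml_aligned_malloc
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);
        // ... build tensors / graphs ...
        ggml_free(ctx); // dropping this line reproduces both leak entries
        return 0;
    }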

@Green-Sky (Author)

It looks like upstream does not test the values in the mask for flash attention, only the fact that a mask exists (with nothing masked):
https://github.com/ggml-org/ggml/blob/988abe2ab374544af09d42aa7491dceaf6be04a1/tests/test-backend-ops.cpp#L3347-L3351
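A sketch of what a value-bearing mask for such a test might look like: set some positions to -INFINITY instead of leaving the whole mask at zero, so a backend that mishandles mask values actually fails. ggml_fp32_to_fp16 is the real ggml conversion helper; the sizes and which positions get masked are made up for illustration:

    #include "ggml.h"
    #include <math.h>

    // Fill an f16 mask tensor of shape [n_kv, n_pad] with real mask values:
    // the second half of the kv positions gets -inf, the rest stays 0.0f.
    static void fill_mask(struct ggml_tensor * mask) {
        ggml_fp16_t * data = (ggml_fp16_t *) mask->data;
        const ggml_fp16_t neg_inf = ggml_fp32_to_fp16(-INFINITY);
        const ggml_fp16_t zero    = ggml_fp32_to_fp16(0.0f);
        for (int64_t i1 = 0; i1 < mask->ne[1]; i1++) {
            for (int64_t i0 = 0; i0 < mask->ne[0]; i0++) {
                data[i1*mask->ne[0] + i0] = (i0 >= mask->ne[0]/2) ? neg_inf : zero;
            }
        }
    }

With an all-zero mask, a backend that ignores the mask values entirely still passes the test.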
