llama : add high-throughput mode #14363

Open · ggerganov wants to merge 9 commits into master from gg/llama-high-throughput

Conversation

ggerganov (Member) commented on Jun 24, 2025

target #14285

Overview

Improve multi-sequence decoding performance by avoiding the cross-sequence attention compute.

Note

The functionality currently requires LLAMA_SET_ROWS from #14285 and support for ggml_soft_max_ext() / ggml_flash_attn_ext() broadcast (#14435). Backend support status:

  • CPU
  • Metal
  • CUDA
    • ggml_set_rows()
    • ggml_soft_max_ext()
    • ggml_flash_attn_ext()
  • Vulkan
    • ggml_set_rows()
    • ggml_soft_max_ext()
    • ggml_flash_attn_ext()
  • OpenCL
    • ggml_set_rows()
    • ggml_soft_max_ext()
    • ggml_flash_attn_ext()
  • SYCL
    • ggml_set_rows()
    • ggml_soft_max_ext()
    • ggml_flash_attn_ext()

Description

One significant drawback of the unified KV cache is that it leads to a lot of unnecessary computation in the attention when the unified buffer is shared between many large independent sequences. The reason is that we have to view this buffer contiguously, so we end up computing large portions of "cross-sequence attention" which we then simply discard.

With this change, we add an option to split the unified KV cache buffer into multiple buffers - one for each sequence. This decouples the sequences from each other and improves the performance and memory usage of the attention when more than one sequence is used. To achieve that, when the batch reaches the attention, we split it into multiple "streams":

llama.cpp/src/llama-graph.cpp

Lines 1035 to 1044 in c96c48c

// split the batch into streams if needed
const auto n_stream = k->ne[3];
q = ggml_reshape_4d(ctx0, q, q->ne[0], q->ne[1], q->ne[2]/n_stream, n_stream);
q = ggml_permute(ctx0, q, 0, 2, 1, 3);
k = ggml_permute(ctx0, k, 0, 2, 1, 3);
v = ggml_permute(ctx0, v, 0, 2, 1, 3);

Each stream has its own KV cache buffer and thus no longer "sees" the other streams - it attends only to the tokens that belong to the same stream.
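
To make the shape bookkeeping concrete, here is a small standalone sketch (illustrative sizes only; it assumes Q arrives as [n_embd_head, n_head, n_tokens, 1] at this point, which is inferred from the quoted code rather than stated explicitly in the PR) showing how the reshape + permute turns the token dimension into a per-stream token dimension:

#include <cstdio>
#include "ggml.h"

int main() {
    // metadata-only context: we only want to inspect shapes, not compute
    ggml_init_params ip = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ nullptr, /*no_alloc*/ true };
    ggml_context * ctx = ggml_init(ip);

    const int64_t n_embd_head = 128, n_head = 8, n_tokens = 64, n_stream = 4;

    // Q for the whole ubatch (assumed layout): [n_embd_head, n_head, n_tokens, 1]
    ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_head, n_head, n_tokens, 1);

    // group the tokens by stream (this is why the ubatch needs contiguous stream ids) ...
    q = ggml_reshape_4d(ctx, q, q->ne[0], q->ne[1], q->ne[2]/n_stream, n_stream);
    // ... then swap axes 1 and 2 to get [n_embd_head, n_tokens/n_stream, n_head, n_stream]
    q = ggml_permute(ctx, q, 0, 2, 1, 3);

    printf("q: [%lld, %lld, %lld, %lld]\n",
            (long long) q->ne[0], (long long) q->ne[1], (long long) q->ne[2], (long long) q->ne[3]);

    ggml_free(ctx);
    return 0;
}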

With this approach we now have 2 modes:

  • The vanilla "unified" approach which we always used until now - all sequences are assigned to a single stream
  • The new approach - each sequence is assigned to a separate stream

To enable the new mode, simply add the --attn-streams CLI arg to the llama.cpp tools. It should generally perform better for multi-user or multi-sequence scenarios.

API Changes

  • Add bool llama_context_params::attn_streams. Default is false
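
For API consumers, enabling the new mode from code would look roughly like the sketch below. The attn_streams field is the one added by this PR; the surrounding calls (llama_model_load_from_file(), llama_context_default_params(), llama_init_from_model(), ...) are the usual llama.h entry points, but double-check the exact names against the header you are building against:

#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        return 1;
    }

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_seq_max    = 32;    // one client per sequence
    cparams.attn_streams = true;  // new in this PR: one KV cache stream per sequence

    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... decode batches for up to 32 independent sequences ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}

This is the programmatic equivalent of passing --attn-streams to the tools.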

Testing

Define the LLAMA_SET_ROWS=1 environment variable and add the --attn-streams argument:

Qwen 2.5 Coder 3B Q8_0, M2 Ultra

# master
make -j && LLAMA_SET_ROWS=1 ./bin/llama-batched-bench -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -c 133120 -b 2048 -ub 2048 -npp 0,0,512,1024,2048,4096 -ntg 32 -npl 32 -fa

0.00.604.032 I llama_kv_cache_unified:      Metal KV buffer size =  4680.00 MiB
0.00.953.209 I llama_kv_cache_unified: size = 4680.00 MiB (133120 cells,  36 layers, 32 seqs), K (f16): 2340.00 MiB, V (f16): 2340.00 MiB
0.01.016.945 I llama_context:      Metal compute buffer size =  1624.05 MiB
0.01.016.947 I llama_context:        CPU compute buffer size =  1056.05 MiB
0.01.016.947 I llama_context: graph nodes  = 1195
0.01.016.947 I llama_context: graph splits = 2
main: n_kv_max = 133120, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.403 |   729.71 |    1.403 |   729.66 |
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.381 |   741.44 |    1.381 |   741.37 |
|   512 |     32 |   32 |  17408 |    5.320 |  3079.72 |    2.052 |   498.98 |    7.372 |  2361.33 |
|  1024 |     32 |   32 |  33792 |   11.632 |  2817.15 |    2.715 |   377.16 |   14.347 |  2355.40 |
|  2048 |     32 |   32 |  66560 |   27.419 |  2390.20 |    4.052 |   252.73 |   31.470 |  2115.00 |
|  4096 |     32 |   32 | 132096 |   71.549 |  1831.92 |    6.664 |   153.66 |   78.213 |  1688.93 |


# PR (with --attn-streams)
make -j && LLAMA_SET_ROWS=1 ./bin/llama-batched-bench -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -c 133120 -b 2048 -ub 2048 -npp 0,0,512,1024,2048,4096 -ntg 32 -npl 32 -fa -as

0.00.584.467 I llama_kv_cache_unified:      Metal KV buffer size =  4896.00 MiB
0.00.952.799 I llama_kv_cache_unified: size = 4896.00 MiB (  4352 cells,  36 layers, 32/32 seqs), K (f16): 2448.00 MiB, V (f16): 2448.00 MiB
0.01.002.436 I llama_context:      Metal compute buffer size =  1219.00 MiB
0.01.002.438 I llama_context:        CPU compute buffer size =    50.05 MiB
0.01.002.438 I llama_context: graph nodes  = 1231
0.01.002.438 I llama_context: graph splits = 2
main: n_kv_max = 139264, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.339 |   764.92 |    1.339 |   764.85 |
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.332 |   768.79 |    1.332 |   768.69 |
|   512 |     32 |   32 |  17408 |    4.903 |  3341.42 |    1.499 |   682.93 |    6.403 |  2718.84 |
|  1024 |     32 |   32 |  33792 |   10.057 |  3258.12 |    1.569 |   652.46 |   11.627 |  2906.40 |
|  2048 |     32 |   32 |  66560 |   21.213 |  3089.47 |    1.754 |   583.79 |   22.967 |  2898.10 |
|  4096 |     32 |   32 | 132096 |   46.713 |  2805.91 |    2.107 |   486.09 |   48.819 |  2705.81 |

Gemma 3 4B Q8_0, M2 Ultra

# master
make -j && LLAMA_SET_ROWS=1 ./bin/llama-batched-bench -m ../models/gemma-3-4b/ggml-model-q8_0.gguf -c 133120 -b 2048 -ub 2048 -npp 0,0,512,1024,2048,4096 -ntg 32 -npl 32 -fa

0.01.609.907 I llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 133120 cells
0.01.703.014 I llama_kv_cache_unified:      Metal KV buffer size =  2600.00 MiB
0.01.902.274 I llama_kv_cache_unified: size = 2600.00 MiB (133120 cells,   5 layers, 32 seqs), K (f16): 1300.00 MiB, V (f16): 1300.00 MiB
0.01.902.278 I llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 34816 cells
0.02.040.114 I llama_kv_cache_unified:      Metal KV buffer size =  3944.00 MiB
0.02.325.408 I llama_kv_cache_unified: size = 3944.00 MiB ( 34816 cells,  29 layers, 32 seqs), K (f16): 1972.00 MiB, V (f16): 1972.00 MiB
0.02.403.614 I llama_context:      Metal compute buffer size =  2068.00 MiB
0.02.403.616 I llama_context:        CPU compute buffer size =  1332.09 MiB
0.02.403.617 I llama_context: graph nodes  = 1335
0.02.403.617 I llama_context: graph splits = 2
main: n_kv_max = 133120, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.843 |   555.52 |    1.844 |   555.44 |
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.800 |   569.00 |    1.800 |   568.94 |
|   512 |     32 |   32 |  17408 |    6.341 |  2583.88 |    3.601 |   284.33 |    9.942 |  1750.90 |
|  1024 |     32 |   32 |  33792 |   13.832 |  2369.03 |    5.442 |   188.18 |   19.273 |  1753.29 |
|  2048 |     32 |   32 |  66560 |   31.034 |  2111.78 |    6.343 |   161.43 |   37.377 |  1780.77 |
|  4096 |     32 |   32 | 132096 |   69.326 |  1890.65 |    7.456 |   137.33 |   76.783 |  1720.39 |

# PR (with --attn-streams)
make -j && LLAMA_SET_ROWS=1 ./bin/llama-batched-bench -m ../models/gemma-3-4b/ggml-model-q8_0.gguf -c 133120 -b 2048 -ub 2048 -npp 0,0,512,1024,2048,4096 -ntg 32 -npl 32 -fa -as

0.00.505.130 I llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4352 cells
0.00.603.948 I llama_kv_cache_unified:      Metal KV buffer size =  2720.00 MiB
0.00.813.515 I llama_kv_cache_unified: size = 2720.00 MiB (  4352 cells,   5 layers, 32/32 seqs), K (f16): 1360.00 MiB, V (f16): 1360.00 MiB
0.00.813.520 I llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 3072 cells
0.01.198.824 I llama_kv_cache_unified:      Metal KV buffer size = 11136.00 MiB
0.01.986.031 I llama_kv_cache_unified: size = 11136.00 MiB (  3072 cells,  29 layers, 32/32 seqs), K (f16): 5568.00 MiB, V (f16): 5568.00 MiB
0.02.059.335 I llama_context:      Metal compute buffer size =  2068.00 MiB
0.02.059.340 I llama_context:        CPU compute buffer size =    78.09 MiB
0.02.059.340 I llama_context: graph nodes  = 1369
0.02.059.340 I llama_context: graph splits = 2
main: n_kv_max = 139264, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.577 |   649.36 |    1.577 |   649.26 |
|     0 |     32 |   32 |   1024 |    0.000 |     0.00 |    1.568 |   652.99 |    1.568 |   652.86 |
|   512 |     32 |   32 |  17408 |    5.884 |  2784.73 |    1.769 |   578.77 |    7.653 |  2274.73 |
|  1024 |     32 |   32 |  33792 |   12.261 |  2672.46 |    1.874 |   546.44 |   14.135 |  2390.61 |
|  2048 |     32 |   32 |  66560 |   25.831 |  2537.12 |    1.962 |   522.01 |   27.793 |  2394.89 |
|  4096 |     32 |   32 | 132096 |   54.077 |  2423.79 |    2.065 |   496.00 |   56.142 |  2352.90 |

A more realistic test using llama-parallel:

# master
make -j && ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 32 -ns 128 -s 1 -c 16384 -fa

# PR
make -j && LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 32 -ns 128 -s 1 -c 4096 -fa -as

TODO

  • FA path
  • Non-FA path
  • Metal FA
  • Metal non-FA
  • CPU FA
  • CPU non-FA
  • ggml_soft_max_ext() support for virtual sequences
  • llama_memory_seq_cp support for virtual sequences
  • iSWA
  • split_equal support sequential ids
  • CUDA
  • Vulkan
  • etc.
  • more consistent sequence/virtual sequence naming
  • better term than "virtual sequence"?
  • env LLAMA_HT become regular compute parameter
  • Fix n_ctx meaning (total vs per-sequence)
  • Check input batch for no coupled sequences when HT is on
  • Require n_embd_v_gqa(il) == const when FA is off (no longer needed)
  • Save/load state

Next PRs

  • Optimize parallel encoding via (split_equal + padding) and stream split [TAG_NO_CACHE_PAD]
  • Disable and remove the defrag code when ggml_set_rows() is fully adopted
  • Add option to llama-parallel to use different RNG seeds for the different clients

The github-actions bot added the examples, ggml, and Apple Metal labels on Jun 24, 2025.
JohannesGaessler (Collaborator) commented:

Right now I am comparatively less busy with my PhD so it would be a good time for me to write CUDA code that is still missing, if there is any.

ggerganov (Member, Author) commented on Jun 24, 2025:

For now, these are the necessary CUDA changes:

  • Add ggml_set_rows() support (needs a PR targeting ggml : add ggml_set_rows #14274; implementation can already start)
  • Extend ggml_flash_attn_ext() to support n_seq dim if it does not yet:
// old
    // q:    [n_embd_k, n_batch,     n_head,    1]
    // k:    [n_embd_k, n_kv,        n_head_kv, 1]
    // v:    [n_embd_v, n_kv,        n_head_kv, 1] !! not transposed !!
    // mask: [n_kv,     n_batch_pad, 1,         1] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
    // res:  [n_embd_v, n_head,      n_batch,   1] !! permuted !!
    GGML_API struct ggml_tensor * ggml_flash_attn_ext(
            ...);

// new - supports `n_seq` dimension:
    // q:    [n_embd_k, n_batch,     n_head,    n_seq]
    // k:    [n_embd_k, n_kv,        n_head_kv, n_seq]
    // v:    [n_embd_v, n_kv,        n_head_kv, n_seq] !! not transposed !!
    // mask: [n_kv,     n_batch_pad, n_seq,         1] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
    // res:  [n_embd_v, n_head,      n_batch,   n_seq] !! permuted !!
    GGML_API struct ggml_tensor * ggml_flash_attn_ext(
            ...);

CPU might also need to be extended (not sure yet)

  • Extend ggml_soft_max_ext() in a similar way to support the n_seq dim if it does not already. Also not sure about the CPU state.

Edit: the CPU versions of ggml_soft_max_ext() and ggml_flash_attn_ext() are now correct and can be used as a reference.
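
For reference, a minimal standalone sketch of what a multi-stream ggml_flash_attn_ext() call looks like shape-wise, once the broadcast support from #14435 is in place (illustrative sizes; it assumes the current (ctx, q, k, v, mask, scale, max_bias, logit_softcap) signature and only builds the op to inspect its shape, it does not run a backend):

#include <cmath>
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t n_embd_head = 128, n_head = 8, n_head_kv = 2;
    const int64_t n_batch = 64, n_kv = 256, n_seq = 4;

    // metadata-only context: no tensor data is allocated
    ggml_init_params ip = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ nullptr, /*no_alloc*/ true };
    ggml_context * ctx = ggml_init(ip);

    // q/k/v carry the stream count in ne[3]; the mask carries it in ne[2]
    ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_head, n_batch, n_head,    n_seq);
    ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_head, n_kv,    n_head_kv, n_seq);
    ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_head, n_kv,    n_head_kv, n_seq);

    ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16,
            n_kv, GGML_PAD(n_batch, GGML_KQ_MASK_PAD), n_seq, 1);

    // each stream attends only within its own slice of k/v
    ggml_tensor * out = ggml_flash_attn_ext(ctx, q, k, v, mask,
            1.0f/std::sqrt((float) n_embd_head), 0.0f, 0.0f);

    // expected result (permuted): [n_embd_head, n_head, n_batch, n_seq]
    printf("out: [%lld, %lld, %lld, %lld]\n",
            (long long) out->ne[0], (long long) out->ne[1], (long long) out->ne[2], (long long) out->ne[3]);

    ggml_free(ctx);
    return 0;
}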

Base automatically changed from gg/kv-cache-use-set-rows to master on July 3, 2025.
v_cells[s].resize(kv_size);
}

// by default, all sequence ids are mapped to the 0th virtual sequence
compilade (Collaborator) commented on Jul 3, 2025:

I'd like to understand the purpose of virtual sequences.

  • Is it to make the unified cache not unified?
    • Should it be a separate cache type instead?
  • Why is n_seq_virt a number and not a bool indicating whether or not the cache is unified?
    • Is it to eventually allow n_seq_max % n_seq_virt == 0 for a partially-unified cache?
  • Are virtual sequences intended to be used with other types of caches eventually (e.g. recurrent)?
    • The concept here seems specific to the self-attention KV cache (unless I'm misunderstanding).

ggerganov (Member, Author) replied on Jul 3, 2025:

Today I found a better term instead of "virtual sequences": "streams". So I'll use "streams" here and will update the code later today or tomorrow.

Is it to make the unified cache not unified?

Roughly yes. The user will be able to select between unified (i.e. single stream) or non-unified (multiple streams). Each mode has advantages in different scenarios. Single stream is good when the sequences share large common prefixes. Multiple streams are good when the sequences are mostly or completely independent from each other.

The first iteration will support 1 stream (i.e. same as master, vanilla unified KV cache) and n_seq_max streams. The latter means that each sequence id is assigned to a separate stream.

In theory, we could assign multiple sequence ids to the same stream to get a partially-unified KV cache, but this would need extra work and it might not have any useful applications. So out of scope for now.

Should it be a separate cache type instead?

There is too much similar logic. Still thinking about it, but most likely it will end up in the same cache type.

The concept here seems specific to the self-attention KV cache (unless I'm misunderstanding)

Yes.

Comment on lines 73 to 74
// if sequential == true, the tokens in the ubatch will have increasing sequential sequence ids
llama_ubatch split_equal(uint32_t n_ubatch, bool sequential);
A collaborator commented:

Why are sequential seq_ids required when virtual sequences are used?

Is it because a contiguous (along the virtual sequence dimension) slice of the KV cache is used?

I wonder if there could be a way to avoid this requirement with ggml_get_rows and/or ggml_mul_mat_id. Might not be worth the extra indirection, though.

ggerganov (Member, Author) replied:

Why are sequential seq_ids required when virtual sequences are used?

Is it because a contiguous (along the virtual sequence dimension) slice of the KV cache is used?

Yes, we make a view of the KV cache across the streams here:

ggml_tensor * llama_kv_cache_unified::get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const {
    const int32_t ikv = map_layer_ids.at(il);

    auto * k = layers[ikv].k;

    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;

    const uint64_t kv_size = get_size();

    return ggml_view_4d(ctx, k,
            hparams.n_embd_head_k, hparams.n_head_kv(il), n_kv, ns,
            ggml_row_size(k->type, hparams.n_embd_head_k),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il)),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il)*kv_size),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il)*kv_size)*sinfo.s0);
}

The ns var is the number of streams that participate in the current ubatch. Their stream indices are in the range [s0, s1].

I wonder if there could be a way to avoid this requirement with ggml_get_rows and/or ggml_mul_mat_id. Might not be worth the extra indirection, though.

It should be possible. But I'm not sure if it would be worth it - both in performance and in complexity. We can explore it, though.

@@ -45,7 +46,7 @@ llama_kv_cache_unified::llama_kv_cache_unified(
auto it = ctx_map.find(buft);
if (it == ctx_map.end()) {
ggml_init_params params = {
/*.mem_size =*/ size_t(2u*n_layer_cache*ggml_tensor_overhead()),
/*.mem_size =*/ size_t(2u*(1 + n_seq_virt)*n_layer_cache*ggml_tensor_overhead()),
A collaborator commented:

Is the 1 + intended? Why was it added?

ggerganov (Member, Author) replied:

For the per-stream views of the KV cache:

std::vector<ggml_tensor *> k_seq;
std::vector<ggml_tensor *> v_seq;

for (uint32_t s = 0; s < n_seq_virt; ++s) {
    k_seq.push_back(ggml_view_2d(ctx, k, n_embd_k_gqa, kv_size, k->nb[1], s*k->nb[2]));
    v_seq.push_back(ggml_view_2d(ctx, v, n_embd_v_gqa, kv_size, v->nb[1], s*v->nb[2]));
}

These are used to implement llama_memory_seq_cp(). This operation is no longer just an id reassignment - with multiple streams it performs an actual copy of the buffers in memory. Using these helper views, the operation is quite simple to implement:

bool is_full = true;

if (p0 > 0 && p0 + 1 < (int) get_size()) {
    is_full = false;
}

if (p1 > 0 && p1 + 1 < (int) get_size()) {
    is_full = false;
}

GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers");

//LLAMA_LOG_WARN("%s: copying KV buffer from %d (virt = %d) to %d (virt = %d)\n", __func__, seq_id_src, s0, seq_id_dst, s1);

for (uint32_t il = 0; il < layers.size(); ++il) {
    const auto & layer = layers[il];

    ggml_backend_tensor_copy(layer.k_seq[s0], layer.k_seq[s1]);
    ggml_backend_tensor_copy(layer.v_seq[s0], layer.v_seq[s1]);

    // TODO: do we need synchronization here?
}

// TODO: support this:
GGML_ASSERT(v_cells[s0].get_has_shift() == false && "cannot copy a KV buffer that has a pending shift");

v_cells[s1].reset();

for (uint32_t i = 0; i < v_cells[s0].size(); ++i) {
    if (v_cells[s0].seq_has(i, seq_id_src)) {
        v_cells[s1].pos_set(i, v_cells[s0].pos_get(i));
        v_cells[s1].seq_add(i, seq_id_dst);
    }
}

v_heads[s1] = v_heads[s0];

//for (uint32_t s = 0; s < n_seq_virt; ++s) {
//    LLAMA_LOG_WARN("%s: seq %d: min = %d, max = %d\n", __func__, s, v_cells[s].seq_pos_min(s), v_cells[s].seq_pos_max(s));
//}
}

Though we cannot copy partial sequences when using multiple streams.
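
From the public API side, copying one full sequence into another would look something like the sketch below (the llama_get_memory()/llama_memory_seq_cp() names follow the memory API referenced in the TODO list; p0 = p1 = -1 requests the full range - verify the exact signatures against llama.h):

#include "llama.h"

// given an initialized context, copy the full KV buffer of sequence 0 into sequence 1
// (with --attn-streams this performs an actual tensor copy between the two streams)
static void clone_sequence(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);

    llama_memory_seq_cp(mem, /*seq_id_src*/ 0, /*seq_id_dst*/ 1, /*p0*/ -1, /*p1*/ -1);
}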

Comment on lines 498 to 501
// accept only increasing sequence ids
if (sequential) {
    add = add && (cur_seq_set.empty() || batch.seq_id[i][0] == last_seq_id + 1);
}
A collaborator commented:

What about decreasing sequence ids? Is the requirement that they are increasing, or that the included seq_ids should be in a contiguous range?

(decreasing sequence ids might not really happen often in practice though)

ggerganov (Member, Author) replied:

Decreasing would also work - we just need a contiguous range. We can either add this, if there is an elegant way to search for it, add a batch pre-processing step that moves the complexity to a higher level, or delegate it to the user by warning when the batch is not arranged optimally.
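
One possible pre-check on the user side - a purely hypothetical helper, not part of this PR - is to verify that the primary seq_ids in a batch form one contiguous range before submitting it:

#include <set>
#include "llama.h"

// hypothetical helper: returns true if the primary seq_id of every token in the
// batch falls in one contiguous range of ids (e.g. {3,4,5}), which is what the
// sequential split_equal() path expects; assumes every token has >= 1 seq_id
static bool batch_seq_ids_contiguous(const llama_batch & batch) {
    std::set<llama_seq_id> ids;

    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        ids.insert(batch.seq_id[i][0]);
    }

    if (ids.empty()) {
        return true;
    }

    // std::set is ordered, so the range is contiguous iff max - min + 1 == count
    return *ids.rbegin() - *ids.begin() + 1 == (llama_seq_id) ids.size();
}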

ggerganov marked this pull request as ready for review on July 4, 2025.
ggerganov (Member, Author) commented:

@slaren PTAL - any suggestions are welcome. Note there is currently no way to test this on non-Apple hardware until the necessary operators are implemented by the backends.
