Hi~ Spatial-TTT is excellent work!
I have a question regarding the description in Section 3.4 of the paper: "At inference time, we employ a dual KV cache mechanism for constant-memory streaming. The first is a sliding window KV cache of fixed length w for local context modeling in sliding window attention; when the cache exceeds the window size, the earliest entries are discarded. The second is a TTT pending KV cache that accumulates key-value pairs for fast weight updates: it starts empty and grows as new tokens arrive; whenever its length reaches the chunk size b, these KV pairs are used to perform one fast weight update, after which the pending cache is cleared."
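To make sure I'm reading this correctly, here is a minimal sketch of how I understand the described dual-cache streaming logic. All names here (`DualKVCache`, `fast_weight_update`, etc.) are my own placeholders and not identifiers from the Spatial-TTT codebase; please correct me if the behavior differs.

```python
from collections import deque

class DualKVCache:
    """Sketch of the dual KV cache described in Section 3.4 (my reading, not the official code)."""

    def __init__(self, window_size, chunk_size):
        self.window_size = window_size   # w: sliding window length for SWA
        self.chunk_size = chunk_size     # b: chunk size for TTT fast weight updates
        self.swa_cache = deque()         # sliding window KV cache
        self.ttt_pending = []            # pending KV pairs awaiting the next fast weight update

    def step(self, k, v, fast_weight_update):
        # Sliding window cache: append the new KV pair, discard the earliest
        # entry once the cache exceeds the window size.
        self.swa_cache.append((k, v))
        if len(self.swa_cache) > self.window_size:
            self.swa_cache.popleft()

        # TTT pending cache: accumulate KV pairs; when the pending cache reaches
        # the chunk size b, perform one fast weight update and then clear it.
        self.ttt_pending.append((k, v))
        if len(self.ttt_pending) == self.chunk_size:
            fast_weight_update(self.ttt_pending)
            self.ttt_pending = []
```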
The paper only describes the KV cache handling for the SWA and TTT layers; how is the KV cache for the full attention layers handled at inference time?