Hi~ Spatial-TTT is excellent work!
I have a question regarding the description in Section 3.4 of the paper: "At inference time, we employ a dual KV cache mechanism for constant-memory streaming. The first is a sliding window KV cache of fixed length w for local context modeling in sliding window attention; when the cache exceeds the window size, the earliest entries are discarded. The second is a TTT pending KV cache that accumulates key-value pairs for fast weight updates: it starts empty and grows as new tokens arrive; whenever its length reaches the chunk size b, these KV pairs are used to perform one fast weight update, after which the pending cache is cleared."
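To make sure I'm reading this correctly, here is a minimal sketch of how I understand the described dual-cache streaming logic. All names here (`DualKVCache`, `fast_weight_update`, etc.) are my own placeholders and not identifiers from the Spatial-TTT codebase; please correct me if the behavior differs.

```python
from collections import deque

class DualKVCache:
    """Sketch of the dual KV cache described in Section 3.4 (my reading, not the official code)."""

    def __init__(self, window_size, chunk_size):
        self.window_size = window_size   # w: sliding window length for SWA
        self.chunk_size = chunk_size     # b: chunk size for TTT fast weight updates
        self.swa_cache = deque()         # sliding window KV cache
        self.ttt_pending = []            # pending KV pairs awaiting the next fast weight update

    def step(self, k, v, fast_weight_update):
        # Sliding window cache: append the new KV pair, discard the earliest
        # entry once the cache exceeds the window size.
        self.swa_cache.append((k, v))
        if len(self.swa_cache) > self.window_size:
            self.swa_cache.popleft()

        # TTT pending cache: accumulate KV pairs; when the pending cache reaches
        # the chunk size b, perform one fast weight update and then clear it.
        self.ttt_pending.append((k, v))
        if len(self.ttt_pending) == self.chunk_size:
            fast_weight_update(self.ttt_pending)
            self.ttt_pending = []
```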
The paper only describes the KV cache handling for the SWA and TTT layers; how is the KV cache for the full attention layers handled at inference time?