Paper questions about kv-cache #2

@yyysst

Description

Hi! Spatial-TTT is excellent work!

I have a question regarding the description in Section 3.4 of the paper: "At inference time, we employ a dual KV cache mechanism for constant-memory streaming. The first is a sliding window KV cache of fixed length w for local context modeling in sliding window attention; when the cache exceeds the window size, the earliest entries are discarded. The second is a TTT pending KV cache that accumulates key-value pairs for fast weight updates: it starts empty and grows as new tokens arrive; whenever its length reaches the chunk size b, these KV pairs are used to perform one fast weight update, after which the pending cache is cleared."

The paper describes the KV cache handling only for the SWA and TTT layers; how is the KV cache for the full attention layer handled?
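To check my understanding of the quoted mechanism, here is a minimal sketch of the dual-cache logic as I read it. All names (`DualKVCache`, `step`, `fast_weight_update`) are my own hypothetical placeholders, not the actual Spatial-TTT implementation:

```python
from collections import deque

class DualKVCache:
    """Sketch of the dual KV cache from Section 3.4 (my reading, not the official code)."""

    def __init__(self, window_size: int, chunk_size: int):
        self.w = window_size      # sliding window length w
        self.b = chunk_size       # TTT chunk size b
        self.swa_cache = deque()  # sliding window KV cache for SWA
        self.pending = []         # TTT pending KV cache

    def step(self, kv, fast_weight_update):
        # 1) Sliding window cache: append the new KV pair and discard
        #    the earliest entry once the cache exceeds the window size.
        self.swa_cache.append(kv)
        if len(self.swa_cache) > self.w:
            self.swa_cache.popleft()

        # 2) TTT pending cache: accumulate KV pairs; once b pairs are
        #    collected, perform one fast weight update and clear it.
        self.pending.append(kv)
        if len(self.pending) == self.b:
            fast_weight_update(self.pending)
            self.pending.clear()

# Tiny demo: window of 4 tokens, fast weight update every 2 tokens.
if __name__ == "__main__":
    cache = DualKVCache(window_size=4, chunk_size=2)
    for t in range(6):
        cache.step((f"k{t}", f"v{t}"),
                   fast_weight_update=lambda chunk: print("update on", chunk))
```

If this reading is correct, both caches stay bounded (w and at most b entries), so memory is constant during streaming. But neither rule seems to apply to a full attention layer, whose cache would grow without bound.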
