Fast KV Compaction via Attention Matching asks a practical question: if a model has already seen a long prefix, can you replace that large KV cache with a much smaller one that makes the model behave almost the same way? The paper’s answer is yes, if the compacted cache is optimized to preserve the attention-weighted output rather than simply keeping a subset of tokens.
Most long-context management strategies throw information away. Attention Matching instead tries to build a smaller synthetic memory that behaves like the larger original memory. The model does not get the old tokens back verbatim. It gets a compacted prefix designed to preserve the attention pattern that matters downstream.
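The core idea can be sketched in a few lines of NumPy: treat the compacted keys and values as free parameters and optimize them so that attention over the small cache reproduces the attention-weighted output of the full cache for a set of probe queries. This is a minimal illustrative sketch, not the paper's reference implementation; the probe-query setup, initialization, and plain gradient descent are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention; returns output and weights.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V, A

def compact_kv(Q, K, V, k, steps=300, lr=0.5, seed=0):
    """Optimize a k-entry synthetic (K', V') so that attention over the
    compacted cache matches the full cache's output on probe queries Q.
    Illustrative only: real implementations differ in probes, init, optimizer."""
    rng = np.random.default_rng(seed)
    d = K.shape[1]
    # Initialize from a random subset of the original entries (an assumption).
    idx = rng.choice(K.shape[0], size=k, replace=False)
    Kc, Vc = K[idx].copy(), V[idx].copy()
    target, _ = attention(Q, K, V)
    for _ in range(steps):
        O, A = attention(Q, Kc, Vc)
        dO = 2.0 * (O - target) / O.size          # grad of mean-squared loss
        dV = A.T @ dO                              # backprop through A @ Vc
        dA = dO @ Vc.T
        dS = A * (dA - (dA * A).sum(-1, keepdims=True))  # softmax Jacobian
        dK = dS.T @ Q / np.sqrt(d)                 # backprop through Q @ Kc.T
        Kc -= lr * dK
        Vc -= lr * dV
    return Kc, Vc
```

The point of the sketch is the objective, not the optimizer: the compacted entries are not a token subset, they are synthetic vectors fitted to preserve downstream attention behavior.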
The paper gives the algorithmic idea and a Python reference path. A production fork has to integrate that idea into a real inference engine, handle architecture edge cases, and fail safely under long-running server workloads.
| Dimension | Paper / reference | Fork implementation |
|---|---|---|
| Runtime language | Python reference code | Integrated C++ implementation inside llama.cpp runtime paths |
| Execution surface | Research code path | Server endpoint, C API entry points, auto-compaction, save/restore state |
| Architecture coverage | Core attention path | Standard attention plus documented handling for iSWA and hybrid memory layouts |
| Operational behavior | Not the focus | Crash handling, QA gates, upstream sync, contract tests, and benchmark reporting |
| Published defaults | Algorithmic exploration | Select pipeline framed as the practical default in the current docs |
The fork’s paper-comparison and bug-history docs record three issues found while translating the method into a production runtime; see docs/BUGS-AND-FIXES.md and docs/paper-comparison-2602.16284.md for the full trail.
The hard part was not transcribing the paper’s equations into C++. It was integrating compacted-prefix execution into KV cache management, graph construction, runtime gating, and server lifecycle without regressing the normal path. That is why the repo carries dedicated integration docs, bug history, and QA process documentation alongside the algorithm notes.
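One concrete flavor of that integration work is the auto-compaction gate: compaction must fire only when cache pressure is high and the prefix is long enough for compaction to pay for itself, or it degrades the normal serving path. The sketch below is hypothetical; the function name, parameters, and thresholds are illustrative, not the fork's actual API.

```python
def should_compact(used_cells, total_cells, min_prefix_cells, pressure=0.85):
    """Hypothetical auto-compaction gate (names and defaults are illustrative).

    Returns True only when the KV cache is nearly full AND the resident
    prefix is long enough that compacting it is worth the optimization cost.
    """
    if total_cells <= 0:
        return False          # no cache allocated; nothing to compact
    if used_cells < min_prefix_cells:
        return False          # prefix too short: compaction overhead dominates
    return used_cells / total_cells >= pressure
```

Gating on both pressure and prefix length keeps short-context requests entirely on the untouched fast path, which is the kind of fail-safe behavior the fork's server integration has to guarantee.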
| Document | Why read it |
|---|---|
| arXiv:2602.16284 | The original Attention Matching paper. |
| paper-comparison-2602.16284.md | Fork-specific comparison against the paper’s reference implementation. |
| kv-compaction-algorithm.md | Algorithm stages and implementation notes. |
| kv-compaction-integration.md | File map, runtime integration points, architecture support matrix, and compile-time guards. |