KV cache compaction for llama.cpp
Longer local sessions without throwing away context.
modelai-llama.cpp is a production fork of llama.cpp that compacts the KV cache instead of dropping older history when memory fills. For users, that means longer local sessions with less forgetting. For developers, it adds compaction pipelines, auto-compaction, persistence, and public server and C API surfaces while staying close to upstream.
Current public benchmarks show baseline decode parity with upstream, a best-published +63% decode gain at 8K context / 8x compression on Qwen3-8B, and a documented iterative 256K-from-64K context-extension result. The benchmark page carries the full tables, including the negative rows.
MIT License
Drop-in llama.cpp fork
17 models, 5 architectures
Quality
0.997
Qwen3-8B, select pipeline, verified in docs/benchmark-fork-vs-upstream.md.
Best measured decode gain
+63%
Qwen3-8B at 8K context and 8x compression on Apple M2 Pro, 32GB.
Iterative extension result
256K
Specific on-policy test from 64K physical memory, documented in the CHANGELOG.
Baseline overhead
~0%
Fork matches upstream decode within 2% in the published three-way runtime benchmark.
The problem
Long-running local sessions have two separate failure modes. First, the runtime eventually runs out of KV cache and starts forgetting old context. Second, even before memory is exhausted, decode gets slower as the model attends over a larger history. That is why local sessions often feel fine early and then degrade as they get longer.
The approach
Compaction changes the failure mode. Instead of deleting old context, the fork compresses it into a smaller representation designed to preserve the same attention behavior. The result is not a subset of the original cache; it is a replacement intended to keep the model's behavior close to the uncompressed path while freeing memory.
1
Fill
Session history grows until the KV cache approaches its limit.
2
Compact
Older KV state is compressed rather than discarded.
3
Continue
Freed KV slots accept new tokens while the compacted prefix remains available.
4
Repeat
Iterative refinement can extend effective context far beyond the physical window.
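Mapped onto the fork's public C API, the loop above looks roughly like the sketch below. Only llama_compact_default_params and llama_kv_cache_compact come from the documented surface; the headroom trigger and the n_past bookkeeping are illustrative assumptions, not the fork's internal policy.

// Sketch of the fill -> compact -> continue loop. The two compaction calls
// are the fork's documented C API; everything else is an assumption.
#include "llama.h"

static int decode_with_compaction(struct llama_context * ctx,
                                  struct llama_batch      batch,
                                  int                   * n_past) {
    const int n_ctx = (int) llama_n_ctx(ctx);

    // Fill: when the next batch would no longer fit, compact instead of evicting.
    if (*n_past + batch.n_tokens > n_ctx) {
        struct llama_compact_params params = llama_compact_default_params();
        llama_kv_cache_compact(ctx, params);
        // After compaction the caller's position bookkeeping must be
        // refreshed; the real contract lives in
        // docs/kv-compaction-integration.md and is not reproduced here.
    }

    // Continue: freed KV slots accept the new tokens.
    const int rc = llama_decode(ctx, batch);
    *n_past += batch.n_tokens;
    return rc;
}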
What the current evidence shows
What looks strong today
- Baseline decode parity with upstream llama.cpp on the published three-way test.
- Meaningful decode gains at higher compression ratios and longer contexts.
- Select-pipeline quality staying in the 0.997-0.999 range for the strongest published cases.
- An iterative context-extension test reaching 256K effective context from 64K physical memory.
What needs to be read carefully
- Speedup depends heavily on context length and compression ratio.
- Published data includes slowdowns, including -31% at 16K / 2x on Qwen3-8B.
- The 50x / 256K result is a specific iterative on-policy experiment, not the default single-pass operating point.
- Quality and performance claims on this site are limited to the explicitly cited benchmark sources.
What is in the fork
Runtime surface
Same build flow as llama.cpp, server-side compaction endpoint, and C API entry points for programmatic control.
Execution options
Multiple compaction pipelines, with select positioned as the production default because the current docs describe it as the fastest and most Flash-Attention-friendly path.
Platform focus
Apple Silicon is the published benchmark platform today, with Metal GPU acceleration and benchmark data collected on an M2 Pro, 32GB machine.
State handling
Compacted state can survive save/restore, which matters for long-running sessions and server workflows (see the sketch at the end of this section).
Model coverage
17 models tested, 15 passing the current quality gate in docs/HIGHLIGHTS.md.
Engineering process
34 documented bug fixes, adversarial review, active CI, and an upstream-sync lane with local compatibility fixes are all part of the public repo history.
Recent runtime compatibility fixes in the fork include Qwen/Gemma tool-call parser hardening, the Qwen3.5 long-input tokenizer crash fix, short-session checkpoint reuse for hybrid/SWA server turns, and Metal mixed q8_0/q4-q5 Flash-Attention KV support. The corresponding upstream issues are still open; the fixes ship in this fork now and are documented in QA.
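The save/restore claim above maps onto upstream's session-file API. A minimal sketch, assuming the fork persists compacted state through the standard llama_state_save_file path; whether extra fork-side setup is required is a question for docs/kv-compaction-integration.md.

// Sketch: checkpoint a session at its compacted footprint. The assumption
// is that compacted state rides through upstream's standard session-file
// call; only the two compaction calls are from the fork's documented surface.
#include "llama.h"

static bool checkpoint_session(struct llama_context * ctx,
                               const llama_token     * tokens,
                               size_t                  n_tokens) {
    // Compact first so the checkpoint is written at the smaller footprint.
    struct llama_compact_params params = llama_compact_default_params();
    llama_kv_cache_compact(ctx, params);

    // Upstream session-file API; the fork's claim is that compacted
    // state survives this round trip.
    return llama_state_save_file(ctx, "session.bin", tokens, n_tokens);
}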
Using the fork
The project is intended to stay close to upstream llama.cpp ergonomics. Build it the normal way, then enable compaction when you want it.
# Build as usual, then start the server with the compaction endpoint enabled.
cmake -B build
cmake --build build -j
./build/bin/llama-server -m model.gguf --endpoint-compact -c 8192

# Trigger compaction over HTTP: 4x ratio, select pipeline, reclaim freed slots.
curl -X POST http://localhost:8080/compact \
  -H "Content-Type: application/json" \
  -d '{"ratio":4.0,"method":"select","reclaim":true}'

// Or drive it programmatically through the C API.
struct llama_compact_params params = llama_compact_default_params();
llama_kv_cache_compact(ctx, params);
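The knobs in the HTTP body presumably surface on the params struct as well. In the sketch below, the field names ratio, method, and reclaim simply mirror the JSON keys and are assumptions, not confirmed struct members.

// Hypothetical field names mirroring the /compact JSON body; only the two
// function calls are from the documented C surface.
struct llama_compact_params params = llama_compact_default_params();
params.ratio   = 4.0f;      // assumed field: compression ratio, as in the curl body
params.method  = "select";  // assumed field: pipeline choice; select is the documented default
params.reclaim = true;      // assumed field: hand freed KV slots back for new tokens
llama_kv_cache_compact(ctx, params);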
Implementation details and internal file layout: docs/kv-compaction-integration.md.
Why the numbers are credible
Every claim on this site traces to one of three source documents:
| Source | What it supports |
| --- | --- |
| docs/benchmark-fork-vs-upstream.md | Three-way baseline comparison, compacted decode tables, and quality heatmap on Apple M2 Pro 32GB. |
| docs/CHANGELOG.md | Specific iterative on-policy 50x / 256K result and the fact-recall checkpoint record. |
| docs/HIGHLIGHTS.md | Quality range, compaction time range, test/bug summary, and high-level implementation scope. |
Read next
| Page | Purpose |
| --- | --- |
| Benchmarks | Full published benchmark tables, including the slowdown rows and methodology notes. |
| Paper | Accessible explanation of Attention Matching and what the production fork adds beyond the paper. |
| QA | Adversarial review protocol, CI/test structure, and upstream-sync process. |