KV cache compaction for llama.cpp
Longer local sessions without throwing away context.
modelai-llama.cpp is a production fork of llama.cpp that compacts the KV cache instead of dropping older history when memory fills. For users, that means longer local sessions with less forgetting. For developers, it adds compaction pipelines, auto-compaction, persistence, and public server and C API surfaces while staying close to upstream.
Current public benchmarks show baseline decode parity with upstream, a best-published +63% decode gain at 8K context / 8x compression on Qwen3-8B, and a documented iterative 256K-from-64K context-extension result. The benchmark page carries the full tables, including the negative rows.
MIT License
Drop-in llama.cpp fork
17 models, 5 architectures
Quality
0.997
Qwen3-8B, select pipeline, verified in docs/benchmark-fork-vs-upstream.md.
Best measured decode gain
+63%
Qwen3-8B at 8K context and 8x compression on Apple M2 Pro, 32GB.
Iterative extension result
256K
Specific on-policy test from 64K physical memory, documented in the CHANGELOG.
Baseline overhead
~0%
Fork matches upstream decode within 2% in the published three-way runtime benchmark.
The problem
Long-running local sessions have two separate failure modes. First, the runtime eventually runs out of KV cache and starts forgetting old context. Second, even before memory is exhausted, decode gets slower as the model attends over a larger history. That is why local sessions often feel fine early and then degrade as they get longer.
The approach
Compaction changes the failure mode. Instead of deleting old context, the fork compresses it into a smaller representation designed to preserve the same attention behavior. The result is not a subset of the original cache; it is a replacement intended to keep the model's behavior close to the uncompressed path while freeing memory.
1
Fill
Session history grows until the KV cache approaches its limit.
2
Compact
Older KV state is compressed rather than discarded.
3
Continue
Freed KV slots accept new tokens while the compacted prefix remains available.
4
Repeat
Iterative refinement can extend effective context far beyond the physical window.
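Mapped onto the fork's public C API, the loop above looks roughly like the sketch below. Only llama_compact_default_params and llama_kv_cache_compact come from the documented surface; the headroom trigger and the n_past bookkeeping are illustrative assumptions, not the fork's internal policy.

// Sketch of the fill -> compact -> continue loop. The two compaction calls
// are the fork's documented C API; everything else is an assumption.
#include "llama.h"

static int decode_with_compaction(struct llama_context * ctx,
                                  struct llama_batch      batch,
                                  int                   * n_past) {
    const int n_ctx = (int) llama_n_ctx(ctx);

    // Fill: when the next batch would no longer fit, compact instead of evicting.
    if (*n_past + batch.n_tokens > n_ctx) {
        struct llama_compact_params params = llama_compact_default_params();
        llama_kv_cache_compact(ctx, params);
        // After compaction the caller's position bookkeeping must be
        // refreshed; the real contract lives in
        // docs/kv-compaction-integration.md and is not reproduced here.
    }

    // Continue: freed KV slots accept the new tokens.
    const int rc = llama_decode(ctx, batch);
    *n_past += batch.n_tokens;
    return rc;
}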
What the current evidence shows
What looks strong today
- Baseline decode parity with upstream llama.cpp on the published three-way test.
- Meaningful decode gains at higher compression ratios and longer contexts.
- Select-pipeline quality staying in the 0.997-0.999 range for the strongest published cases.
- An iterative context-extension test reaching 256K effective context from 64K physical memory.
What needs to be read carefully
- Speedup depends heavily on context length and compression ratio.
- Published data includes slowdowns, including -31% at 16K / 2x on Qwen3-8B.
- The 50x / 256K result is a specific iterative on-policy experiment, not the default single-pass operating point.
- Quality and performance claims on this site are limited to the explicitly cited benchmark sources.
What is in the fork
Runtime surface
Same build flow as llama.cpp, server-side compaction endpoint, and C API entry points for programmatic control.
Execution options
Multiple compaction pipelines, with select positioned as the production default because the current docs describe it as the fastest and most Flash-Attention-friendly path.
Platform focus
Apple Silicon is the published benchmark platform today, with Metal GPU acceleration and benchmark data collected on an M2 Pro, 32GB machine.
State handling
Compacted state can survive save/restore, which matters for long-running sessions and server workflows (see the sketch at the end of this section).
Model coverage
17 models tested, 15 passing the current quality gate in docs/HIGHLIGHTS.md.
Engineering process
34 documented bug fixes, adversarial review, active CI, and an upstream-sync lane with local compatibility fixes are all part of the public repo history.
Recent runtime compatibility fixes in the fork include Qwen/Gemma tool-call parser hardening, the Qwen3.5 long-input tokenizer crash fix, short-session checkpoint reuse for hybrid/SWA server turns, and Metal mixed q8_0/q4-q5 Flash-Attention KV support. The corresponding upstream issues are still open; the fixes ship in this fork now and are documented in QA.
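The save/restore claim above maps onto upstream's session-file API. A minimal sketch, assuming the fork persists compacted state through the standard llama_state_save_file path; whether extra fork-side setup is required is a question for docs/kv-compaction-integration.md.

// Sketch: checkpoint a session at its compacted footprint. The assumption
// is that compacted state rides through upstream's standard session-file
// call; only the two compaction calls are from the fork's documented surface.
#include "llama.h"

static bool checkpoint_session(struct llama_context * ctx,
                               const llama_token     * tokens,
                               size_t                  n_tokens) {
    // Compact first so the checkpoint is written at the smaller footprint.
    struct llama_compact_params params = llama_compact_default_params();
    llama_kv_cache_compact(ctx, params);

    // Upstream session-file API; the fork's claim is that compacted
    // state survives this round trip.
    return llama_state_save_file(ctx, "session.bin", tokens, n_tokens);
}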
Using the fork
The project is intended to stay close to upstream llama.cpp ergonomics. Build it the normal way, then enable compaction when you want it.
# Build as usual, then start the server with the compaction endpoint enabled.
cmake -B build
cmake --build build -j
./build/bin/llama-server -m model.gguf --endpoint-compact -c 8192

# Trigger compaction over HTTP: 4x ratio, select pipeline, reclaim freed slots.
curl -X POST http://localhost:8080/compact \
  -H "Content-Type: application/json" \
  -d '{"ratio":4.0,"method":"select","reclaim":true}'

// Or drive it programmatically through the C API.
struct llama_compact_params params = llama_compact_default_params();
llama_kv_cache_compact(ctx, params);
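The knobs in the HTTP body presumably surface on the params struct as well. In the sketch below, the field names ratio, method, and reclaim simply mirror the JSON keys and are assumptions, not confirmed struct members.

// Hypothetical field names mirroring the /compact JSON body; only the two
// function calls are from the documented C surface.
struct llama_compact_params params = llama_compact_default_params();
params.ratio   = 4.0f;      // assumed field: compression ratio, as in the curl body
params.method  = "select";  // assumed field: pipeline choice; select is the documented default
params.reclaim = true;      // assumed field: hand freed KV slots back for new tokens
llama_kv_cache_compact(ctx, params);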
Implementation details and internal file layout: docs/kv-compaction-integration.md.
Why the numbers are credible
Every claim on this site traces to one of three source documents:
| Source | What it supports |
| --- | --- |
| docs/benchmark-fork-vs-upstream.md | Three-way baseline comparison, compacted decode tables, and quality heatmap on Apple M2 Pro 32GB. |
| docs/CHANGELOG.md | Specific iterative on-policy 50x / 256K result and the fact-recall checkpoint record. |
| docs/HIGHLIGHTS.md | Quality range, compaction time range, test/bug summary, and high-level implementation scope. |
Read next
| Page | Purpose |
| --- | --- |
| Benchmarks | Full published benchmark tables, including the slowdown rows and methodology notes. |
| Paper | Accessible explanation of Attention Matching and what the production fork adds beyond the paper. |
| QA | Adversarial review protocol, CI/test structure, and upstream-sync process. |