Production fork of llama.cpp adding KV cache compaction, Metal GPU acceleration, multi-architecture support, and production-grade QA infrastructure. Zero regression on upstream performance.
MIT License · 463+ commits ahead · 29 bugs fixed

Based on "Fast KV Compaction via Attention Matching" (Zweiger et al., MIT Han Lab). Instead of truncating or evicting old context, compaction compresses the KV cache into a smaller learned representation that preserves attention behavior.
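Paraphrasing that idea in our own notation (a sketch of the objective, not the paper's exact formulation): given keys and values $K, V \in \mathbb{R}^{n \times d}$ and a compacted size $m \ll n$, find $\tilde{K}, \tilde{V} \in \mathbb{R}^{m \times d}$ such that for representative queries $q$

$$\operatorname{softmax}\!\left(\frac{q \tilde{K}^\top}{\sqrt{d}}\right) \tilde{V} \;\approx\; \operatorname{softmax}\!\left(\frac{q K^\top}{\sqrt{d}}\right) V,$$

so downstream attention behaves as if the full cache were still present.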
| Pipeline | Speed | Quality | Use Case |
|---|---|---|---|
| `select` | Fast (62-185 ms) | 0.946-0.999 | Production default, flash-compatible |
| `solver` | Medium | Higher | Full beta + V fitting |
| `omp` | Slow | Highest | Orthogonal Matching Pursuit |
| `self_study` | Slow | High | On-policy Q generation via autoregressive continuation |
| `chunked` | Medium | High | Long context (>8K prefix) |
| `on_policy` | Slow | Highest | Iterative quality-gated refinement |
| `sequential` | Very slow | Highest | Per-layer sequential on-policy |
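The `omp` pipeline is named for Orthogonal Matching Pursuit. As a reference point only, here is a generic textbook OMP sketch (not the fork's solver; all names are ours): greedily pick the atom most correlated with the residual, then refit every selected coefficient by least squares.

```cpp
#include <cmath>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // dictionary: Mat[j] = atom j (length d)

static double dot(const Vec& a, const Vec& b) {
    double s = 0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solve the small dense system A x = b by Gaussian elimination with partial pivoting.
static Vec solve(Mat A, Vec b) {
    size_t n = b.size();
    for (size_t c = 0; c < n; ++c) {
        size_t p = c;
        for (size_t r = c + 1; r < n; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[p][c])) p = r;
        std::swap(A[c], A[p]); std::swap(b[c], b[p]);
        for (size_t r = c + 1; r < n; ++r) {
            double f = A[r][c] / A[c][c];
            for (size_t k = c; k < n; ++k) A[r][k] -= f * A[c][k];
            b[r] -= f * b[c];
        }
    }
    Vec x(n);
    for (size_t i = n; i-- > 0;) {
        double s = b[i];
        for (size_t k = i + 1; k < n; ++k) s -= A[i][k] * x[k];
        x[i] = s / A[i][i];
    }
    return x;
}

// Returns (atom index, coefficient) pairs approximating y with `sparsity` atoms.
std::vector<std::pair<int, double>> omp(const Mat& D, const Vec& y, int sparsity) {
    Vec r = y;                  // residual
    std::vector<int> support;   // indices of selected atoms
    Vec coef;
    for (int it = 0; it < sparsity; ++it) {
        // Selection: atom most correlated with the current residual.
        int best = -1; double bestc = 0;
        for (int j = 0; j < (int)D.size(); ++j) {
            bool used = false;
            for (int s : support) if (s == j) used = true;
            if (used) continue;
            double c = std::fabs(dot(D[j], r));
            if (c > bestc) { bestc = c; best = j; }
        }
        if (best < 0) break;    // residual is orthogonal to all remaining atoms
        support.push_back(best);
        // Refit: normal equations (Ds^T Ds) x = Ds^T y over the support.
        size_t m = support.size();
        Mat A(m, Vec(m)); Vec b(m);
        for (size_t a = 0; a < m; ++a) {
            b[a] = dot(D[support[a]], y);
            for (size_t c = 0; c < m; ++c) A[a][c] = dot(D[support[a]], D[support[c]]);
        }
        coef = solve(A, b);
        // Recompute residual: r = y - Ds * coef.
        r = y;
        for (size_t a = 0; a < m; ++a)
            for (size_t i = 0; i < y.size(); ++i)
                r[i] -= coef[a] * D[support[a]][i];
    }
    std::vector<std::pair<int, double>> out;
    for (size_t a = 0; a < support.size(); ++a) out.push_back({support[a], coef[a]});
    return out;
}
```

The orthogonal refit step is what distinguishes OMP from plain matching pursuit: each iteration re-solves all selected coefficients jointly, so the residual stays orthogonal to the chosen atoms.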
Complete Attention Matching solver with zero external dependencies. Builds on any platform CMake supports.
| Architecture | Status | Example Models |
|---|---|---|
| Standard attention | Supported | Llama, Qwen2.5, Mistral, DeepSeek-R1, Aya, Granite |
| iSWA (interleaved sliding window) | Supported | Qwen3-8B, Qwen3-14B, Qwen3-30B-A3B |
| Hybrid SSM+Attention | Supported | Granite3.1-Dense-8B |
| Hybrid-iSWA | Supported | Hybrid + interleaved sliding window |
| IMROPE (multi-resolution RoPE) | Supported | Qwen3.5, Qwen3.5-MOE |
| Pure recurrent (Mamba/RWKV) | N/A | No KV cache |
- `llama_kv_cache_compact()` — trigger compaction
- `llama_kv_cache_set_auto_compact()` — threshold-based auto-trigger
- `llama_kv_cache_compact_info()` — query compaction state
With `--compact-threshold` set, compaction fires automatically when KV fill exceeds it. Session persistence covers compaction state (including the `is_imrope` flag), with full session recovery across server restarts. Server extensions: `/compact` REST endpoint, compaction state in `/props`, and Prometheus gauges (`llamacpp:modelai_active_n_kv_total`, `llamacpp:modelai_active_n_kv_max`, `llamacpp:modelai_sequence_state_bytes_total`).
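The auto-trigger condition described above reduces to a fill-ratio check. A minimal self-contained sketch of that decision logic (names are illustrative, not the fork's internals; the real trigger lives behind `llama_kv_cache_set_auto_compact()`):

```cpp
// Hypothetical sketch: decide whether auto-compaction should fire.
// Mirrors the documented rule: fire when KV fill exceeds --compact-threshold.
struct KvCacheState {
    int n_kv_total; // cells currently occupied (cf. llamacpp:modelai_active_n_kv_total)
    int n_kv_max;   // cache capacity        (cf. llamacpp:modelai_active_n_kv_max)
};

bool should_auto_compact(const KvCacheState& kv, double compact_threshold) {
    if (kv.n_kv_max <= 0) return false;                      // uninitialized cache
    double fill = (double)kv.n_kv_total / (double)kv.n_kv_max; // fraction in use
    return fill > compact_threshold;                         // strictly exceeds
}
```

With a threshold of 0.85, a cache at 900/1000 cells would trigger compaction while one at 800/1000 would not.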
Every bug found through adversarial review, fixed with root-cause analysis, and verified with regression tests. Full list with commit SHAs.
Notable fixes: post-compaction crash on stale prompt cache, nonuniform NaN propagation, 80+ Metal GPU sync stalls per decode, OMP infinite loop at high compression, KV shift position wrap, quantized V misalignment.
| Workflow | Platform | Trigger |
|---|---|---|
| modelai-ci | macOS (Apple Silicon) | Push, PR |
| modelai-server-smoke | macOS | Push, PR |
| modelai-perf-smoke | macOS | Push, PR |
| modelai-ci-windows | Windows (MSVC x64) | Push, PR |
| modelai-upstream-sync | macOS | Saturday 2PM PDT |
| modelai-dashboard | macOS | After CI success |
| modelai-auto-label | GitHub | Issues, PRs |
Weekly automated sync from ggml-org/llama.cpp. Every merge is CI-gated: build + full test suite + benchmark regression check must pass. Changes touching KV cache or attention paths trigger manual review before integration. Full process details.
6 upstream KV cache changes audited and verified compatible. RPC RCE security patch synced within hours of upstream disclosure.
17 models tested, 15 pass the 0.95 logit cosine similarity quality gate (select pipeline).
| Model | Architecture | Quality Gate |
|---|---|---|
| Qwen3-8B | iSWA | Pass (0.997) |
| Qwen3-14B | iSWA | Pass (0.992-0.999) |
| Qwen3-30B-A3B | iSWA (MoE) | Pass (0.999) |
| Qwen2.5-Coder-14B | Standard | Pass |
| Qwen2.5-7B | Standard | Pass |
| Qwen2.5-14B | Standard | Pass |
| DeepSeek-R1-14B | Standard | Pass (0.993-0.999) |
| DeepSeek-R1-8B | Standard | Pass |
| Phi4-14B | Standard | Pass |
| Mistral-7B | Standard | Pass |
| CodeGemma-7B | Standard | Pass |
| Granite3.1-Dense-8B | Hybrid SSM | Pass |
| Llama3.1-8B | Standard | Pass |
| Llama-3.2-3B | Standard | Pass |
| Aya-Expanse-8B | Standard | Pass |
| Gemma2-9B | Standard | Pass |
| TinyLlama-1.1B | Standard | Pass |
| **Currently active models** | | |
| Qwen3-Coder-30B-A3B-1M | iSWA (MoE) | In use |
| Qwen3.5-35B-A3B | iSWA (MoE) | In use |
| Qwen3-30B-A3B-Instruct-2507 | iSWA (MoE) | In use |
| Qwen3-30B-A3B-Thinking-2507 | iSWA (MoE) | In use |
| Gemma-3-4B-IT | Standard | In use |
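The 0.95 quality gate above is a cosine similarity between the logit vectors produced with and without compaction. A minimal sketch of how such a gate can be computed (a generic metric implementation, not the fork's test harness):

```cpp
#include <cmath>
#include <vector>

// Cosine similarity between two logit vectors. A value near 1.0 means the
// next-token distributions before and after compaction are nearly unchanged.
double logit_cosine(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += (double)a[i] * b[i];
        na  += (double)a[i] * a[i];
        nb  += (double)b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Gate check: pass when similarity meets the threshold (0.95 in this fork's CI).
bool passes_quality_gate(const std::vector<float>& baseline,
                         const std::vector<float>& compacted,
                         double gate = 0.95) {
    return logit_cosine(baseline, compacted) >= gate;
}
```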
| Document | Description |
|---|---|
| CHANGELOG | Full implementation history: V0 → V1 → V2 → V5 → Phase 8 → Phase D → OSS Launch |
| DESIGN-DECISIONS | 13 architecture decisions with rationale |
| BUGS-AND-FIXES | All 29 bugs: root cause, fix, commit SHA |
| UPSTREAM-SYNC | Sync process, CI workflows, 3-branch model |
| Algorithm | Selection, fitting, execution stages |
| Integration | File map and architecture support matrix |
| Benchmarks | 3-way fork vs upstream vs Ollama comparison |
| Paper Comparison | Fork vs arXiv:2602.16284 MIT reference |