modelai-llama.cpp

Production fork of llama.cpp adding KV cache compaction, Metal GPU acceleration, multi-architecture support, and production-grade QA infrastructure. Zero regression on upstream performance.

MIT License · 463+ commits ahead · 29 bugs fixed

- Quality: 0.997 logit cosine similarity (Qwen3-8B, select pipeline)
- Decode speedup: +63% after 8x compaction at 8K context
- Effective context: 256K from a 64K physical KV cache
- Upstream regression: 0% (fork matches upstream within 2%)

KV Cache Compaction

Based on "Fast KV Compaction via Attention Matching" (Zweiger et al., MIT Han Lab). Instead of truncating or evicting old context, compaction compresses the KV cache into a smaller learned representation that preserves attention behavior.

Fill 64K KV cache -> compact at 4x -> 16K compacted + 48K free -> fill again -> repeat. Each cycle advances position IDs by ~48K, yielding 256K effective context from a 64K physical allocation.
  1. Select key positions by attention score (62-185ms)
  2. Fit optional additive bias (beta) and compressed values (V)
  3. Prepend compacted tensors to live KV during attention
  4. Reclaim freed KV slots for new tokens
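The cycle arithmetic above can be checked with a standalone sketch. The constants mirror the 64K-cache, 4x-ratio example in this section, not the fork's internals:

```python
# Illustrative arithmetic for the fill -> compact -> refill cycle above.
# Assumes a 64K-token physical KV cache and a fixed 4x compaction ratio;
# these constants mirror the example, not the fork's actual code.

PHYSICAL_KV = 64 * 1024   # physical KV cache slots
RATIO = 4                 # 4x compaction keeps 1/4 of the cache

def effective_context(cycles: int) -> int:
    """Tokens of history representable after `cycles` fill+compact rounds."""
    kv_used = 0   # live slots (compacted prefix) currently in the cache
    seen = 0      # total tokens ever ingested = effective context
    for _ in range(cycles):
        fill = PHYSICAL_KV - kv_used    # new tokens until the cache is full
        seen += fill
        kv_used = PHYSICAL_KV // RATIO  # cache compacted down to 1/4
    return seen
```

The first cycle ingests the full 64K; each later cycle frees 48K slots and refills them, so five cycles reach the 256K effective context quoted above.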

7 Compaction Pipelines

| Pipeline | Speed | Quality | Use Case |
| --- | --- | --- | --- |
| select | Fast (62-185ms) | 0.946-0.999 | Production default, flash-compatible |
| solver | Medium | Higher | Full beta + V fitting |
| omp | Slow | Highest | Orthogonal Matching Pursuit |
| self_study | Slow | High | On-policy Q generation via autoregressive continuation |
| chunked | Medium | High | Long context (>8K prefix) |
| on_policy | Slow | Highest | Iterative quality-gated refinement |
| sequential | Very slow | Highest | Per-layer sequential on-policy |

Pure C++ Solver Engine

A complete Attention Matching solver with zero external dependencies; it builds on any platform CMake supports.

- NNLS beta fitting: projected gradient descent with an lstsq+clamp fallback
- Least-squares V: 3-tier cascade (LAPACK sgels, Cholesky, aggressive Cholesky)
- NEON SIMD: ARM NEON-optimized dot product for Apple Silicon
- Metal GPU: attention score + XtX assembly on Apple Silicon
- Spectral ridge: power iteration for regularization parameter scaling
- Quantized K/V: Q8_0 and Q4_K extraction with block-aligned dequantization
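The NNLS beta-fitting entry refers to a standard technique: minimize ||Ax - b||^2 subject to x >= 0 by taking gradient steps and clamping negatives. A minimal NumPy sketch of that technique (illustrative only; the fork implements it in dependency-free C++):

```python
import numpy as np

def nnls_pgd(A, b, iters=500):
    """Nonnegative least squares via projected gradient descent:
    gradient step on ||Ax - b||^2, then project onto x >= 0.
    Illustrative sketch, not the fork's C++ solver."""
    AtA, Atb = A.T @ A, A.T @ b
    # Step size from the spectral norm of A^T A (Lipschitz constant).
    lr = 1.0 / np.linalg.norm(AtA, 2)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - lr * (AtA @ x - Atb)  # gradient step
        x = np.maximum(x, 0.0)        # projection onto the feasible set
    return x

def lstsq_clamp(A, b):
    """Fallback path: unconstrained least squares, then clamp negatives."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.maximum(x, 0.0)
```

The lstsq+clamp fallback is cheaper but can be biased when the unconstrained solution has large negative components; projected gradient descent converges to the true constrained optimum.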

5 Supported Architectures

| Architecture | Status | Example Models |
| --- | --- | --- |
| Standard attention | Supported | Llama, Qwen2.5, Mistral, DeepSeek-R1, Aya, Granite |
| iSWA (interleaved sliding window) | Supported | Qwen3-8B, Qwen3-14B, Qwen3-30B-A3B |
| Hybrid SSM+Attention | Supported | Granite3.1-Dense-8B |
| Hybrid-iSWA | Supported | Hybrid + interleaved sliding window |
| IMROPE (multi-resolution RoPE) | Supported | Qwen3.5, Qwen3.5-MOE |
| Pure recurrent (Mamba/RWKV) | N/A | No KV cache |

Public C API and Features

Public C API
- llama_kv_cache_compact(): trigger compaction on demand
- llama_kv_cache_set_auto_compact(): set a threshold-based auto-trigger
- llama_kv_cache_compact_info(): query compaction state

Auto-Compaction
Threshold-triggered with a one-shot guard. The server sets the threshold via --compact-threshold; compaction fires automatically when KV fill exceeds it.

State Persistence
The compacted prefix survives save/restore (version 2 serialization with an is_imrope flag), enabling full session recovery across server restarts.

Per-Layer Flash Hybrid
Layers with zero beta use flash attention; layers with non-zero beta use the standard path. The per-layer decision is made automatically at graph build time.
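The threshold-plus-one-shot-guard behavior described above can be modeled as a small state machine. A hypothetical Python sketch (the `AutoCompactor` name and structure are illustrative, not part of the C API):

```python
class AutoCompactor:
    """Illustrative model of threshold-triggered auto-compaction with a
    one-shot guard: fire once when KV fill crosses the threshold, then
    re-arm only after fill drops back below it. Hypothetical sketch,
    not the fork's implementation."""

    def __init__(self, threshold: float):
        self.threshold = threshold  # e.g. 0.85 = compact at 85% KV fill
        self.armed = True           # the one-shot guard

    def should_compact(self, n_kv_used: int, n_kv_total: int) -> bool:
        fill = n_kv_used / n_kv_total
        if fill >= self.threshold and self.armed:
            self.armed = False      # guard: don't re-fire on the next token
            return True
        if fill < self.threshold:
            self.armed = True       # re-arm once compaction freed space
        return False
```

The guard matters because compaction is asynchronous with decoding: without it, every token decoded between the trigger and the actual compaction would re-fire the trigger.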

Server Integration

```shell
# Build
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(nproc)

# Run with compaction enabled
./build/bin/llama-server \
  -m model.gguf \
  --endpoint-compact \
  -c 8192

# Trigger compaction via REST API
curl -X POST http://localhost:8080/compact \
  -H "Content-Type: application/json" \
  -d '{"id_slot": 0, "method": "select", "ratio": 4.0}'

# Query compaction state
curl http://localhost:8080/props | jq '.default_generation_settings.compacted_prefix'
```

Server extensions: /compact REST endpoint, /props compaction state, Prometheus gauges (llamacpp:modelai_active_n_kv_total, llamacpp:modelai_active_n_kv_max, llamacpp:modelai_sequence_state_bytes_total).
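Any HTTP client can drive the /compact endpoint; the sketch below just assembles the request shown in the curl example (the URL, header, and JSON fields come from that example; `build_compact_request` is a hypothetical helper):

```python
import json

# Assumed server address, taken from the curl example above.
SERVER = "http://localhost:8080"

def build_compact_request(id_slot: int = 0, method: str = "select",
                          ratio: float = 4.0):
    """Assemble the POST /compact request from the curl example.
    Hypothetical helper; pair it with urllib.request or any HTTP client."""
    url = f"{SERVER}/compact"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"id_slot": id_slot, "method": method, "ratio": ratio})
    return url, headers, body
```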

29 Bug Fixes

Every bug found through adversarial review, fixed with root-cause analysis, and verified with regression tests. Full list with commit SHAs.

- Critical/Major: 18 (crashes, data corruption, quality loss)
- Minor/Infra: 11 (CI fixes, build issues, diagnostics)
- Security: 1 (RPC RCE, synced same-day from upstream)

Notable fixes: post-compaction crash on stale prompt cache, nonuniform NaN propagation, 80+ Metal GPU sync stalls per decode, OMP infinite loop at high compression, KV shift position wrap, quantized V misalignment.

CI and Testing Infrastructure

- CI workflows: 7, active on every push/PR
- Engine test tiers: 8, from pytests to a live dashboard
- CI tests: 49, main-label, all passing
- Models validated: 17, of which 15 pass the 0.95 cosine gate
| Workflow | Platform | Trigger |
| --- | --- | --- |
| modelai-ci | macOS (Apple Silicon) | Push, PR |
| modelai-server-smoke | macOS | Push, PR |
| modelai-perf-smoke | macOS | Push, PR |
| modelai-ci-windows | Windows (MSVC x64) | Push, PR |
| modelai-upstream-sync | macOS | Saturday 2PM PDT |
| modelai-dashboard | macOS | After CI success |
| modelai-auto-label | GitHub | Issues, PRs |

Weekly Upstream Sync and Review

Weekly automated sync from ggml-org/llama.cpp. Every merge is CI-gated: build + full test suite + benchmark regression check must pass. Changes touching KV cache or attention paths trigger manual review before integration. Full process details.

- Schedule: Saturday 2PM PDT, automated CI-gated merge
- Branch model: 3 branches (modelai-main, upstream-master, upstream-sync)
- Security patches: same-day emergency sync for CVEs

6 upstream KV cache changes audited and verified compatible. RPC RCE security patch synced within hours of upstream disclosure.

Validated Models

17 models tested, 15 pass the 0.95 logit cosine similarity quality gate (select pipeline).

| Model | Architecture | Quality Gate |
| --- | --- | --- |
| Qwen3-8B | iSWA | Pass (0.997) |
| Qwen3-14B | iSWA | Pass (0.992-0.999) |
| Qwen3-30B-A3B | iSWA (MoE) | Pass (0.999) |
| Qwen2.5-Coder-14B | Standard | Pass |
| Qwen2.5-7B | Standard | Pass |
| Qwen2.5-14B | Standard | Pass |
| DeepSeek-R1-14B | Standard | Pass (0.993-0.999) |
| DeepSeek-R1-8B | Standard | Pass |
| Phi4-14B | Standard | Pass |
| Mistral-7B | Standard | Pass |
| CodeGemma-7B | Standard | Pass |
| Granite3.1-Dense-8B | Hybrid SSM | Pass |
| Llama3.1-8B | Standard | Pass |
| Llama-3.2-3B | Standard | Pass |
| Aya-Expanse-8B | Standard | Pass |
| Gemma2-9B | Standard | Pass |
| TinyLlama-1.1B | Standard | Pass |

Current active models:

| Model | Architecture | Status |
| --- | --- | --- |
| Qwen3-Coder-30B-A3B-1M | iSWA (MoE) | In use |
| Qwen3.5-35B-A3B | iSWA (MoE) | In use |
| Qwen3-30B-A3B-Instruct-2507 | iSWA (MoE) | In use |
| Qwen3-30B-A3B-Thinking-2507 | iSWA (MoE) | In use |
| Gemma-3-4B-IT | Standard | In use |
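The quality gate compares post-compaction logits against upstream logits by cosine similarity. A minimal sketch of that check (illustrative, not the fork's test harness):

```python
import math

def logit_cosine(a, b):
    """Cosine similarity between two logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def passes_gate(upstream_logits, fork_logits, threshold=0.95):
    """Quality gate: compacted-cache logits must stay within `threshold`
    cosine similarity of the uncompacted upstream logits."""
    return logit_cosine(upstream_logits, fork_logits) >= threshold
```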

Documentation

| Document | Description |
| --- | --- |
| CHANGELOG | Full implementation history: V0 → V1 → V2 → V5 → Phase 8 → Phase D → OSS Launch |
| DESIGN-DECISIONS | 13 architecture decisions with rationale |
| BUGS-AND-FIXES | All 29 bugs: root cause, fix, commit SHA |
| UPSTREAM-SYNC | Sync process, CI workflows, 3-branch model |
| Algorithm | Selection, fitting, execution stages |
| Integration | File map and architecture support matrix |
| Benchmarks | 3-way fork vs upstream vs Ollama comparison |
| Paper Comparison | Fork vs arXiv:2602.16284 MIT reference |