Benchmarks

This is the evidence page. It carries the published Qwen decode tables, the iterative 50x / 256K result from the changelog, and the summary ranges cited elsewhere on the site.

Primary benchmark source: docs/benchmark-fork-vs-upstream.md (2026-03-13). Hardware: Apple M2 Pro, 32GB unified memory, Metal GPU, 128-token generation. Negative rows are included intentionally.

  • Baseline parity: within 2%. modelai-llama.cpp vs upstream llama.cpp on the published decode benchmark.
  • Best published decode gain: +63%. Qwen3-8B at 8K / 8x on the select pipeline.
  • Best published quality: 0.999. Multiple select-pipeline rows in the published Qwen tables.
  • Largest published slowdown: -31%. Qwen3-8B at 16K / 2x. This page includes the negative rows too.

1. Baseline runtime comparison (no compaction)

This table matters because it shows what happens when compaction is not used. The fork stays close to upstream on decode speed, which means the compaction code does not introduce a baseline penalty in the published benchmark run.

| Model | modelai Gen (t/s) | upstream Gen (t/s) | Ollama Gen (t/s) | modelai vs upstream | modelai vs Ollama |
|---|---|---|---|---|---|
| Qwen3-8B | 30.4 | 30.1 | 29.2 | +1.0% | +4.1% |
| Qwen3-14B | 17.1 | 17.4 | 16.7 | -1.7% | +2.4% |
| Qwen3-30B-A3B | 52.8 | 51.8 | 46.7 | +1.9% | +13.1% |
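The "vs" columns are plain relative throughput differences, so they can be recomputed from the generation rates in the table. A small illustrative snippet (the `pct_diff` helper is ours, not part of the benchmark harness):

```python
def pct_diff(fork_tps: float, reference_tps: float) -> float:
    """Relative decode-throughput difference of the fork vs a reference, in percent."""
    return 100.0 * (fork_tps / reference_tps - 1.0)

# Qwen3-8B row: modelai 30.4 t/s vs upstream 30.1 t/s
print(f"{pct_diff(30.4, 30.1):+.1f}%")  # +1.0%, matching the table
# Qwen3-14B row: modelai 17.1 t/s vs upstream 17.4 t/s
print(f"{pct_diff(17.1, 17.4):+.1f}%")  # -1.7%, matching the table
```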

Prompt processing throughput

Same benchmark run, prompt-side throughput. Included for completeness — Ollama leads here.

| Model | modelai Prompt (t/s) | upstream Prompt (t/s) | Ollama Prompt (t/s) |
|---|---|---|---|
| Qwen3-8B | 93.6 | 92.4 | 131.0 |
| Qwen3-14B | 52.0 | 52.5 | 73.3 |
| Qwen3-30B-A3B | 85.8 | 97.4 | 112.5 |

2. Compacted decode speed (select pipeline)

This is the main published evidence for the fork’s practical value. It is also where the tradeoffs show up most clearly. Higher ratios and longer contexts can help. Lower ratios can be neutral or slower. The negative rows are kept here intentionally.

Qwen3-8B

| Context | Ratio | Cosine | Compacted Decode (t/s) | Baseline Decode (t/s) | Effective Speedup |
|---|---|---|---|---|---|
| 4K | 2x | 0.999 | 10.0 | 10.7 | -7% |
| 4K | 4x | 0.999 | 12.3 | 10.7 | +15% |
| 4K | 8x | 0.997 | 13.8 | 10.7 | +29% |
| 8K | 2x | 0.999 | 6.1 | 5.7 | +7% |
| 8K | 4x | 0.998 | 8.0 | 5.7 | +40% |
| 8K | 8x | 0.997 | 9.3 | 5.7 | +63% |
| 16K | 2x | 0.999 | 3.3 | 4.8 | -31% |
| 16K | 4x | 0.999 | 4.5 | 4.8 | -6% |
| 16K | 8x | 0.999 | 5.3 | 4.8 | +10% |

Qwen3-30B-A3B

| Context | Ratio | Cosine | Compacted Decode (t/s) | Baseline Decode (t/s) | Effective Speedup |
|---|---|---|---|---|---|
| 4K | 2x | 0.999 | 14.0 | 13.5 | +4% |
| 4K | 4x | 0.999 | 17.6 | 13.5 | +30% |
| 4K | 8x | 0.999 | 20.2 | 13.5 | +50% |
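The Effective Speedup column in both tables is the relative change in decode rate, compacted vs baseline, rounded to a whole percent. A quick sketch that spot-checks published rows (the `effective_speedup` helper is our naming, not the benchmark harness's):

```python
def effective_speedup(compacted_tps: float, baseline_tps: float) -> int:
    """Effective speedup as a whole-percent relative decode-rate change."""
    return round(100.0 * (compacted_tps / baseline_tps - 1.0))

# Spot-check against published Qwen3-8B rows:
assert effective_speedup(9.3, 5.7) == 63    # 8K / 8x, best published gain
assert effective_speedup(3.3, 4.8) == -31   # 16K / 2x, largest published slowdown
assert effective_speedup(12.3, 10.7) == 15  # 4K / 4x
```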

3. Quality heatmap (select pipeline)

The current public quality story is strongest for the select pipeline. These are the exact heatmap values published in the benchmark note.

| Model | 4K / 2x | 4K / 4x | 4K / 8x | 8K / 2x | 8K / 4x | 8K / 8x |
|---|---|---|---|---|---|---|
| Qwen3-8B | 0.999 | 0.999 | 0.997 | 0.999 | 0.998 | 0.997 |
| Qwen3-30B-A3B | 0.999 | 0.999 | 0.999 | n/a | n/a | n/a |
| Qwen3-14B | 0.995 | 0.996 | 0.992 | 0.999 | 0.997 | 0.973 |
| DeepSeek-R1-14B | 0.999 | 0.998 | 0.993 | n/a | n/a | n/a |

4. Iterative 50x / 256K result

The iterative context-extension result is real and worth showing, but it needs to be framed carefully. It is a specific iterative on-policy test from docs/CHANGELOG.md, not the default single-pass benchmark configuration used in the runtime comparison note above.

  • Headline figure: 50x compression (iterative test).

The CHANGELOG reports a 50x compression run on Qwen3-30B-A3B with 0.9967 cosine and 10/10 fact recall, plus 256K effective context from 64K physical memory across 49-57 compaction cycles. That is a specific iterative refinement result. It should be read as an existence proof for the technique, not as the default operating point of the published runtime benchmark suite.

Source: docs/CHANGELOG.md.
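The mechanics behind "effective context beyond physical memory" can be pictured with a toy loop. Everything below is a hypothetical illustration, not the fork's implementation, and the toy parameters do not reproduce the CHANGELOG's 49-57 cycles: tokens stream through a KV cache with a fixed physical budget, and whenever the budget would overflow, the resident cache is compacted in place.

```python
def toy_iterative_extension(total_tokens: int, physical_budget: int,
                            ratio: int, step: int = 1024) -> int:
    """Toy model: stream `total_tokens` through a cache capped at
    `physical_budget` entries, compacting the resident cache by `ratio`
    whenever it would overflow. Returns the number of compaction cycles."""
    resident = 0
    cycles = 0
    processed = 0
    while processed < total_tokens:
        take = min(step, total_tokens - processed)
        if resident + take > physical_budget:
            resident //= ratio  # in-place compaction of the resident cache
            cycles += 1
        resident += take
        processed += take
    return cycles

# 256K effective context through a 64K physical budget at 8x per-cycle compaction:
cycles = toy_iterative_extension(256 * 1024, 64 * 1024, 8)
```

In this toy model only a handful of compactions are needed, whereas the run described above reports 49-57 cycles of iterative refinement, which is part of why it should be read as an existence proof rather than a default operating point.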

5. Summary ranges from HIGHLIGHTS.md

  • Select pipeline quality range: 0.946-0.999. Across 15 models in docs/HIGHLIGHTS.md.
  • Compaction time: 62-518 ms. Depends on context length and compression ratio.
  • Models tested: 17, with 15 passing the current quality gate.

6. How to read these results

Reasonable conclusions
  • The fork does not show a meaningful published baseline decode regression against upstream.
  • The select pipeline can improve decode speed substantially at some higher-ratio / longer-context operating points.
  • The public quality numbers support "near-identical" behavior in the strongest published cases, not universal exact equivalence.
Bad conclusions
  • Do not read the 50x result as the default benchmark outcome.
  • Do not assume every compression ratio improves speed.
  • Do not mix the published cosine benchmark set with other experiments that use different scoring methods.

Primary source trail

  • Fork vs upstream benchmark: three-way baseline comparison, compacted decode tables, and the Qwen quality heatmap.
  • CHANGELOG: specific iterative 50x / 256K result, including the fact-recall note and cycle counts.
  • HIGHLIGHTS: one-page summary of ranges, model coverage, test counts, and implementation scope.