This is the evidence page. It carries the published Qwen decode tables, the iterative 50x / 256K result from the changelog, and the summary ranges cited elsewhere on the site.
This first table shows decode throughput with compaction disabled. The fork tracks upstream closely, so the compaction code does not introduce a baseline penalty in the published benchmark run.
| Model | modelai Gen (t/s) | upstream Gen (t/s) | Ollama Gen (t/s) | modelai vs upstream | modelai vs Ollama |
|---|---|---|---|---|---|
| Qwen3-8B | 30.4 | 30.1 | 29.2 | +1.0% | +4.1% |
| Qwen3-14B | 17.1 | 17.4 | 16.7 | -1.7% | +2.4% |
| Qwen3-30B-A3B | 52.8 | 51.8 | 46.7 | +1.9% | +13.1% |
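The two relative columns are plain throughput ratios. A minimal sketch of how they are derived, using the decode numbers from the table above:

```python
# Decode throughput (t/s) copied from the table above.
rows = {
    "Qwen3-8B":      {"modelai": 30.4, "upstream": 30.1, "ollama": 29.2},
    "Qwen3-14B":     {"modelai": 17.1, "upstream": 17.4, "ollama": 16.7},
    "Qwen3-30B-A3B": {"modelai": 52.8, "upstream": 51.8, "ollama": 46.7},
}

def pct_delta(a: float, b: float) -> float:
    """Relative throughput of a versus b, in percent."""
    return (a / b - 1.0) * 100.0

for model, r in rows.items():
    vs_up = pct_delta(r["modelai"], r["upstream"])
    vs_ol = pct_delta(r["modelai"], r["ollama"])
    print(f"{model}: {vs_up:+.1f}% vs upstream, {vs_ol:+.1f}% vs Ollama")
```

Running this reproduces the published columns, e.g. +1.0% / +4.1% for Qwen3-8B.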
Prompt-side throughput from the same benchmark run, included for completeness; Ollama leads here.
| Model | modelai Prompt (t/s) | upstream Prompt (t/s) | Ollama Prompt (t/s) |
|---|---|---|---|
| Qwen3-8B | 93.6 | 92.4 | 131.0 |
| Qwen3-14B | 52.0 | 52.5 | 73.3 |
| Qwen3-30B-A3B | 85.8 | 97.4 | 112.5 |
This is the main published evidence for the fork's practical value, and it is also where the tradeoffs show up most clearly: higher compaction ratios generally pay off, while 2x ratios can be neutral or slower than baseline, and the 16K rows regress at low ratios. The negative rows are kept here intentionally.
| Context | Ratio | Cosine | Compacted Decode (t/s) | Baseline Decode (t/s) | Effective Speedup |
|---|---|---|---|---|---|
| 4K | 2x | 0.999 | 10.0 | 10.7 | -7% |
| 4K | 4x | 0.999 | 12.3 | 10.7 | +15% |
| 4K | 8x | 0.997 | 13.8 | 10.7 | +29% |
| 8K | 2x | 0.999 | 6.1 | 5.7 | +7% |
| 8K | 4x | 0.998 | 8.0 | 5.7 | +40% |
| 8K | 8x | 0.997 | 9.3 | 5.7 | +63% |
| 16K | 2x | 0.999 | 3.3 | 4.8 | -31% |
| 16K | 4x | 0.999 | 4.5 | 4.8 | -6% |
| 16K | 8x | 0.999 | 5.3 | 4.8 | +10% |
| Context | Ratio | Cosine | Compacted Decode (t/s) | Baseline Decode (t/s) | Effective Speedup |
|---|---|---|---|---|---|
| 4K | 2x | 0.999 | 14.0 | 13.5 | +4% |
| 4K | 4x | 0.999 | 17.6 | 13.5 | +30% |
| 4K | 8x | 0.999 | 20.2 | 13.5 | +50% |
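The sign pattern in these tables suggests a simple acceptance rule: only enable compaction for a (context, ratio) cell whose measured speedup clears a threshold. A hypothetical sketch (the function name and 5% threshold are illustrative, not part of the fork):

```python
def should_compact(compacted_tps: float, baseline_tps: float,
                   min_gain_pct: float = 5.0) -> bool:
    """Accept compaction only if decode throughput improves by min_gain_pct."""
    gain = (compacted_tps / baseline_tps - 1.0) * 100.0
    return gain >= min_gain_pct

# 16K/2x from the table above regresses (-31%), so it is rejected;
# 8K/8x (+63%) clears the bar.
print(should_compact(3.3, 4.8))  # False
print(should_compact(9.3, 5.7))  # True
```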
The current public quality story is strongest for the select pipeline. These are the exact heatmap values published in the benchmark note.
| Model | 4K / 2x | 4K / 4x | 4K / 8x | 8K / 2x | 8K / 4x | 8K / 8x |
|---|---|---|---|---|---|---|
| Qwen3-8B | 0.999 | 0.999 | 0.997 | 0.999 | 0.998 | 0.997 |
| Qwen3-30B-A3B | 0.999 | 0.999 | 0.999 | — | — | — |
| Qwen3-14B | 0.995 | 0.996 | 0.992 | 0.999 | 0.997 | 0.973 |
| DeepSeek-R1-14B | 0.999 | 0.998 | 0.993 | — | — | — |
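The cosine figures measure how closely the compacted run matches the uncompacted baseline. A minimal sketch of the metric itself, assuming it is computed over output logit or hidden-state vectors, which is typical for this kind of fidelity check (the example vectors are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

baseline = [0.80, 0.10, 0.05, 0.05]   # illustrative logit-like vectors
compacted = [0.79, 0.11, 0.05, 0.05]
print(round(cosine_similarity(baseline, compacted), 4))
```

Values like 0.997 in the heatmap mean the compacted outputs are nearly parallel to the baseline outputs, not that they are identical.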
The iterative context-extension result is real and worth showing, but it needs to be framed carefully. It is a specific iterative on-policy test from docs/CHANGELOG.md, not the default single-pass benchmark configuration used in the runtime comparison note above.
The CHANGELOG reports a 50x compression run on Qwen3-30B-A3B with 0.9967 cosine and 10/10 fact recall, plus 256K effective context from 64K physical memory across 49-57 compaction cycles. That is a specific iterative refinement result. It should be read as an existence proof for the technique, not as the default operating point of the published runtime benchmark suite.
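Reading the headline numbers together is simple arithmetic; a hedged sketch, where only the 64K, 256K, and 49-57 figures come from the CHANGELOG and the per-cycle ingestion rate is derived, not reported:

```python
physical = 64 * 1024            # physical KV budget reported in the CHANGELOG
effective = 256 * 1024          # effective context reached
cycles_lo, cycles_hi = 49, 57   # reported compaction-cycle range

extension = effective / physical
print(f"context extension: {extension:.0f}x")  # prints "context extension: 4x"

# Tokens ingested beyond the physical window, spread over the cycle range
# (derived estimate, not a published figure).
extra = effective - physical
print(f"roughly {extra // cycles_hi} to {extra // cycles_lo} new tokens per cycle")
```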
Sources: docs/CHANGELOG.md and docs/HIGHLIGHTS.md.