This is the evidence page. It carries the published Qwen decode tables, the iterative 50x / 256K result from the changelog, and the summary ranges cited elsewhere on the site.
This first table shows decode throughput with compaction disabled. The fork tracks upstream closely, so the compaction code does not introduce a baseline penalty in the published benchmark run.
| Model | modelai Gen (t/s) | upstream Gen (t/s) | Ollama Gen (t/s) | modelai vs upstream | modelai vs Ollama |
|---|---|---|---|---|---|
| Qwen3-8B | 30.4 | 30.1 | 29.2 | +1.0% | +4.1% |
| Qwen3-14B | 17.1 | 17.4 | 16.7 | -1.7% | +2.4% |
| Qwen3-30B-A3B | 52.8 | 51.8 | 46.7 | +1.9% | +13.1% |
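The two relative columns are plain throughput ratios. A minimal sketch of how they are derived, using the decode numbers from the table above:

```python
# Decode throughput (t/s) copied from the table above.
rows = {
    "Qwen3-8B":      {"modelai": 30.4, "upstream": 30.1, "ollama": 29.2},
    "Qwen3-14B":     {"modelai": 17.1, "upstream": 17.4, "ollama": 16.7},
    "Qwen3-30B-A3B": {"modelai": 52.8, "upstream": 51.8, "ollama": 46.7},
}

def pct_delta(a: float, b: float) -> float:
    """Relative throughput of a versus b, in percent."""
    return (a / b - 1.0) * 100.0

for model, r in rows.items():
    vs_up = pct_delta(r["modelai"], r["upstream"])
    vs_ol = pct_delta(r["modelai"], r["ollama"])
    print(f"{model}: {vs_up:+.1f}% vs upstream, {vs_ol:+.1f}% vs Ollama")
```

Running this reproduces the published columns, e.g. +1.0% / +4.1% for Qwen3-8B.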
Prompt-side throughput from the same benchmark run, included for completeness; Ollama leads here.
| Model | modelai Prompt (t/s) | upstream Prompt (t/s) | Ollama Prompt (t/s) |
|---|---|---|---|
| Qwen3-8B | 93.6 | 92.4 | 131.0 |
| Qwen3-14B | 52.0 | 52.5 | 73.3 |
| Qwen3-30B-A3B | 85.8 | 97.4 | 112.5 |
This is the main published evidence for the fork's practical value, and it is also where the tradeoffs show up most clearly: higher compaction ratios generally pay off, while 2x ratios can be neutral or slower than baseline, and the 16K rows regress at low ratios. The negative rows are kept here intentionally.
| Context | Ratio | Cosine | Compacted Decode (t/s) | Baseline Decode (t/s) | Effective Speedup |
|---|---|---|---|---|---|
| 4K | 2x | 0.999 | 10.0 | 10.7 | -7% |
| 4K | 4x | 0.999 | 12.3 | 10.7 | +15% |
| 4K | 8x | 0.997 | 13.8 | 10.7 | +29% |
| 8K | 2x | 0.999 | 6.1 | 5.7 | +7% |
| 8K | 4x | 0.998 | 8.0 | 5.7 | +40% |
| 8K | 8x | 0.997 | 9.3 | 5.7 | +63% |
| 16K | 2x | 0.999 | 3.3 | 4.8 | -31% |
| 16K | 4x | 0.999 | 4.5 | 4.8 | -6% |
| 16K | 8x | 0.999 | 5.3 | 4.8 | +10% |
| Context | Ratio | Cosine | Compacted Decode (t/s) | Baseline Decode (t/s) | Effective Speedup |
|---|---|---|---|---|---|
| 4K | 2x | 0.999 | 14.0 | 13.5 | +4% |
| 4K | 4x | 0.999 | 17.6 | 13.5 | +30% |
| 4K | 8x | 0.999 | 20.2 | 13.5 | +50% |
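The sign pattern in these tables suggests a simple acceptance rule: only enable compaction for a (context, ratio) cell whose measured speedup clears a threshold. A hypothetical sketch (the function name and 5% threshold are illustrative, not part of the fork):

```python
def should_compact(compacted_tps: float, baseline_tps: float,
                   min_gain_pct: float = 5.0) -> bool:
    """Accept compaction only if decode throughput improves by min_gain_pct."""
    gain = (compacted_tps / baseline_tps - 1.0) * 100.0
    return gain >= min_gain_pct

# 16K/2x from the table above regresses (-31%), so it is rejected;
# 8K/8x (+63%) clears the bar.
print(should_compact(3.3, 4.8))  # False
print(should_compact(9.3, 5.7))  # True
```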
The current public quality story is strongest for the select pipeline. These are the exact heatmap values published in the benchmark note.
| Model | 4K / 2x | 4K / 4x | 4K / 8x | 8K / 2x | 8K / 4x | 8K / 8x |
|---|---|---|---|---|---|---|
| Qwen3-8B | 0.999 | 0.999 | 0.997 | 0.999 | 0.998 | 0.997 |
| Qwen3-30B-A3B | 0.999 | 0.999 | 0.999 | — | — | — |
| Qwen3-14B | 0.995 | 0.996 | 0.992 | 0.999 | 0.997 | 0.973 |
| DeepSeek-R1-14B | 0.999 | 0.998 | 0.993 | — | — | — |
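The cosine figures measure how closely the compacted run matches the uncompacted baseline. A minimal sketch of the metric itself, assuming it is computed over output logit or hidden-state vectors, which is typical for this kind of fidelity check (the example vectors are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

baseline = [0.80, 0.10, 0.05, 0.05]   # illustrative logit-like vectors
compacted = [0.79, 0.11, 0.05, 0.05]
print(round(cosine_similarity(baseline, compacted), 4))
```

Values like 0.997 in the heatmap mean the compacted outputs are nearly parallel to the baseline outputs, not that they are identical.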
The iterative context-extension result is real and worth showing, but it needs to be framed carefully. It is a specific iterative on-policy test from docs/CHANGELOG.md, not the default single-pass benchmark configuration used in the runtime comparison note above.
The CHANGELOG reports a 50x compression run on Qwen3-30B-A3B with 0.9967 cosine and 10/10 fact recall, plus 256K effective context from 64K physical memory across 49-57 compaction cycles. That is a specific iterative refinement result. It should be read as an existence proof for the technique, not as the default operating point of the published runtime benchmark suite.
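Reading the headline numbers together is simple arithmetic; a hedged sketch, where only the 64K, 256K, and 49-57 figures come from the CHANGELOG and the per-cycle ingestion rate is derived, not reported:

```python
physical = 64 * 1024            # physical KV budget reported in the CHANGELOG
effective = 256 * 1024          # effective context reached
cycles_lo, cycles_hi = 49, 57   # reported compaction-cycle range

extension = effective / physical
print(f"context extension: {extension:.0f}x")  # prints "context extension: 4x"

# Tokens ingested beyond the physical window, spread over the cycle range
# (derived estimate, not a published figure).
extra = effective - physical
print(f"roughly {extra // cycles_hi} to {extra // cycles_lo} new tokens per cycle")
```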
Sources: docs/CHANGELOG.md and docs/HIGHLIGHTS.md.