Quality assurance

This fork is doing risky runtime work: KV cache mutation, compacted-prefix execution, server endpoints, and long-session state handling. The public QA story needs to show how that work is reviewed and tested, not just that the benchmark numbers look good.

The point of this page is not to claim perfection. It is to show the review posture, test layers, and bug history that make the published benchmark claims more credible.
- Documented bug fixes: 29 distinct fixes documented in docs/BUGS-AND-FIXES.md.
- Critical / major fixes: 18 crashes, quality-collapse, stale-state, and performance faults caught and fixed.
- C++ tests: 53 CI-gated C++ tests (66 total tests listed in docs/HIGHLIGHTS.md).
- Test tiers: 8 tiers, from unit/contract checks through performance and live dashboard reporting.

Primary QA sources: docs/HIGHLIGHTS.md, docs/BUGS-AND-FIXES.md, and docs/UPSTREAM-SYNC.md.

Adversarial review protocol

The review posture here is not “does the code look reasonable?” It is “assume there is a bug and try to find it.” That matters for compaction because the dangerous failures are not always obvious. They often look like silent context loss, wrong counts in metrics, or a crash that appears one request after a reclaim step.

| Review area | Why it matters |
| --- | --- |
| Scope and contract checks | Verify the implementation matches the declared slice and does not drift across repo or API boundaries. |
| Concrete execution traces | Force the reviewer to walk production, boundary, adversarial, security, and concurrency traces with real values. |
| State-machine review | Check before / during / after compaction and reclaim, especially for prefix matching and save/restore state. |
| Disprove-it pass | Try to break indexing, layout assumptions, integer math, and memory ownership instead of re-affirming the happy path. |
| Cross-repo contract review | Server endpoints, metrics, and runtime capability reporting have to match the fork and downstream consumers. |
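The "disprove-it pass" can be made concrete as small adversarial checks that probe boundaries instead of the happy path. The sketch below uses a toy reclaim model, written for this page only, to illustrate the kind of index-math edge cases a reviewer walks; it is not the fork's actual implementation.

```python
# Hypothetical "disprove-it" check: probe the boundary cases where
# compaction index math typically breaks (empty cache, full prefix,
# zero prefix). The reclaim model is illustrative only.

def reclaim(cells: list[bool], keep_prefix: int) -> list[bool]:
    """Drop unused cells past a protected prefix (toy model)."""
    if not 0 <= keep_prefix <= len(cells):
        raise ValueError("keep_prefix out of range")
    return cells[:keep_prefix] + [c for c in cells[keep_prefix:] if c]

def check_boundaries() -> None:
    assert reclaim([], 0) == []                          # empty cache
    cells = [True, False, True, False]
    assert reclaim(cells, len(cells)) == cells           # prefix covers all
    assert reclaim(cells, 0) == [True, True]             # nothing protected
    assert reclaim(cells, 2)[:2] == cells[:2]            # prefix untouched

check_boundaries()
print("boundary checks passed")
```

The point is the posture, not this particular helper: each trace feeds real values through the edges of the contract and tries to make the math fail.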

Testing layers

docs/HIGHLIGHTS.md describes eight engine test tiers. The important point is that this is not a single benchmark script. The QA stack spans API behavior, performance regression, quality gates, Windows builds, and live reporting.

| Layer | Purpose |
| --- | --- |
| C++ and server tests | Catch local runtime regressions early and keep the main code paths shippable. |
| API and contract checks | Keep endpoints, metrics, and capability surfaces aligned with the implementation. |
| Performance regression checks | Watch decode speed, memory, and compaction latency so fixes do not quietly degrade hot paths. |
| Quality gates | Use cosine and related checks to keep compaction quality inside a defensible envelope. |
| Platform coverage | Include Windows and the published Apple Silicon path, not just a single local machine. |
| Live dashboards and reporting | Make benchmark drift visible instead of hiding it in ad hoc local runs. |
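The cosine quality gate mentioned above can be sketched as a comparison of model outputs before and after compaction, failing the gate when similarity drops below a threshold. The 0.99 threshold and the toy vectors here are illustrative assumptions, not the fork's published gate values.

```python
import math

# Sketch of a cosine-similarity quality gate: compare a baseline output
# vector with the post-compaction output vector and fail if similarity
# falls below a threshold (threshold value is an assumption).

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def quality_gate(baseline: list[float], compacted: list[float],
                 threshold: float = 0.99) -> bool:
    return cosine(baseline, compacted) >= threshold

baseline = [0.12, 0.80, 0.05, 0.03]
compacted = [0.11, 0.81, 0.05, 0.03]
assert quality_gate(baseline, compacted)            # near-identical passes
assert not quality_gate(baseline, [0.8, 0.1, 0.05, 0.05])  # divergent fails
print("quality gate behaves as expected")
```

A gate like this turns "compaction quality looks fine" into a pass/fail signal that CI can enforce on every change.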

CI workflow set

The repo keeps a focused set of active workflows for build, server smoke, performance, Windows validation, dashboard/reporting, and upstream maintenance. Exact workflow composition can evolve, but the public principle is stable: upstream changes and runtime changes do not reach the main branch without automated gates.

| Workflow | Role |
| --- | --- |
| modelai-ci | Core build and test gate. |
| modelai-server-smoke | Server-path smoke coverage. |
| modelai-perf-smoke | Performance regression guard. |
| modelai-ci-windows | Windows MSVC validation. |
| modelai-dashboard | Benchmark aggregation/reporting path. |
| modelai-upstream-sync | Scheduled upstream merge flow with gating. |
| modelai-auto-label | Repository hygiene and routing support. |
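The gating principle above, that runtime changes do not land without automated gates, can be sketched as a check that every required workflow reported success before a merge. The workflow names come from the table; the status-mapping shape and the choice of which gates are "required" are assumptions for illustration, since in CI this would come from the hosting platform's checks API.

```python
# Sketch of the merge gate: a change may only land on the main branch
# when every required workflow has reported success. Which gates are
# required, and the dict shape, are assumptions for this sketch.

REQUIRED_GATES = [
    "modelai-ci",
    "modelai-server-smoke",
    "modelai-perf-smoke",
    "modelai-ci-windows",
]

def may_merge(statuses: dict[str, str]) -> bool:
    """True only if every required gate reports 'success'."""
    return all(statuses.get(gate) == "success" for gate in REQUIRED_GATES)

all_green = {gate: "success" for gate in REQUIRED_GATES}
assert may_merge(all_green)
assert not may_merge({**all_green, "modelai-perf-smoke": "failure"})
print("merge gate check ok")
```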

Upstream sync process

The fork is not maintained as a dead snapshot. docs/UPSTREAM-SYNC.md documents a weekly upstream sync, with CI and manual review gates when the upstream delta touches sensitive areas such as KV cache management, graph construction, or server behavior.

- Automated part: Fetch upstream, merge into the sync lane, run build and test gates, and stop if the change set touches risky runtime surfaces or fails checks.
- Manual part: Review KV-cache, graph, architecture-support, and endpoint changes before allowing them onto modelai-main.
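The automated/manual split above can be sketched as a classifier over the upstream delta's changed paths: touch a risky runtime surface and the merge is routed to manual review. The path patterns below are illustrative assumptions, not the fork's actual trigger list.

```python
# Sketch of sync gating: scan changed paths in an upstream delta and
# route to manual review when a risky runtime surface is touched.
# The substring patterns are assumptions for this sketch.

RISKY_PATTERNS = (
    "kv-cache",   # KV cache management
    "graph",      # graph construction
    "server",     # endpoint behavior
)

def needs_manual_review(changed_paths: list[str]) -> bool:
    return any(pattern in path
               for path in changed_paths
               for pattern in RISKY_PATTERNS)

# A docs-only delta can flow through the automated lane...
assert not needs_manual_review(["docs/README.md", "scripts/bench.py"])
# ...but a KV-cache touch stops it for manual review.
assert needs_manual_review(["src/llama-kv-cache.cpp"])
print("sync gating sketch ok")
```

In practice the real trigger list would live in the sync workflow itself, so the gate and the documentation cannot drift apart.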

Why this matters

Compaction can fail in ways users do not immediately see. A bad reclaim step can look like amnesia one request later. A bad state counter can make metrics lie about what is active. A host/device mismatch can turn into decode stalls at longer contexts. That is why the QA story here needs to be operational, not decorative.

| Evidence | Where it is documented |
| --- | --- |
| 29 distinct bug fixes | docs/BUGS-AND-FIXES.md |
| Benchmark summary and test counts | docs/HIGHLIGHTS.md |
| Sync workflow and manual gate | docs/UPSTREAM-SYNC.md |