Quality assurance

This fork is doing risky runtime work: KV cache mutation, compacted-prefix execution, server endpoints, and long-session state handling. The public QA story needs to show how that work is reviewed and tested, not just that the benchmark numbers look good.

The point of this page is not to claim perfection. It is to show the review posture, test layers, and bug history that make the published benchmark claims more credible.
- Documented bug fixes: 29 distinct fixes documented in docs/BUGS-AND-FIXES.md.
- Critical / major fixes: 18 crashes, quality-collapse, stale-state, and performance faults caught and fixed.
- C++ tests: 53 CI-gated C++ tests (66 total tests listed in docs/HIGHLIGHTS.md).
- Test tiers: 8 tiers, from unit/contract checks through performance and live dashboard reporting.

Primary QA sources: docs/HIGHLIGHTS.md, docs/BUGS-AND-FIXES.md, and docs/UPSTREAM-SYNC.md.

Adversarial review protocol

The review posture here is not “does the code look reasonable?” It is “assume there is a bug and try to find it.” That matters for compaction because the dangerous failures are not always obvious. They often look like silent context loss, wrong counts in metrics, or a crash that appears one request after a reclaim step.

| Review area | Why it matters |
| --- | --- |
| Scope and contract checks | Verify the implementation matches the declared slice and does not drift across repo or API boundaries. |
| Concrete execution traces | Force the reviewer to walk production, boundary, adversarial, security, and concurrency traces with real values. |
| State-machine review | Check before / during / after compaction and reclaim, especially for prefix matching and save/restore state. |
| Disprove-it pass | Try to break indexing, layout assumptions, integer math, and memory ownership instead of re-affirming the happy path. |
| Cross-repo contract review | Server endpoints, metrics, and runtime capability reporting have to match the fork and downstream consumers. |
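The "disprove-it pass" can be made concrete as small adversarial checks that probe boundaries instead of the happy path. The sketch below uses a toy reclaim model, written for this page only, to illustrate the kind of index-math edge cases a reviewer walks; it is not the fork's actual implementation.

```python
# Hypothetical "disprove-it" check: probe the boundary cases where
# compaction index math typically breaks (empty cache, full prefix,
# zero prefix). The reclaim model is illustrative only.

def reclaim(cells: list[bool], keep_prefix: int) -> list[bool]:
    """Drop unused cells past a protected prefix (toy model)."""
    if not 0 <= keep_prefix <= len(cells):
        raise ValueError("keep_prefix out of range")
    return cells[:keep_prefix] + [c for c in cells[keep_prefix:] if c]

def check_boundaries() -> None:
    assert reclaim([], 0) == []                          # empty cache
    cells = [True, False, True, False]
    assert reclaim(cells, len(cells)) == cells           # prefix covers all
    assert reclaim(cells, 0) == [True, True]             # nothing protected
    assert reclaim(cells, 2)[:2] == cells[:2]            # prefix untouched

check_boundaries()
print("boundary checks passed")
```

The point is the posture, not this particular helper: each trace feeds real values through the edges of the contract and tries to make the math fail.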

Testing layers

docs/HIGHLIGHTS.md describes eight engine test tiers. The important point is that this is not a single benchmark script. The QA stack spans API behavior, performance regression, quality gates, Windows builds, and live reporting.

| Layer | Purpose |
| --- | --- |
| C++ and server tests | Catch local runtime regressions early and keep the main code paths shippable. |
| API and contract checks | Keep endpoints, metrics, and capability surfaces aligned with the implementation. |
| Performance regression checks | Watch decode speed, memory, and compaction latency so fixes do not quietly degrade hot paths. |
| Quality gates | Use cosine and related checks to keep compaction quality inside a defensible envelope. |
| Platform coverage | Include Windows and the published Apple Silicon path, not just a single local machine. |
| Live dashboards and reporting | Make benchmark drift visible instead of hiding it in ad hoc local runs. |
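The cosine quality gate mentioned above can be sketched as a comparison of model outputs before and after compaction, failing the gate when similarity drops below a threshold. The 0.99 threshold and the toy vectors here are illustrative assumptions, not the fork's published gate values.

```python
import math

# Sketch of a cosine-similarity quality gate: compare a baseline output
# vector with the post-compaction output vector and fail if similarity
# falls below a threshold (threshold value is an assumption).

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def quality_gate(baseline: list[float], compacted: list[float],
                 threshold: float = 0.99) -> bool:
    return cosine(baseline, compacted) >= threshold

baseline = [0.12, 0.80, 0.05, 0.03]
compacted = [0.11, 0.81, 0.05, 0.03]
assert quality_gate(baseline, compacted)            # near-identical passes
assert not quality_gate(baseline, [0.8, 0.1, 0.05, 0.05])  # divergent fails
print("quality gate behaves as expected")
```

A gate like this turns "compaction quality looks fine" into a pass/fail signal that CI can enforce on every change.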

CI workflow set

The repo keeps a focused set of active workflows for build, server smoke, performance, Windows validation, dashboard/reporting, and upstream maintenance. Exact workflow composition can evolve, but the public principle is stable: upstream changes and runtime changes do not reach the main branch without automated gates.

| Workflow | Role |
| --- | --- |
| modelai-ci | Core build and test gate. |
| modelai-server-smoke | Server-path smoke coverage. |
| modelai-perf-smoke | Performance regression guard. |
| modelai-ci-windows | Windows MSVC validation. |
| modelai-dashboard | Benchmark aggregation/reporting path. |
| modelai-upstream-sync | Scheduled upstream merge flow with gating. |
| modelai-auto-label | Repository hygiene and routing support. |
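The gating principle above, that runtime changes do not land without automated gates, can be sketched as a check that every required workflow reported success before a merge. The workflow names come from the table; the status-mapping shape and the choice of which gates are "required" are assumptions for illustration, since in CI this would come from the hosting platform's checks API.

```python
# Sketch of the merge gate: a change may only land on the main branch
# when every required workflow has reported success. Which gates are
# required, and the dict shape, are assumptions for this sketch.

REQUIRED_GATES = [
    "modelai-ci",
    "modelai-server-smoke",
    "modelai-perf-smoke",
    "modelai-ci-windows",
]

def may_merge(statuses: dict[str, str]) -> bool:
    """True only if every required gate reports 'success'."""
    return all(statuses.get(gate) == "success" for gate in REQUIRED_GATES)

all_green = {gate: "success" for gate in REQUIRED_GATES}
assert may_merge(all_green)
assert not may_merge({**all_green, "modelai-perf-smoke": "failure"})
print("merge gate check ok")
```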

Upstream sync process

The fork is not maintained as a dead snapshot. docs/UPSTREAM-SYNC.md documents a weekly upstream sync, with CI and manual review gates when the upstream delta touches sensitive areas such as KV cache management, graph construction, or server behavior.

- Automated part: Fetch upstream, merge into the sync lane, run build and test gates, and stop if the change set touches risky runtime surfaces or fails checks.
- Manual part: Review KV-cache, graph, architecture-support, and endpoint changes before allowing them onto modelai-main.
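The automated/manual split above can be sketched as a classifier over the upstream delta's changed paths: touch a risky runtime surface and the merge is routed to manual review. The path patterns below are illustrative assumptions, not the fork's actual trigger list.

```python
# Sketch of sync gating: scan changed paths in an upstream delta and
# route to manual review when a risky runtime surface is touched.
# The substring patterns are assumptions for this sketch.

RISKY_PATTERNS = (
    "kv-cache",   # KV cache management
    "graph",      # graph construction
    "server",     # endpoint behavior
)

def needs_manual_review(changed_paths: list[str]) -> bool:
    return any(pattern in path
               for path in changed_paths
               for pattern in RISKY_PATTERNS)

# A docs-only delta can flow through the automated lane...
assert not needs_manual_review(["docs/README.md", "scripts/bench.py"])
# ...but a KV-cache touch stops it for manual review.
assert needs_manual_review(["src/llama-kv-cache.cpp"])
print("sync gating sketch ok")
```

In practice the real trigger list would live in the sync workflow itself, so the gate and the documentation cannot drift apart.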

Why this matters

Compaction can fail in ways users do not immediately see. A bad reclaim step can look like amnesia one request later. A bad state counter can make metrics lie about what is active. A host/device mismatch can turn into decode stalls at longer contexts. That is why the QA story here needs to be operational, not decorative.

| Evidence | Where it is documented |
| --- | --- |
| 29 distinct bug fixes | docs/BUGS-AND-FIXES.md |
| Benchmark summary and test counts | docs/HIGHLIGHTS.md |
| Sync workflow and manual gate | docs/UPSTREAM-SYNC.md |