February 27, 2026 in Engineering by Unbound Force8 minutes
Most teams gate their CI pipelines on line coverage. The metric is simple: what percentage of your code was executed during tests? But execution is not verification. A test suite can achieve 90% line coverage without containing a single meaningful assertion — every function runs, but nothing checks whether the results are correct.
This is not a theoretical concern. Research by Zhang and Mesbah analyzing 6,700 test suites across major open-source Java projects found that assertion coverage — the fraction of code whose output is actually evaluated by an assertion — is the strongest predictor of fault-detection effectiveness, far more than line or branch coverage alone. Separate work on “checked coverage” (which traces backward from assertions to the code that influenced them) has shown that a codebase boasting 85% statement coverage can have as little as 20% checked coverage. The gap between “code that ran” and “code whose behavior was verified” is enormous.
Line coverage creates two additional problems:
It punishes refactoring. When you restructure internal logic — extract a helper, merge functions, optimize an algorithm — line coverage numbers shift even if external behavior is unchanged. Worse, tests that assert on implementation details (mock call counts, internal method invocations, intermediate state) break on every refactor, even when the function still does exactly what it is supposed to do.
It incentivizes the wrong behavior. When teams are told to raise coverage by 10%, the easiest path is writing assertion-free tests that exercise code paths without checking outcomes. The coverage number goes up. Confidence in the test suite should not.
Gaze introduces Contract Coverage: the ratio of a function’s design responsibilities that its tests actually verify.
Contract Coverage = contractual side effects asserted on / total contractual side effectsThe key insight is that not every observable effect of a function matters equally. Some effects are the reason the function exists — its contractual obligations to the rest of the system. Others are implementation details that could change during a refactor without affecting correctness. Gaze distinguishes between them.
Gaze operates as a static analysis tool for Go. Its core side effect detection and contract classification require no test execution — they work entirely from source code. (The crap subcommand can optionally invoke go test -coverprofile to generate line coverage data for the traditional CRAP formula, but the contract analysis itself is purely static.)
Gaze works in three stages:
Side Effect Detection. Gaze analyzes a function’s AST and SSA representation to enumerate every observable side effect: return values, error returns, pointer mutations, file I/O, database writes, channel operations, goroutine spawns, logging, and more — organized into a 5-tier taxonomy (P0 through P4) covering 38 distinct effect types.
Contract Classification. Each detected side effect is classified as contractual (part of the function’s design responsibility), incidental (implementation detail), or ambiguous. Classification uses deterministic mechanical signals — interface satisfaction, caller dependency analysis within the module, exported API surface visibility, naming conventions, and godoc comments — each contributing a weighted score to a 0-100 confidence value. Optionally, an external workflow using OpenCode agents can analyze project documentation (architecture docs, specs, READMEs) to refine classification confidence further — this is not a built-in feature of the Gaze binary, but a supplementary process that feeds refined classifications back into the analysis.
Assertion Mapping. Gaze maps each test assertion back to the side effect it verifies, then computes the ratio. It also computes an Over-Specification Score: the count of incidental side effects the test asserts on. These are the assertions that will break during a refactor even when the function’s contract is preserved.
Contract Coverage tells you whether the test verifies what the function is supposed to do. A score of 100% means every contractual obligation is checked. A score of 50% means half of the function’s responsibilities are unguarded.
Over-Specification Score tells you how fragile the test is. A score of 0 means the test only asserts on contractual effects — it will survive any refactor that preserves behavior. A score of 3 means three assertions are tied to implementation details and will break when internals change.
Neither metric cares how many lines of code were executed.
Consider a Go function:
func SaveUser(ctx context.Context, db *sql.DB, user *User) error {
log.Printf("saving user %s", user.ID)
_, err := db.ExecContext(ctx, "INSERT INTO users ...", user.Name, user.Email)
if err != nil {
return fmt.Errorf("save user: %w", err)
}
go invalidateCache(user.ID)
return nil
}Gaze detects four side effects:
| Side Effect | Type | Classification |
|---|---|---|
Return nil on success | ReturnValue | Contractual |
| Return wrapped error on failure | ErrorReturn | Contractual |
| Database INSERT | DatabaseWrite | Contractual |
| Log message | LogWrite | Incidental |
The goroutine spawn and log write are implementation details — the system’s callers depend on the database write happening and the error being returned, not on whether a log line was emitted or how cache invalidation is scheduled internally.
Scenario A: The test checks the error return and queries the database to verify the row exists.
Scenario B: The test only checks err == nil.
Scenario C: The test checks err == nil and also asserts on the log output.
Line coverage would report the same number for all three scenarios. Contract Coverage reveals that Scenario B and C are dangerously undertested, and that Scenario C is also brittle.
Gaze also computes composite risk scores. The industry-standard CRAP formula combines cyclomatic complexity with line coverage:
CRAP(m) = complexity^2 * (1 - lineCoverage/100)^3 + complexityGaze introduces GazeCRAP, which substitutes contract coverage for line coverage:
GazeCRAP(m) = complexity^2 * (1 - contractCoverage/100)^3 + complexityFor SaveUser with complexity 5:
| Metric | Scenario A (100% contract cov, 90% line cov) | Scenario B (33% contract cov, 90% line cov) |
|---|---|---|
| CRAP | 5.0 | 5.0 |
| GazeCRAP | 5.0 | 12.5 |
Traditional CRAP says both scenarios look safe. GazeCRAP reveals that Scenario B carries real risk — the function is complex enough to matter, and its contractual obligations are mostly unverified.
Gaze classifies functions into four quadrants based on both scores:
| Approach | What It Measures | Refactoring Resilience | Runtime Cost | Scope |
|---|---|---|---|---|
| Line/Branch Coverage | Code execution paths | Low — shifts on any structural change | Low (instrumentation) | Unit level |
| Mutation Testing | Assertion strength via fault injection | Moderate-High — validates post-facto | High (recompile per mutant) | Unit level |
| Checked Coverage | Dynamic backward slice from assertions | Moderate — traces data flow | High (execution tracing) | Unit level |
| API Contract Testing (Pact) | Network boundary agreements | High — decoupled from internals | Low | Service boundaries only |
| Gaze Contract Coverage | Design responsibility verification | High — anchored to role, not code | None for contract analysis; optional test execution for CRAP scoring | Unit level |
Key differentiators:
Static at its core. Gaze’s side effect detection and contract classification analyze source code without running tests — no runtime overhead, no instrumentation, no recompilation. The crap subcommand can optionally run go test to collect line coverage data for the traditional CRAP formula, but the contract analysis pipeline itself is purely static. Gaze works alongside your existing test runner.
Proactive, not reactive. Mutation testing validates assertion strength by injecting faults after tests exist. Gaze establishes what the assertions should cover based on the function’s design role, identifying gaps before they become escaped defects.
Unit-level contracts. Tools like Pact enforce contracts at API boundaries between services. Gaze pushes the contract paradigm down to individual functions and methods within a codebase.
Refactoring-resilient by design. Because the metric is anchored to the function’s role (what it is responsible for) rather than its code lines (how it does it), Contract Coverage remains stable across refactors that preserve behavior. Rewrite the internals of SaveUser entirely — as long as it still writes to the database and returns errors correctly, the score does not change and the tests do not break.
Gaze provides threshold flags for pipeline enforcement:
# Fail the build if any test covers less than 80% of its target's contract
gaze quality ./internal/analysis --min-contract-coverage=80
# Fail if any test asserts on more than 2 incidental side effects
gaze quality ./internal/analysis --max-over-specification=2
# Fail if more than 5 functions are in the CRAP danger zone
gaze crap ./... --max-crapload=5
# Same threshold, but using contract coverage instead of line coverage
gaze crap ./... --max-gaze-crapload=3Without threshold flags, Gaze runs in report-only mode (exit code 0) — useful for initial adoption and exploration before enforcing gates.
Gaze also includes a self-check command that runs its entire analysis pipeline against its own codebase, serving as both a dogfooding mechanism and a project health dashboard.
A --min-contract-coverage=80 gate answers a fundamentally different question than a --min-coverage=80 gate. The traditional gate asks: “Did the tests touch 80% of the code?” Gaze’s gate asks: “Do the tests verify at least 80% of what each function is responsible for?”
The first question can be satisfied by assertion-free tests that game the metric. The second cannot. Contract Coverage requires that test assertions exist and that they map to the function’s actual design obligations — the side effects that the rest of the system depends on. There is no way to inflate the number without writing tests that genuinely verify behavior.
Combined with the Over-Specification gate, teams get a two-sided quality signal: tests must verify enough of the contract (coverage floor) while avoiding assertions on implementation details (brittleness ceiling). Functions that satisfy both constraints produce tests that are simultaneously thorough and refactoring-resilient — the precise outcome that line coverage was always supposed to deliver but structurally cannot.