TL;DR: We built the nold-ai/specfact-code-review module using 10 structured OpenSpec change proposals. While doing it, we encountered exactly the category of issues the module is designed to catch: conflicts invisible to LLM-based review and undetectable by unit tests. Four cross-cutting structural conflicts surfaced during overlap analysis — none of them would have been caught by AI review, quality gates, or tests alone. The irony was instructive.
The Setup: Building the Reviewer with the Reviewer
There's something satisfying about dogfooding at the design level.
We set out to build nold-ai/specfact-code-review — a new installable SpecFact CLI module that enforces code quality through a multi-layer analysis stack (ruff, basedpyright, pylint, radon, semgrep, icontract, crosshair), assigns numeric reward/penalty scores, persists a quality ledger in Supabase, and auto-updates a house_rules.md skill file that feeds back into every future AI coding session.
The module was developed following our own OpenSpec workflow: 10 structured change proposals (SP-001 through SP-009, plus infrastructure), each with explicit acceptance criteria, tdd-first flags, contract requirements, and file inventories. Standard discipline. Standard tooling.
And yet — before a single line of implementation was written — a pre-implementation overlap analysis surfaced four structural conflicts that no automated tool in our pipeline would have caught.
The Conflicts AI Review Didn't Catch
Conflict 1: Parallel JSON Schema Ownership
SP-001 proposed a ReviewReport Pydantic model with its own verdict, score, and reward_delta fields. Clean, self-contained, reasonable design.
What it didn't account for: an existing pending change (governance-01-evidence-output) had already staked out ownership of the canonical evidence JSON envelope — schema_version, run_id, timestamp, overall_verdict, ci_exit_code — and explicitly declared itself authoritative for the envelope contract.
Two independent JSON schemas for the same evidence concept. LLM review would have looked at SP-001 in isolation and called it fine. Tests would have passed. The conflict only becomes visible when you look across the full change graph.
Resolution: ReviewReport was redesigned as a governance-01-compliant extension — wrapping review-specific fields (score, reward_delta, findings[]) inside the standard envelope rather than duplicating it. A blocked-by: [governance-01-evidence-output] dependency was added.
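A minimal Pydantic sketch of that resolution: the envelope field names come from governance-01 as described above, while the model names, field types, and the shape of a Finding are illustrative assumptions, not the module's actual code.

```python
from datetime import datetime
from pydantic import BaseModel, Field

class Finding(BaseModel):
    """One review finding; this shape is illustrative."""
    rule: str
    message: str
    blocking: bool = False

class EvidenceEnvelope(BaseModel):
    """Canonical envelope fields owned by governance-01-evidence-output."""
    schema_version: str
    run_id: str
    timestamp: datetime
    overall_verdict: str
    ci_exit_code: int

class ReviewReport(EvidenceEnvelope):
    """Review-specific extension: wraps review fields inside the
    standard envelope instead of defining a parallel schema."""
    score: int
    reward_delta: int
    findings: list[Finding] = Field(default_factory=list)

report = ReviewReport(
    schema_version="1.0",
    run_id="run-001",
    timestamp=datetime(2025, 1, 1),
    overall_verdict="PASS",
    ci_exit_code=0,
    score=82,
    reward_delta=5,
)
```

The key design point: `ReviewReport` inherits the envelope rather than redefining it, so there is exactly one owner of the envelope contract.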
Conflict 2: Parallel Skill Infrastructure
SP-007 proposed creating .claude/skills/house-rules.md directly — a reasonable shortcut to get the house rules context file into Claude Code sessions.
What it bypassed: two existing changes (ai-integration-01-agent-skill and ai-integration-04-intent-skills) had already defined the canonical skill file standard — skills/specfact-*/SKILL.md, installed via specfact ide skill install --type <name>, following the open Agent Skills format (~80 tokens at rest, full instructions on activation).
A skill file dropped directly into .claude/skills/ would have worked technically but violated the installation contract. An AI reviewer reading SP-007 in isolation would have no reason to flag it. Tests couldn't detect it. Only the cross-change overlap analysis revealed the conflict.
Resolution: SP-007 was rerouted to create skills/specfact-code-review/SKILL.md in the specfact-cli repo, installed via specfact ide skill install --type code-review — consistent with the established pattern.
Conflict 3: Duplicate Ownership of CLAUDE.md
SP-007 also proposed modifying CLAUDE.md to reference the house rules skill. Reasonable. Direct. Effective.
Except: ai-integration-03-instruction-files owns CLAUDE.md modifications as part of specfact ide setup --platforms cursor,copilot,claude,windsurf. Two change proposals writing to the same file via different paths is a merge-time collision waiting to happen — and a maintainability problem afterward.
An AI reviewer asked to look at either change individually would have no visibility into the other. The only way to catch this is to reason across the full set of in-flight changes simultaneously.
Resolution: The direct CLAUDE.md modification was removed from SP-007 entirely. The skill is picked up automatically when ai-integration-03 lands. No CLAUDE.md surgery needed from the review module.
Conflict 4: Missing Integration Requirements
This one wasn't a conflict — it was a gap. Three new CLI commands (specfact code review run, ledger, rules) were fully specified with acceptance criteria and file inventories. But nobody had flagged that every new command in the system needs:
- Scenario YAML files in cli-val-01 format (happy paths + anti-patterns)
- Anti-pattern entries in the cli-val-03 misuse safety catalog
- CI gate exit code semantics aligned to cli-val-05's tiered gate model
Without the overlap analysis, SP-008 would have shipped without these scenario files. The acceptance test runner (cli-val-04) and CI gates (cli-val-05) would have had no coverage for the new commands.
Resolution: SP-008 was explicitly required to include cli-val-01 scenario YAML files for all three command groups, with exit codes (0/1/2) mapped to the tiered gate model.
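The coverage requirement can be stated as data. This sketch is hypothetical throughout: the real cli-val-01 scenarios are YAML and their schema is not shown in this post; the `--bad-flag` anti-pattern, the field names, and the dict layout are all invented for illustration.

```python
# cli-val-05 tiered gate model: exit code -> tier
TIERS = {0: "PASS", 1: "WARN", 2: "BLOCK"}

# One happy path per command group, plus an anti-pattern (illustrative only)
scenarios = [
    {"group": "run",    "args": ["code", "review", "run"],              "expect_exit": 0},
    {"group": "ledger", "args": ["code", "review", "ledger", "status"], "expect_exit": 0},
    {"group": "rules",  "args": ["code", "review", "rules", "update"],  "expect_exit": 0},
    {"group": "run",    "args": ["code", "review", "run", "--bad-flag"], "expect_exit": 2},
]

# The gap the overlap analysis closed: every command group must be covered,
# and every expected exit code must map to a defined gate tier.
groups_covered = {s["group"] for s in scenarios}
assert groups_covered == {"run", "ledger", "rules"}
assert all(s["expect_exit"] in TIERS for s in scenarios)
```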
Why These Weren't Caught by Standard Review
This is the part worth understanding.
AI-based review (Claude, Copilot, etc.) evaluates code and design proposals in context windows. It's excellent at spotting logic errors, type mismatches, missing error handling, and style violations within a given file or PR. What it cannot do reliably is maintain a complete graph of in-flight changes across multiple change proposals and reason about cross-proposal ownership conflicts. The context window is too narrow and the change graph too implicit.
Unit tests and quality gates run against implementation. They catch runtime bugs, contract violations, coverage gaps, and complexity violations. They tell you nothing about design-level conflicts that exist before a line of implementation is written.
Manual PR review can catch these — if reviewers happen to know about the other in-flight changes and are looking for conflicts. In practice, reviewers are focused on the PR in front of them.
What caught all four conflicts was structured overlap analysis — a deliberate, systematic pass across the full set of active change proposals before implementation started. It's the kind of analysis a senior architect does on a whiteboard before a sprint. The difference is that OpenSpec change proposals make this analysis mechanical: the full change graph is in structured files, so the analysis can be done explicitly and documented.
| Review Method | Conflict 1 (parallel schema) | Conflict 2 (parallel skill path) | Conflict 3 (CLAUDE.md) | Conflict 4 (missing scenario files) |
|---|---|---|---|---|
| AI-based review (per-PR) | ✗ | ✗ | ✗ | ✗ |
| Unit tests / quality gates | ✗ | ✗ | ✗ | ✗ |
| Manual PR review | Maybe | Maybe | Maybe | Maybe |
| Structured cross-change overlap analysis | ✓ | ✓ | ✓ | ✓ |
The Module Itself: What It Enforces
The nold-ai/specfact-code-review module enforces code quality across four layers:
| Layer | Tools | What It Catches |
|---|---|---|
| Static analysis | ruff (C90, S/Bandit), radon, basedpyright strict | Complexity, security, type contracts |
| Architecture | pylint, semgrep custom rules | Bare except, print() in src/, cross-layer calls |
| Contracts | icontract AST scan, crosshair (2s fast pass) | Missing decorators, solver counterexamples |
| TDD gate | pytest-json, coverage.py | Missing test files, coverage < 80% |
The scoring algorithm deducts per violation (-15 for blocking errors, -2 for warnings) and awards bonuses for clean runs (+5 for zero complexity violations, +5 for 90%+ coverage, etc.). The result is a reward_delta that feeds a persistent Supabase ledger and auto-updates a house_rules.md skill injected into every future coding session.
Exit codes map directly to CI gates:
- `0` PASS — score ≥ 70, no blocking violations
- `1` WARN — score 50–69, advisory only
- `2` BLOCK — score < 50 or any blocking violation
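The numbers above are enough to sketch the scoring and gating in a few lines. The deltas, bonuses, and thresholds come from the post; the baseline of 100 and the exact aggregation order are assumptions, and the real implementation may differ.

```python
def score_run(blocking_errors: int, warnings: int,
              zero_complexity: bool, coverage: float) -> tuple[int, int, str]:
    """Sketch of the scoring rules described above.
    Returns (score, ci_exit_code, verdict)."""
    score = 100                     # assumed baseline
    score -= 15 * blocking_errors   # -15 per blocking error
    score -= 2 * warnings           # -2 per warning
    if zero_complexity:
        score += 5                  # bonus: zero complexity violations
    if coverage >= 0.90:
        score += 5                  # bonus: 90%+ coverage
    if blocking_errors > 0 or score < 50:
        return score, 2, "BLOCK"
    if score < 70:
        return score, 1, "WARN"
    return score, 0, "PASS"

score, code, verdict = score_run(0, 3, True, 0.92)
```

A clean-ish run (three warnings, both bonuses) lands well above the PASS threshold, while a single blocking error forces exit code 2 regardless of score.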
The Feedback Loop
The really interesting part is what happens over time.
Each code review run updates the reward ledger. The ledger tracks cumulative coins, pass/block streaks, and top violations by frequency. The specfact code review rules update command reads the last 20 runs and derives the current house_rules.md — a ≤35-line context file encoding the violations this codebase actually produces, updated automatically.
That file gets injected into every AI coding session. So the AI's behavior on the next session is shaped by what the review found in the last one. Violations that appear ≥3 times surface in the rules. Rules with zero hits for 10 consecutive runs get pruned.
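The surface-and-prune pass described above can be sketched as a counting problem. The thresholds (last 20 runs, ≥3 hits to surface, 10 stale runs to prune, ≤35 lines) come from the post; the rule-text format and the derivation logic itself are assumptions about an internal mechanism.

```python
from collections import Counter

def derive_house_rules(runs: list[list[str]], max_lines: int = 35) -> list[str]:
    """Sketch of the rules-update pass: each element of `runs` is the list
    of violation IDs one review run produced."""
    recent = runs[-20:]                               # look at the last 20 runs
    counts = Counter(v for run in recent for v in run)
    seen_last_10 = {v for run in recent[-10:] for v in run}
    rules = [
        f"Avoid: {violation} (seen {n}x in recent runs)"
        for violation, n in counts.most_common()
        if n >= 3 and violation in seen_last_10       # surface >=3 hits; prune stale
    ]
    return rules[:max_lines]                          # keep the file <= 35 lines
```

A violation that dominated early runs but has had zero hits in the last ten gets pruned even if its total count is still above the surface threshold.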
The module doesn't just catch issues — it closes the loop between review findings and AI behavior, session over session.
What This Means for AI-Assisted Development
We built a code-review module and discovered, in the process of building it, exactly the class of issues it's designed to catch in code: structural conflicts that are invisible to LLMs and undetectable by tests.
The lesson isn't that AI review is useless. It's that AI review + tests + quality gates is not a complete stack. What's missing is the layer that reasons across the full change graph, across multiple proposals, across organizational ownership. That's what structured code review — whether done by a senior architect or enforced by tooling — adds.
The nold-ai/specfact-code-review module is one piece of that stack: the automated, per-run, per-file enforcement layer. The overlap analysis that found the four conflicts before implementation started is another piece: the architectural, cross-change reasoning layer.
Neither replaces the other. Both are necessary.
Getting Started
```bash
# Install the module
specfact module install nold-ai/specfact-code-review

# Run review on changed files
specfact code review run

# Pipe to ledger
specfact code review run --json | specfact code review ledger update

# Check your quality ledger
specfact code review ledger status

# Auto-update house rules from findings
specfact code review rules update
```