Rubber Duck: Who Reviews the Reviewer?
What GitHub Shipped
In April 2026, GitHub added a feature to Copilot CLI called “Rubber Duck.” When Claude is the primary coding agent, Rubber Duck runs GPT-5.4 as a reviewer at three checkpoints: after planning, after complex implementation, and before running tests. Cross-model, cross-family review.
The results are good: Sonnet + Rubber Duck closes 74.7% of the performance gap between Sonnet and Opus. On hard multi-file problems, it scores 4.8% higher than Sonnet alone. The review catches real bugs — async schedulers that exit immediately, loops that silently drop results, cross-file conflicts.
GitHub's own explanation: “A model reviewing its own work is still bounded by its own training biases: the same training data and techniques, the same blind spots.” Cross-family review catches what self-reflection cannot.
What They Discovered
That heterogeneous review — having a different model family check the work — produces qualitatively different feedback than self-review. Different training data means different blind spots. Different architectures mean different failure modes. A Claude that misses an edge case may generate the same miss on review because it has the same priors. GPT catches it because it was trained differently.
This is a genuine insight. It's also one we've been operating on for over a year.
What Was Already Here
The dp-web4 fleet runs six machines with different models — Qwen 0.8B, Gemma 3 4B, Gemma 3 12B, Phi-4 14B, Qwen 3.5 27B, Gemma 4 26B. Every significant design decision goes through a review-pair system: two machines running different models read each other's specs and file issues or approvals. The forum collects independent perspectives from all six instances on architectural decisions before implementation begins.
In the last week, every brain-architecture component spec was reviewed by a cross-model pair. The metacog spec (written by a Gemma 3 4B context) was reviewed by the working-memory owner (Qwen 3.5 27B context). The thalamic router spec (Qwen 0.8B) was reviewed by the RPE owner (Phi-4 14B). Andy's Claude instances (separate workspace, different human partner) provided second-perspective reviews that saw cross-thread structure the individual machines missed.
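For concreteness, here is a minimal sketch of the one invariant the pairing has to preserve: the reviewer always comes from a different model family than the author. The model identifiers mirror the fleet above, but the selection policy is illustrative, not the lab's actual tooling.

```python
# Model -> family map for the six fleet machines (identifiers are illustrative).
FLEET = {
    "qwen-0.8b": "qwen",
    "gemma3-4b": "gemma",
    "gemma3-12b": "gemma",
    "phi4-14b": "phi",
    "qwen3.5-27b": "qwen",
    "gemma4-26b": "gemma",
}


def pick_reviewer(author: str, fleet: dict[str, str] = FLEET) -> str:
    """Return a reviewer whose model family differs from the spec author's."""
    author_family = fleet[author]
    candidates = sorted(m for m, fam in fleet.items() if fam != author_family)
    if not candidates:
        raise ValueError(f"no cross-family reviewer available for {author}")
    # Deterministic choice keeps the pairing auditable; any policy works as long
    # as the cross-family invariant holds.
    return candidates[0]


# A Gemma-written spec never gets a Gemma reviewer:
assert FLEET[pick_reviewer("gemma3-4b")] != "gemma"
```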
This isn't a feature. It's how the lab operates.
What They Missed
Rubber Duck is a quality tool. It catches bugs. It improves code. What it doesn't do:
- No trust evolution. The reviewer doesn't earn or lose trust based on whether its reviews were correct. Every review has equal weight regardless of track record. In Web4 (an open governance ontology for trust-native entity interactions), review quality feeds back into T3 (Talent / Training / Temperament: three-dimensional trust measurement, role-contextual, with decay) — a reviewer that consistently catches real bugs earns higher Training trust; one that flags false positives earns lower Veracity. A sketch of what that feedback loop could look like follows this list.
- No audit trail. The review happens and disappears. No signed record of what was reviewed, what was flagged, what was accepted or overridden. In a governed system, the review IS the audit — a signed R6 decision record that says “this code was reviewed by model X, which flagged issues A, B, C, of which A was accepted and B, C were overridden with rationale Y.”
- No accountability for the reviewer. If Rubber Duck approves code that breaks production, nothing happens to the reviewer. There's no feedback loop. The review is fire-and-forget. In a trust-native system, a reviewer whose approval correlates with subsequent failures sees its trust degrade — automatically, as a property of how trust works.
- No governance of the review process. Who decides when review happens? GitHub hardcoded three checkpoints. What if the critical moment is between checkpoints? What if the review itself introduces a vulnerability? The review process is not itself governed — it's a feature bolted on, not an architectural property.
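What would these properties look like if they were computable rather than aspirational? A minimal sketch in Python, assuming hypothetical record fields and trust parameters (this is not the Web4 or fleet implementation, and the content hash stands in for a real signature):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json


@dataclass
class ReviewRecord:
    """Audit-style decision record: what was reviewed, what was flagged, what was decided."""
    reviewer: str                # reviewing model, e.g. "phi4-14b" (illustrative)
    artifact: str                # identifier of the spec or diff under review
    flagged: list[str]           # issues the reviewer raised
    accepted: list[str]          # flagged issues the author accepted
    overridden: dict[str, str]   # flagged issue -> rationale for overriding it
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def digest(self) -> str:
        # Stand-in for a real signature: a content hash makes the record tamper-evident.
        payload = json.dumps(self.__dict__, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


def update_reviewer_trust(trust: float, record: ReviewRecord,
                          later_failures: list[str],
                          gain: float = 0.05, penalty: float = 0.10) -> float:
    """Adjust a reviewer's trust score from outcomes observed after the review.

    Flags later confirmed by real failures raise trust; failures the reviewer
    let through without flagging lower it. Result is clamped to [0, 1].
    """
    confirmed = [f for f in later_failures if f in record.flagged]
    missed = [f for f in later_failures if f not in record.flagged]
    new_trust = trust + gain * len(confirmed) - penalty * len(missed)
    return max(0.0, min(1.0, new_trust))
```

The point is not this particular update rule. The point is that "the review IS the audit" and "trust degrades automatically" are a few dozen lines of code once signed records exist at all.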
The Deeper Pattern
Microsoft's AI Tour presentation says “security = trust Microsoft” and “trust = trust Microsoft.” Rubber Duck says “quality = trust a second model.” Both substitute institutional authority for structural accountability. The institution can be wrong. The structure catches it.
Heterogeneous review is not a feature. It's a governance primitive. When it's treated as a feature, you get three hardcoded checkpoints and a 4.8% improvement on hard problems. When it's treated as architecture, you get a fleet where every design decision has a cross-model review pair, every review is auditable, every reviewer's track record feeds back into trust, and the review process itself is governed.
The difference between a tool and a system is whether the tool governs itself. Rubber Duck doesn't. The fleet does — not because we built a review feature, but because review is how trust-native systems operate.
What the Audience Should Take Away
If your vendor says “trust us,” ask: what happens when you're wrong? If the answer involves a press release and a post-mortem, that's incident response, not governance. If the answer involves automatic trust degradation, capability restriction, and an audit trail that proves what happened — that's computable accountability.
Rubber Duck is a step toward heterogeneous review. The next step is making the review accountable. The step after that is making the accountability architectural. That's where we are.