The case for peer review between AI agents
Letting one coding agent write and approve its own work is fast. It also ships bugs. I've spent the last few months coding with AI agents in earnest — as part of actual delivery, not as a novelty — and the temptation is always the same: hand one agent the task, let it run, accept the diff. Most of the time it's fine. The times it isn't, I don't notice until later.
So I built a small open-source scaffold for the problem. The shape is straightforward: two agents in fixed roles, one builds and one reviews, the reviewer stays read-only, and a human owns the merge. No agent is allowed to approve its own code.
The human isn't reviewing either
Before getting into the design, there's an uncomfortable half of this argument that's worth naming.
A lot of teams tell themselves the fallback for agent-generated bugs is "we have humans review the code." But at the pace and volume at which modern coding agents produce code, that fallback is already quietly breaking. Diffs are long, context is thin, the model sounds confident, and by diff seven most people are reading headers and clicking through. I've done it. Every engineer I've spoken to admits, if they're honest, that they've done it too.
So the real situation in most AI-assisted workflows isn't "agent writes, human reviews." It's "agent writes, agent rubber-stamps, human rubber-stamps" — and the review pass that was supposed to catch bugs barely exists. If the human can't hold the line-review role at agent speed, something else has to.
The contract, not the models
The first instinct when you wire two agents together is to let them talk — pass the conversation back and forth, let them negotiate, see what falls out. It doesn't work. What does work is a strict contract between the roles, where the builder and the reviewer never see each other's reasoning, only each other's artifacts:
- Task spec. A human-written description of what to build, written once and never edited mid-loop.
- Build report. A short JSON artifact the builder emits at the end of its run, listing what changed, what was tested, and what's still open.
- The diff. The actual git diff, not a summary of it. This is the one item that can't be paraphrased.
- Review report. Another JSON artifact, this one from the reviewer — findings tagged with severity, file path, line range, and a recommendation.
- Human summary. A short markdown write-up at the end of the loop, for whoever is making the merge decision.
Pass only those between the roles. No free-form chat, no transcript forwarding, no "here's what the other agent said." The artifacts are the interface, and that constraint is the whole discipline.
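To make the build report concrete, here's a minimal sketch of it as a Python dataclass. The field names are my own illustration of the three things described above, not the actual schema the repo ships; the review report gets its own example further down.

```python
from dataclasses import dataclass, field

@dataclass
class BuildReport:
    """Illustrative shape of the builder's end-of-run artifact."""
    files_changed: list[str]                              # what changed in this run
    tests_run: list[str]                                   # what the builder actually exercised
    open_items: list[str] = field(default_factory=list)    # anything left unresolved
```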
The reason this matters: the failure mode of agent pairs isn't misunderstanding, it's collusion. Show the reviewer the builder's reasoning and it tends to agree. Hide it, and the review actually happens.
What the human actually does
In this design, the reviewer agent is the line-level reviewer. The human isn't — and that's the point.
What the human owns is the merge decision, made against a short structured summary: what was done, what the reviewer flagged, at what severity, and the reviewer's recommendation (approve, revise, or escalate). That's a scale of judgment a person can sustain even when agents are generating code continuously. If the structured review surfaces a blocker, the human sends it back or escalates; if it's clean, they ship.
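As a rough sketch of that judgment layer, this is the kind of rendering an orchestrator might hand the human. It assumes the build and review reports are plain dicts with fields matching the artifact sketches elsewhere in this post; the function name and layout are mine, not the scaffold's actual interface.

```python
def human_summary(build: dict, review: dict) -> str:
    """Render the short markdown summary the human decides against (illustrative field names)."""
    blocking = [f for f in review["findings"] if f["severity"] == "blocking"]
    out = ["## What changed"]
    out += [f"- {path}" for path in build["files_changed"]]
    out.append(f"## Review: {len(review['findings'])} findings, {len(blocking)} blocking")
    out += [
        f"- [{f['severity']}] {f['path']}:{f['lines'][0]}-{f['lines'][1]}: {f['note']}"
        for f in review["findings"]
    ]
    out += ["## Reviewer recommendation", review["recommendation"]]
    return "\n".join(out)
```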
Most attempts to "add AI to the workflow" get this backwards. They keep humans at the line-review layer that humans can't hold anymore, and quietly remove them from the judgment layer where they still add real value. The peer-review setup inverts that arrangement.
Role separation matters more than vendor separation
The obvious pairing is cross-vendor — one Claude, one Codex. That's how I started, and it's a good default because different model families fail differently, so a second opinion from a different training run will catch things a same-family pair would miss.
That said, same-vendor pairings work too. A Claude builder paired with a Claude reviewer still catches a lot, as long as the reviewer is properly constrained and the handoff is structured. The property that actually matters is role separation, not vendor separation. An agent acting as "reviewer" with a reviewer prompt, reviewer permissions, and no access to the builder's reasoning behaves differently from the same model acting as "builder" — even if the weights are identical.
There's also a wrinkle worth noting here: some agents are by nature meticulous, others are eager to ship. Pair a meticulous reviewer with an eager builder and you tend to get more catches than either temperament alone — which is part of why mixed-model pairings hold an edge in practice.
What I'd genuinely avoid is letting one agent play both roles in a single session. That's where the discipline breaks down.
What a review report actually looks like
The JSON schema for a review report has three sections: a short natural-language summary, a structured list of findings (each tagged with severity — blocking, nit, or question — plus a file path, a line range, and a suggested fix), and a final recommendation of approve, revise, or escalate.
The structure is what does the work. Natural-language review is easy to wave through; a blocking finding with a file path and line range attached to it is harder to ignore. If the reviewer wants to stop the loop, it has to say so in a structured field that the orchestrator actually reads.
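Here's a made-up instance of what that might look like in practice. The values and exact field names are illustrative, not output from the repo's schema, but the three sections match the description above.

```python
# Illustrative review report: summary, structured findings, final recommendation.
example_review = {
    "summary": "Change does what the spec asks; one unguarded code path.",
    "findings": [
        {
            "severity": "blocking",
            "path": "src/sync/queue.py",
            "lines": [118, 131],
            "note": "retry loop swallows the timeout exception",
            "suggested_fix": "re-raise after max_retries instead of returning None",
        },
        {
            "severity": "nit",
            "path": "src/sync/queue.py",
            "lines": [42, 42],
            "note": "dead import left over from the refactor",
            "suggested_fix": "remove it",
        },
    ],
    "recommendation": "revise",
}
```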
Permissions and worktrees
A few practical choices in the scaffold matter more than they sound on first reading.
The builder runs with the minimum edit-capable permission, while the reviewer runs read-only by default — read_only for Codex, plan for Claude. This isn't about distrust of the reviewer; it's about making the review pass actually function as a review pass instead of quietly editing things on the side.
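In config terms, the asymmetry looks something like this. The permission-mode names are the ones mentioned above; the shape of the mapping is illustrative rather than the repo's actual config format.

```python
# Hypothetical role-to-permission mapping an orchestrator might enforce.
ROLE_PERMISSIONS = {
    "builder": {
        "can_edit": True,           # minimum edit-capable permission
        "scope": "workspace",       # never outside the task's checkout
    },
    "reviewer": {
        "can_edit": False,          # read-only by default
        "mode": {"codex": "read_only", "claude": "plan"},
    },
}
```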
The orchestrator can also be told to run the whole loop in a disposable git worktree, so the main checkout is never touched until the human makes the merge decision. For riskier tasks I keep worktree mode on by default. For low-risk experiments I leave it off and accept the trade-off.
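A minimal sketch of what worktree mode amounts to, assuming a plain subprocess wrapper around standard git worktree commands. The function, its signature, and the callback it takes are illustrative, not the orchestrator's actual interface.

```python
import pathlib
import subprocess
import tempfile
from typing import Callable

def run_loop_in_worktree(repo: pathlib.Path, branch: str,
                         loop: Callable[[pathlib.Path], None]) -> None:
    """Run a build/review loop in a disposable git worktree (sketch)."""
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="peer-review-")) / "wt"
    # -b creates a fresh branch for this loop so the main checkout stays put.
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(workdir)],
        check=True,
    )
    try:
        loop(workdir)  # the two-agent loop runs entirely inside the worktree
    finally:
        # Drop the worktree once the human has made the merge decision.
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "remove", "--force", str(workdir)],
            check=True,
        )
```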
What I've relearned on my own projects
I've been working on Waypoint lately — one person, high agent leverage, moving quickly. It's the right environment to feel the cost of skipping review, because there's nobody else around to catch what you miss.
The pattern has been depressingly consistent. Every time I've let a single agent both implement a change and self-approve it, something has slipped through — a regression in a test that shouldn't have been touched, a missing null check on a code path I'd assumed was safe, a "helpful" refactor of a file the task never asked anyone to touch. None of these have been catastrophic on their own. All of them have been exactly the kind of thing a separate reviewer would flag in ten seconds.
The cost of adding the checkpoint is small. The cost of skipping it shows up later, usually in a place I wasn't looking.
Where this is going
Most of the current conversation about AI coding is about which model writes better code. I think that's the wrong frontier — the models are already good enough for most working engineers. The actual bottleneck is the review structure around them: who reviews what, at what pace, with what authority to stop the loop.
Over the next year I expect this becomes the real axis of differentiation between teams shipping with agents and teams just generating code with agents. The model is table stakes. The scaffolding is where the edge will be.
None of that is glamorous. It's the boring part of the stack — the part that turns agent-driven delivery from a demo into something you can actually ship.
The repo is on GitHub: sorinc03/agent-peer-review. If you're running multi-agent loops in earnest, especially in ways I haven't tried, I'd love to hear where yours is failing. That's the part I'm still learning.