2026-02-17 | PreviewProof Team

How to Test What an AI Coding Agent Built — Beyond the Diff

AI coding agents, verification, testing, verified software delivery

A coding agent finishes a task. The diff looks reasonable. The tests pass. CI is green. Someone clicks merge.

Was the change verified?

The honest answer most teams give is “verified enough.” Tests pass. A teammate skimmed the diff. Nobody saw an obvious problem. That was a reasonable bar when code was produced at human speed by humans with context. It’s no longer a reasonable bar, and the gap between what teams call verification and what verification actually requires is the most important quality problem in the industry right now.

This post is about what fills the gap. It’s deliberately opinionated.

Why “the tests pass” isn’t enough anymore

Tests verify that the code does what the test author expected. They are a regression net, not a correctness oracle.

When humans wrote both the code and the tests, the gap between “tests pass” and “code is correct” was bounded by the human’s understanding. If they got the intent right, they wrote tests that captured it. If they got it wrong, a code review by a second human would often catch the mismatch.

When an agent writes the code and the tests in the same session, the gap is unbounded. Both artifacts reflect whatever interpretation the model produced. The tests confirm the code. The code satisfies the tests. Coverage looks great. The agent’s understanding is internally consistent and externally wrong, and no second human is in the loop to notice.

This isn’t hypothetical. Across the agent-authored PRs we’ve seen on many teams, passing tests tell you little about whether the change is correct in the way the requestor wanted. Sometimes tests catch real bugs. Sometimes they ratify the agent’s misunderstanding. There’s no reliable way to tell which without independent verification.
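A toy illustration of that failure mode, as a sketch rather than a real incident: the requirement, the function name, and the dollar amounts are all invented for the example. The agent misreads "over $100" as "at or above $100", then writes a test that encodes the same misreading.

```python
# Hypothetical requirement from the requestor:
#   "Orders over $100 get free shipping."
# The agent read "over" as "at or above" and wrote both the code and the test.


def shipping_cost(order_total: float) -> float:
    """Agent's implementation: free shipping at $100 and above."""
    if order_total >= 100:
        return 0.0
    return 7.5


def test_shipping_cost() -> None:
    """Agent's test: both branches run, so line and branch coverage are 100%."""
    assert shipping_cost(100) == 0.0    # encodes the agent's misreading
    assert shipping_cost(99.99) == 7.5


def requestor_check() -> bool:
    """What the requestor meant: an order of exactly $100 still pays shipping."""
    return shipping_cost(100) == 7.5


test_shipping_cost()        # passes: the code and the test agree with each other
print(requestor_check())    # False: neither agrees with the requestor
```

Nothing in the test suite or the coverage report distinguishes this from a correct change. Only someone who knows what "over $100" was supposed to mean can.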

Why “the diff looks fine” isn’t enough either

Agent-generated code looks confident. Idiomatic patterns, sensible variable names, consistent style. It reads like it was written by a senior engineer, because the model was trained on senior engineers. The reviewer’s pattern-matching brain comes back with “looks fine,” because the surface really does look fine.

The problem is that “looks fine” was a useful signal when polished code usually meant a human had thought the problem through. With AI authorship, surface plausibility is the default, not a signal of underlying correctness.

We’ve written about this in “vibe coding without a safety net” and “AI writes the code, who tests it?”. The surface signals reviewers used to rely on are no longer reliable.

What actually-good verification looks like

Verification has to happen against the running system, with realistic data, exercised by humans who have context about what was supposed to happen.

Behavioral testing in a preview environment with realistic seed data. The change is deployed to an isolated, ephemeral environment that looks enough like production for behavioral surprises to surface. Seed data exercises the feature surface: empty states, full states, edge cases, every major flag combination. See “seed data for AI pull requests” for what “rich enough” means.
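One way to keep seed data from quietly skipping a combination is to enumerate the matrix rather than hand-pick cases. The sketch below assumes two boolean feature flags and four data states; the flag names and state labels are placeholders, not part of any real system.

```python
from itertools import product

# Hypothetical flags and data states for one feature; substitute your own.
FLAGS = {"new_checkout": [False, True], "bulk_export": [False, True]}
DATA_STATES = ["empty", "typical", "full", "edge_cases"]


def seed_scenarios(flags: dict, states: list) -> list:
    """Return one scenario per (flag combination, data state) pair."""
    names = list(flags)
    scenarios = []
    for values in product(*(flags[name] for name in names)):
        for state in states:
            scenarios.append(
                {"flags": dict(zip(names, values)), "data_state": state}
            )
    return scenarios


scenarios = seed_scenarios(FLAGS, DATA_STATES)
print(len(scenarios))   # 16: 4 flag combinations x 4 data states
print(scenarios[0])     # {'flags': {'new_checkout': False, 'bulk_export': False}, 'data_state': 'empty'}
```

The full product grows quickly as flags are added, so in practice a team would likely prune it to the combinations that matter. Starting from the full list makes each omission a deliberate choice.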

Structured stakeholder review against original requirements. The PM, designer, or requestor looks at the running version and confirms it matches intent. Not a thumbs-up in Slack — a structured sign-off attached to the PR, against a specific deployable artifact. The verification checklist makes this concrete.

Explicit sign-off with evidence capture. Approval is named, timestamped, and tied to the artifact (image digest, not “the code around April 27”). The evidence is durable and exportable. This isn’t compliance theater — it’s the only way to maintain the verification standard when PR volume scales past what a small team can hold in their heads.
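As a rough sketch of what “named, timestamped, and tied to the artifact” can mean in data terms, the record below refuses a sign-off that lacks a reviewer name or that points at a mutable tag instead of an image digest. The field names and the `record_sign_off` helper are assumptions made for illustration. The `sha256:` followed by 64 hex characters is the usual form of a container image digest.

```python
import json
import re
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Container image digests take the form "sha256:<64 lowercase hex chars>".
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


@dataclass(frozen=True)
class SignOff:
    reviewer: str          # a named person, not "the team"
    role: str              # e.g. "pm" or "designer" (hypothetical labels)
    artifact_digest: str   # the exact build that was reviewed
    signed_at: str         # ISO 8601 timestamp in UTC


def record_sign_off(reviewer, role, artifact_digest, now=None):
    """Create a sign-off record, rejecting anonymous or untied approvals."""
    if not reviewer.strip():
        raise ValueError("a sign-off must name a reviewer")
    if not DIGEST_RE.match(artifact_digest):
        raise ValueError("a sign-off must reference an image digest, not a tag")
    now = now or datetime.now(timezone.utc)
    return SignOff(reviewer, role, artifact_digest, now.isoformat())


def export(sign_offs) -> str:
    """Serialize records to JSON so the evidence can leave the tool."""
    return json.dumps([asdict(s) for s in sign_offs], indent=2)


entry = record_sign_off("Dana Example", "pm", "sha256:" + "a" * 64)
print(export([entry]))
```

Passing a tag such as `myapp:latest` raises a `ValueError`, which is the point: a tag can be repointed after the review, while a digest cannot.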

These three together are a reasonable working definition of verification for agent-authored code. They aren’t optional extras on top of “the tests pass.” They’re the verification that “the tests pass” was always supposed to be a substitute for, and never quite was.

The failure modes of the alternatives

Over-relying on tests. “We have 95% coverage.” Coverage measures what gets executed, not what gets verified. Agent-authored tests covering agent-authored code can hit 100% coverage and miss every intent error.

Over-relying on diff review. “We require two approvals.” Two engineers reading a diff for an agent-authored change are doing the same activity in parallel. Neither has more context than the other. Redundancy adds latency, not verification.

Conflating “merged” with “verified.” Git history says “merged on April 14.” That’s evidence someone clicked a button, not evidence of verification.

Treating “it deployed to staging fine” as verification. Staging is shared, stale, and where multiple changes get batched. “No one screamed” is not a signal.

Treating production as the test environment. Feature flags catch some classes of bugs and miss others, particularly intent mismatches, which show up as user complaints six weeks later rather than as exceptions.

The verification step has to be explicit

Verification, when implicit, decays. Teams start with informal verification, agent-authored PR volume increases, the informal version doesn’t scale, and the team unconsciously redefines “verified” downward to match what they can sustain.

The fix is to make verification explicit, structured, and measured. A specific named step in the workflow, with specific named people, against a specific running artifact, captured in a specific evidence log. This is what we mean by verified software delivery — a workflow upgrade that closes the gap between “AI shipped this” and “we verified what AI shipped.”
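A minimal sketch of such a step as a merge gate, assuming sign-offs are stored as simple records with a role and an artifact digest. The role names, the record shape, and the function names are illustrative, not a description of any particular tool. Approvals given against an earlier build do not count toward the current one.

```python
def missing_sign_offs(required_roles, sign_offs, artifact_digest):
    """Return the required roles that have not signed off on this exact artifact."""
    signed = {
        s["role"] for s in sign_offs if s["artifact_digest"] == artifact_digest
    }
    return set(required_roles) - signed


def can_merge(required_roles, sign_offs, artifact_digest):
    """The gate: every required role has approved the artifact being merged."""
    return not missing_sign_offs(required_roles, sign_offs, artifact_digest)


# Hypothetical digests for two builds of the same PR.
OLD_BUILD = "sha256:" + "1" * 64
NEW_BUILD = "sha256:" + "2" * 64

sign_offs = [
    {"role": "pm", "reviewer": "Dana Example", "artifact_digest": OLD_BUILD},
    {"role": "designer", "reviewer": "Lee Example", "artifact_digest": NEW_BUILD},
]

required = {"pm", "designer"}
print(sorted(missing_sign_offs(required, sign_offs, NEW_BUILD)))  # ['pm']
print(can_merge(required, sign_offs, NEW_BUILD))                  # False
```

Here the PM approved an earlier build, the agent pushed another commit, and the gate correctly reports that the PM has not seen what is about to merge.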

What this looks like in your week

Once the workflow is in place, the per-PR cost is lower than teams expect:

  • Per-PR previews deploy automatically; no engineer effort per PR.
  • Verification checklists come from the acceptance criteria for the change, rather than being reinvented each time.
  • Stakeholder reviewers click a link, walk through the feature, and sign off without context-switching into the diff.
  • Evidence — who reviewed what, against which artifact — is captured as a side-effect, not a separate paperwork step.
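To make the checklist point concrete, here is a small sketch in which the checklist is derived from a change’s acceptance criteria and each confirmation records who made it. The criteria text and helper names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChecklistItem:
    criterion: str
    confirmed_by: Optional[str] = None   # filled in when a reviewer confirms it


def build_checklist(acceptance_criteria: List[str]) -> List[ChecklistItem]:
    """One checklist item per non-empty acceptance criterion."""
    return [ChecklistItem(c.strip()) for c in acceptance_criteria if c.strip()]


def confirm(item: ChecklistItem, reviewer: str) -> None:
    """Mark an item confirmed; the reviewer's name is the captured evidence."""
    item.confirmed_by = reviewer


def outstanding(checklist: List[ChecklistItem]) -> List[str]:
    """Criteria nobody has confirmed yet."""
    return [item.criterion for item in checklist if item.confirmed_by is None]


# Hypothetical acceptance criteria copied from the ticket.
criteria = [
    "Export button is hidden for read-only users",
    "Empty workspace shows the onboarding prompt",
    "",
]

checklist = build_checklist(criteria)
confirm(checklist[0], "Dana Example")
print(outstanding(checklist))   # ['Empty workspace shows the onboarding prompt']
```

The reviewer never opens the diff. They walk the running feature, confirm items, and the record of who confirmed what accumulates along the way.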

When this is in place, verification stops being the bottleneck and becomes the steering mechanism. Throughput goes up. Production incidents go down. Audits become tractable.

The honest opinion

Most teams shipping AI-authored code today are not verifying it. They’re hoping. The hope works often enough that it doesn’t feel like a problem — until it doesn’t, and the cost of missing verification shows up all at once: a production incident, a failed audit, a feature that quietly didn’t work for a month.

The verification gap is real. AI didn’t create it; AI made it visible. Closing it is the highest-impact quality work an engineering team can do in 2026.

A short, honest plug

You can build all of this — preview environments, structured review, evidence capture — yourself. We’ve written about most of the pieces. It’s a real project, but it’s tractable.

If you’d rather not, PreviewProof is what we built. Per-PR previews with realistic seed data, structured verification checklists, named stakeholder sign-off, and a tamper-evident evidence log. The verification layer between your code editor and your production deploy. Preview it. Prove it.