2026-02-02 | PreviewProof Team

The Verification Checklist for Autonomous Coding Agents

Tags: AI coding agents, verification, code review, autonomous agents

A coding agent finishes a task. The diff compiles, the type checker is happy, the test suite is green. The PR description reads plausibly. The reviewer pulls up the change and asks the question that no longer has a quick answer:

Is this correct?

For human-authored code, the diff was usually enough to tell. The author had context, the reviewer had context, and the diff was a faithful representation of what changed. None of those hold reliably for agent-authored code. The diff is the surface. Verification has to happen against something else.

This is the checklist we use. It’s designed to be usable as-is or adapted. None of these steps require special tooling — though most are dramatically faster when the preview environment workflow is set up well.

Step 1: Behavioral verification in a running preview

The first thing to do with an agent-authored PR is not to read it. It’s to run it.

Open the preview. Navigate to the feature the PR claims to implement. Click the buttons. Submit the form. Trigger the new code path.

Two failure modes show up here that the diff would never have revealed:

  • The feature doesn’t work end-to-end. A missing integration, a wrong env var, a mismatched API contract, or a forgotten migration means the running app behaves differently from what the diff suggests.
  • The feature works, but not the way the spec asked. The agent matched a pattern from training data instead of the specific intent.

If you can’t run the change, the rest of this checklist is theater.

Step 2: Intent matching against the original request

Pull up whatever captured the original request: ticket, Slack message, design doc, or prompt. Then read what the PR says it did. Then compare both against what the running preview actually does. The three should match. They often don’t.

Classic failure: the request said “add a way to bulk-archive old projects.” The PR description says “Adds a bulk-archive action to the projects list.” The preview adds a bulk-archive button that archives selected projects — but the request was specifically about old projects, with an implicit age threshold, and the agent built generic bulk-archive without any age filter.

The diff is correct. The tests pass. The intent is wrong. Only intent matching catches this, and it works best when the check is phrased concretely. A free-form “does it match the intent?” is too easy to answer “yes” to. A specific “the bulk-archive UI exposes an age filter and defaults to projects older than 90 days” is harder to fake.
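
One way to make an intent statement hard to fake is to write it as an executable assertion. The sketch below is illustrative only: the Project type, the last_active field, and the function name are hypothetical, and the 90-day threshold is simply the figure from the example above, not a rule from any real codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: "old" means no activity for 90 days, as in the example above.
DEFAULT_AGE_DAYS = 90


@dataclass(frozen=True)
class Project:
    # Hypothetical shape; a real project record will have more fields.
    name: str
    last_active: datetime


def default_bulk_archive_selection(projects, now, age_days=DEFAULT_AGE_DAYS):
    """Return the projects a bulk-archive UI should preselect by default."""
    cutoff = now - timedelta(days=age_days)
    return [p for p in projects if p.last_active < cutoff]


now = datetime(2026, 2, 2, tzinfo=timezone.utc)
projects = [
    Project("legacy-importer", now - timedelta(days=200)),
    Project("billing-v2", now - timedelta(days=3)),
]
selected = default_bulk_archive_selection(projects, now)

# The intent, stated so that a generic bulk-archive would fail it:
assert [p.name for p in selected] == ["legacy-importer"]
```

An agent that built bulk-archive with no age filter would preselect both projects, and the assertion would fail. That is the property a free-form intent question lacks.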

Step 3: Side-effect detection

Read the diff. Not for correctness — for scope.

Agents touch files they weren’t asked to touch. Look for:

  • Files modified outside the obvious scope. A feature PR that touches auth middleware. A UI change that modifies migration order. A doc fix with a CI config change. Each needs a conscious decision about whether the side-effect was intended.
  • Configuration drift. tsconfig.json, .eslintrc, package.json engines, Docker base images. Agents will quietly “modernize” these.
  • Schema or migration changes. Any schema change needs an explicit call-out — bad schema changes cost much more than bad UI changes.
  • Reformatting unrelated code. 30 lines of feature change buried in 270 lines of reformatting hides the actual review surface.

If the agent touched anything you didn’t ask it to touch, treat that as its own change and review it independently — or revert it.
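
The scope pass can be partly mechanized before anyone reads a line of the diff. The following is a minimal sketch, assuming the changed paths are already available as a list (for example from `git diff --name-only <base>...HEAD`). The glob patterns and file paths are hypothetical and would need to be adapted per repository and per task.

```python
from fnmatch import fnmatch

# Hypothetical patterns: where this particular task was expected to land.
EXPECTED_SCOPE = ["src/projects/*", "tests/projects/*"]

# Hypothetical patterns: files that always deserve a conscious decision.
SENSITIVE = [
    "tsconfig.json",
    ".eslintrc*",
    "package.json",
    "Dockerfile*",
    "*/migrations/*",
    "*/auth/*",
]


def classify_changed_files(paths, expected=EXPECTED_SCOPE, sensitive=SENSITIVE):
    """Bucket changed paths into expected, sensitive, and unexpected."""
    buckets = {"expected": [], "sensitive": [], "unexpected": []}
    for path in paths:
        # Sensitive wins over expected so in-scope config edits are still flagged.
        if any(fnmatch(path, pattern) for pattern in sensitive):
            buckets["sensitive"].append(path)
        elif any(fnmatch(path, pattern) for pattern in expected):
            buckets["expected"].append(path)
        else:
            buckets["unexpected"].append(path)
    return buckets


changed = [
    "src/projects/bulkArchive.ts",
    "tests/projects/bulkArchive.test.ts",
    "src/auth/middleware.ts",
    "tsconfig.json",
    "docs/changelog.md",
]
report = classify_changed_files(changed)
for bucket, files in report.items():
    print(bucket, files)
```

Anything in the sensitive or unexpected buckets is a prompt for a human decision, not an automatic rejection. The script narrows where to look; it does not judge whether the change was intended.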

Step 4: Regression awareness

Agents rarely reason about callers. They modify a function and trust the test suite to catch downstream breakage. If your coverage is incomplete (and it almost always is), regressions slip through.

Practical checks:

  • Search for callers of any modified function. Spot-check at least one. Does the new behavior still satisfy the caller’s expectations?
  • For UI changes, walk through adjacent flows. The agent fixed checkout — does the cart page still work? The order confirmation?
  • For database changes, check existing queries that hit the modified table.

This is where seed data quality matters most. With rich, behaviorally diverse seeds — see seed data for AI pull requests — regressions surface as you click around. Without them, they ship.
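
The caller search from the practical checks above can be sketched with Python’s standard ast module. This version handles Python sources only and matches by name, so it is an assumption-laden starting point and not a substitute for an editor’s “find references”. The sample source and the function names in it are made up for illustration.

```python
import ast


def find_call_sites(source, function_name, filename="<source>"):
    """Return sorted (filename, line) pairs where function_name is called.

    Matches plain calls f(...) and attribute calls obj.f(...) by name only,
    so it can over-report when unrelated functions share the name.
    """
    sites = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id
        else:
            # ast.Attribute carries the method name in .attr; others have none.
            name = getattr(func, "attr", None)
        if name == function_name:
            sites.append((filename, node.lineno))
    return sorted(sites)


# Hypothetical module in which the agent modified archive().
example = '''
def archive(project):
    project.archived = True

def bulk_archive(projects):
    for p in projects:
        archive(p)

def nightly_cleanup(store):
    store.archive(store.oldest())
'''
sites = find_call_sites(example, "archive", "projects.py")
print(sites)
```

Each reported line is a candidate for the spot-check: open it and ask whether the caller still gets what it expects from the modified function.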

Step 5: Edge case probing

Spend three minutes deliberately trying to break it:

  • Empty state. Empty list, empty form, user with zero of the thing.
  • Maximum state. A thousand items, a 500-character name, a file at the upload limit.
  • Partial state. Half-completed records, blank optional fields, dependent records in unusual states.
  • Adversarial input. Quotes in inputs, emoji where the agent tested ASCII, URLs with extra slashes or malformed UUIDs.
  • Concurrency. Two tabs, same record. Does it degrade gracefully?

Agents reliably build for the happy path the prompt described. Edge case probing is the human’s job.
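
Keeping a small, labeled set of probe values at hand makes the three minutes go further. The sketch below generates probes for a free-text name field. The 500-character limit mirrors the example above and is an assumption; the real limit should come from the schema or the validation rules of the app under review.

```python
def edge_case_names(max_length=500):
    """Return labeled probe values for a free-text name field.

    max_length is an assumed field limit, not a value read from any schema.
    """
    return {
        "empty": "",
        "whitespace_only": "   ",
        "at_limit": "a" * max_length,
        "over_limit": "a" * (max_length + 1),
        "quotes": "O'Brien \"Q4\" <plan>",
        "emoji": "Roadmap \U0001F680",
    }


probes = edge_case_names()
for label, value in probes.items():
    print(f"{label}: {len(value)} chars")
```

Paste each value into the preview and watch what happens at save, at render, and in any list or search view that displays the field. The concurrency and partial-state probes still need a human with two tabs open.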

Step 6: Stakeholder alignment for UX-affecting changes

If the change touches anything a non-engineer cares about — UI, copy, flow, error messages — the engineer running this checklist is not the right final reviewer. The PM, designer, or requesting stakeholder needs to look at the running preview.

This is the step most teams quietly skip. The engineer eyeballs the UI, decides it looks fine, and merges. Three days later the PM notices the button label is wrong. See structured stakeholder sign-off for AI-authored PRs: name the right humans for the right scope, give them a preview URL, and don’t merge until they’ve signed off.

Putting the checklist to work

A checklist in a Google Doc gets forgotten. The checklist works when it’s structured into the review workflow itself — visible on the PR, attached to a specific preview deployment, with explicit pass/fail items.

That structure is what makes it survive at scale. At five PRs a week, keeping the checklist in someone’s head is fine. At fifty PRs a week, it needs to be in the tool, with named approvers and captured evidence.

Concretely:

Pre-merge verification (PR #4231, preview-4231.app):
[x] Behavioral check: feature runs end-to-end (eng)
[x] Intent match: bulk-archive filters by 90-day age (pm)
[x] Scope check: no unintended file modifications (eng)
[x] Regression: adjacent flows still work (eng)
[x] Edge cases: empty list, 1000 items, concurrent edits (eng)
[x] Stakeholder review: UX matches design (design)
[x] Sign-off captured against image digest sha256:abc123

That’s the bar. Six checks, each tied to a named role and backed by evidence, plus a sign-off pinned to the exact image that was reviewed. It’s not heavy. It’s the difference between “the agent shipped something” and “the team verified what the agent shipped.”
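
For teams wiring this into their own tooling, the checklist above maps onto a small data structure and a merge gate. This is a sketch under stated assumptions, not how any particular product implements it: the CheckItem fields, the merge_blockers function, and the evidence strings are all hypothetical, and the digest values are the placeholder from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckItem:
    # Hypothetical record for one line of the pre-merge checklist.
    name: str
    approver_role: str  # e.g. "eng", "pm", "design"
    passed: bool
    evidence: str = ""  # link or note; empty means nothing was captured


def merge_blockers(items, signed_digest, deployed_digest):
    """Return human-readable reasons the PR is not yet verified."""
    blockers = []
    for item in items:
        if not item.passed:
            blockers.append(f"{item.name}: not passed ({item.approver_role})")
        elif not item.evidence:
            blockers.append(f"{item.name}: no evidence captured")
    # A sign-off only counts for the image that was actually reviewed.
    if signed_digest != deployed_digest:
        blockers.append("sign-off digest does not match the deployed image")
    return blockers


items = [
    CheckItem("Behavioral check", "eng", True, "screen recording"),
    CheckItem("Intent match", "pm", True, "comment on PR"),
    CheckItem("Stakeholder review", "design", False),
]
blockers = merge_blockers(items, "sha256:abc123", "sha256:abc123")
print(blockers)
```

Comparing digests is the part that is easy to forget. If the agent pushes another commit after sign-off, the preview image changes and the earlier approval no longer describes what would be merged.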

For the broader argument about why diff-reading and test-passing aren’t enough, see testing AI coding agent output beyond the diff and what verified software delivery means.

A short, honest plug

You can run this checklist with sticky notes and Slack messages. Plenty of teams do at first.

If you’d rather have the checklist live where the work happens — attached to the PR, tied to the specific preview, with named approvers and an evidence log that survives an audit — that’s what we built PreviewProof for. Per-PR previews, structured verification, captured sign-off. Preview it. Prove it.