Agent PR Benchmark

Status: in progress · Last updated: 2026-06-10

Generic AI review claims are noise. CodeVetter is building a public, hand-labeled set of real agent-generated diffs so catch-rate numbers mean something.

Methodology

20–30 public PRs where an agent made the primary code changes
Each finding hand-labeled: bug, regression, style-only, or false positive
Tools scored on recall@severity, not vibes
Dataset published when v1 threshold (≥10 cases) is met
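
The page does not define recall@severity, so the following is only a minimal sketch under stated assumptions: recall is computed separately for each severity level, only findings labeled bug or regression count toward the denominator, and the severity scale ("high", "low"), the Finding fields, and the recall_by_severity helper are all hypothetical names used for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Labels come from the methodology list; treating only these two as
# "real defects" for recall purposes is an assumption.
DEFECT_LABELS = {"bug", "regression"}


@dataclass(frozen=True)
class Finding:
    finding_id: str  # hypothetical identifier for a hand-labeled finding
    label: str       # "bug", "regression", "style-only", or "false positive"
    severity: str    # hypothetical scale, e.g. "high", "medium", "low"


def recall_by_severity(findings, caught_ids):
    """Return {severity: fraction of labeled defects the tool caught}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for f in findings:
        if f.label not in DEFECT_LABELS:
            continue  # style-only and false-positive labels are excluded
        totals[f.severity] += 1
        if f.finding_id in caught_ids:
            hits[f.severity] += 1
    return {sev: hits[sev] / totals[sev] for sev in totals}


# Made-up example data, not taken from the benchmark.
findings = [
    Finding("f1", "bug", "high"),
    Finding("f2", "regression", "high"),
    Finding("f3", "bug", "low"),
    Finding("f4", "style-only", "low"),
]
scores = recall_by_severity(findings, caught_ids={"f1", "f4"})
print(scores)
```

In this made-up example the tool catches one of two high-severity defects and misses the only low-severity one; flagging the style-only finding does not raise its recall.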

Contribute

Have an agent PR with a known bug the reviewer missed? Open an issue with a link to the diff. No proprietary code is required; sanitized excerpts are welcome.

Try CodeVetter

Download for macOS · GitHub