Agent PR Benchmark
Status: in progress · Last updated: 2026-06-10
Generic AI review claims are noise. CodeVetter is building a public, hand-labeled set of real agent-generated diffs so catch-rate numbers mean something.
Methodology
- 20–30 public PRs where an agent made the primary code changes
- Each finding hand-labeled: bug, regression, style-only, or false positive
- Tools scored on recall@severity, not vibes
- Dataset will be published once the v1 threshold (≥10 cases) is met
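
The scoring step above can be sketched in a few lines. This is a minimal illustration, not CodeVetter's actual scorer: it assumes recall@severity means "the share of hand-labeled real defects at or above a given severity that a tool flagged". The `Finding` record, the `severity` field, the low/medium/high scale, and the function name `recall_at_severity` are all hypothetical; only the four labels come from the methodology list.

```python
from dataclasses import dataclass

# Labels from the methodology list that count as real defects.
REAL_DEFECT_LABELS = {"bug", "regression"}

# Hypothetical severity scale; the benchmark's real scale is not specified.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class Finding:
    finding_id: str
    label: str     # "bug", "regression", "style-only", or "false positive"
    severity: str  # assumed field; must be a key of SEVERITY_RANK


def recall_at_severity(findings, caught_ids, min_severity):
    """Fraction of real defects at or above min_severity that a tool flagged.

    Returns None when no labeled defect meets the threshold, so an empty
    bucket is not reported as a perfect or a zero score.
    """
    threshold = SEVERITY_RANK[min_severity]
    relevant = [
        f for f in findings
        if f.label in REAL_DEFECT_LABELS
        and SEVERITY_RANK[f.severity] >= threshold
    ]
    if not relevant:
        return None
    caught = sum(1 for f in relevant if f.finding_id in caught_ids)
    return caught / len(relevant)


# Made-up example data, for illustration only.
labeled = [
    Finding("f1", "bug", "high"),
    Finding("f2", "regression", "medium"),
    Finding("f3", "style-only", "low"),
    Finding("f4", "bug", "low"),
]
tool_caught = {"f1", "f3"}  # IDs of findings the tool flagged

for level in ("high", "medium", "low"):
    print(level, recall_at_severity(labeled, tool_caught, level))
```

Under these assumptions, style-only and false-positive findings never enter the denominator, so a tool earns no recall credit for flagging them. Whether the real benchmark scores per severity level or cumulatively ("at or above") is a design choice this sketch does not settle.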
Contribute
Have an agent PR with a known bug the reviewer missed? Open an issue with a link to the diff. No proprietary code is required; sanitized excerpts are welcome.