REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
New benchmark evaluates whether agentic AI can reliably assess reproducibility in social science papers, revealing key strengths and failure modes.
Agentic AI is moving into research pipelines, but can it reliably verify reproducibility? REPRO-Bench tests this head-on with 100+ social science studies, tasking agents with re-running analyses, spotting errors, and flagging irreproducible results. The outcome: top agents reach roughly 60% accuracy but falter on statistics-heavy tasks, and they cost far more to run than simple baselines.
What changed. REPRO-Bench is the first agent benchmark for social science reproducibility; it shows agents beat simple baselines but still need stronger statistical reasoning.
Why it matters. As agents automate research QA, unreliable reproducibility checks risk propagating bad science at scale.
Builder takeaway. Use REPRO-Bench to harden your agents for research tasks: favor cheap, specialized verifiers over bloated generalists. Paper