Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown
No Priors: Artificial Intelligence | Technology | Startups


36 minutes

When a new AI model drops, it’s judged against a static benchmark grid that doesn’t account for how long the model is allowed to think. How, then, should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah Guo about his latest essay on why the AI industry’s traditional benchmark grids are broken and how large-scale test-time compute is fundamentally changing how models are evaluated. Noam explains how, if properly scaffolded, today’s models can reason for weeks or even months on complex tasks. He also discusses real-world implications of test-time compute, from building poker solver bots to disproving legendary math conjectures. The two unpack the large gaps in current AI safety frameworks, explore the bottlenecks for recursive self-improvement, and look ahead at the future of multi-agent collaboration and global knowledge sharing.

Read more: Implications of Large-Scale Test-Time Compute

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @polynoamial | @OpenAI

Chapters:

00:00 – Cold Open

00:43 – Noam Brown Introduction

01:23 – Why Benchmarks Are Broken

04:19 – Compute Budgets and Projections

05:34 – How Long Should Models Think?

06:47 – Benchmark-Maxxing

08:34 – Using Poker Bots as Evals

11:26 – Safety Evals When Model Capability Scales With Budget 

14:41 – Release Cycle vs. Agent Runtime 

17:06 – Latent Model Capability 

20:59 – Limits on Recursive Self-Improvement

27:09 – Large-Scale Multi-Agent Coordination 

29:11 – Competition at the Frontier 

31:51 – Breaking the Benchmark Grid Equilibrium 

33:29 – Why Benchmarks Should Be Evaluated by Cost

36:18 – Conclusion