Augmentic BV
Haaswijkweg Oost 12B
3319 GC Dordrecht
The Netherlands
METR's RE-Bench benchmark shows that AI agents can sometimes outperform human experts in R&D tasks. Models such as Claude 3.5 Sonnet and o1-preview excel especially in short, technical tasks. The benchmark helps track the progress and risks of autonomous AI in research, which is essential for responsible use in R&D.
The article "Evaluating frontier AI R&D capabilities of language model agents against human experts" introduces RE-Bench, a new benchmark developed by METR to measure the performance of both human experts and advanced AI models on machine learning (ML) research and engineering tasks. The benchmark comprises seven environments, each built around a distinct research task, such as deriving a scaling law or optimizing a GPU kernel. The tasks were selected in collaboration with ML researchers from academia and industry to ensure realism and diversity.
The evaluation compared AI agents, such as Anthropic's Claude 3.5 Sonnet and OpenAI's o1-preview, with human experts. At a two-hour time budget, the AI agents generally outperformed the human participants. At longer time budgets, however, the human experts took the lead, suggesting that AI agents struggle to use extended working time effectively and to respond appropriately to new information.
A striking result came from the task in which agents had to write a custom GPU kernel to reduce the execution time of a prefix sum operation. The o1-preview agent produced a solution that beat the best human score by implementing novel CUDA kernels and testing different parameters. This shows that, although AI agents have trouble adapting to new information over long periods, they can generate efficient and sophisticated solutions with minimal guidance.
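To give a sense of what that task involves, the following is a minimal sketch of a GPU prefix sum (inclusive scan): a single-block Hillis-Steele scan written in CUDA. It is purely illustrative and is not the agent's solution or the benchmark's reference implementation; the array size N, the single-block launch, and the use of float data are assumptions made for brevity.

#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1024;  // assumed problem size; must equal the block size in this sketch

// Single-block Hillis-Steele inclusive scan using double-buffered shared memory.
__global__ void inclusive_scan(const float* in, float* out) {
    __shared__ float buf[2][N];
    int tid = threadIdx.x;
    int cur = 0;

    buf[cur][tid] = in[tid];
    __syncthreads();

    // At each step, every element adds the value 'offset' positions to its left.
    for (int offset = 1; offset < N; offset <<= 1) {
        int nxt = 1 - cur;
        float v = buf[cur][tid];
        if (tid >= offset) v += buf[cur][tid - offset];
        buf[nxt][tid] = v;
        cur = nxt;
        __syncthreads();
    }
    out[tid] = buf[cur][tid];
}

int main() {
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f;  // all ones -> output should be 1, 2, ..., N

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    inclusive_scan<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0]=%.0f, out[%d]=%.0f (expected 1 and %d)\n",
           h_out[0], N - 1, h_out[N - 1], N);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Competitive solutions in the actual environment are scored on wall-clock runtime and typically go well beyond a sketch like this, for example by using work-efficient multi-block scans, warp-level primitives, and careful tuning of launch parameters.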
The development of benchmarks such as RE-Bench is crucial for monitoring the progress and potential risks of autonomous AI in research and development environments. Both the White House and the EU have emphasized the importance of evaluating AI capabilities in R&D contexts, in the National Security Memorandum on AI and the EU Artificial Intelligence Act, respectively. By releasing RE-Bench and the associated data as open source, METR hopes to contribute to evaluations that can identify dangerous levels of autonomous AI R&D capability.
Source: Evaluating frontier AI R&D capabilities of language model agents against human experts