Introduction to Deep Research
As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they’re not just answering simple factual questions—they’re tackling “deep research” tasks, which involve multi-step reasoning, evaluating conflicting information, sourcing data from across the web, and synthesizing it into a coherent output. This emerging capability is now being marketed under different brand names by major labs—OpenAI calls it “Deep Research”, Anthropic refers to it as “Extended Thinking”, Google’s Gemini offers “Search + Pro” features, and Perplexity labels theirs “Pro Search” or “Deep Research”.
What Is Deep Research Bench?
Created by the FutureSearch team, Deep Research Bench (DRB) is a meticulously constructed benchmark designed to assess AI agents’ performance on multi-step, web-based research tasks. These aren’t simple questions with straightforward answers; they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings. The benchmark includes 89 distinct tasks across 8 categories, such as the examples below (a rough sketch of what one task might look like follows the list):
- Find Number: e.g. “How many FDA Class II medical device recalls occurred?”
- Validate Claim: e.g. “Is ChatGPT 10x more energy-intensive than Google Search?”
- Compile Dataset: e.g. “Job trends for US software developers from 2019–2023”
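To make the shape of these tasks more concrete, here is a minimal sketch of how a single task could be represented in code. The field names and category labels are illustrative assumptions, not the benchmark’s actual schema.

```python
from dataclasses import dataclass, field

# Illustrative only: these field names are assumptions, not DRB's real schema.
@dataclass
class ResearchTask:
    task_id: str
    category: str    # e.g. "find_number", "validate_claim", "compile_dataset"
    prompt: str      # the open-ended research question posed to the agent
    answer_key: dict = field(default_factory=dict)  # gold data the scorer compares against

example = ResearchTask(
    task_id="find_number_example",
    category="find_number",
    prompt="How many FDA Class II medical device recalls occurred?",
)
```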
The Agent Architecture: ReAct and RetroSearch
At the heart of Deep Research Bench lies the ReAct architecture, short for “Reason + Act.” This method mimics how a human researcher might tackle a problem—by thinking through the task, taking an action like performing a web search, observing the results, and then deciding whether to iterate or conclude. While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch—a custom-built, static version of the web. Rather than relying on the live internet, which constantly changes, agents tap into a curated archive of web pages scraped using tools like Serper, Playwright, and ScraperAPI.
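As a rough illustration of that loop, the sketch below implements a bare-bones ReAct-style agent in Python. The `llm` and `retro_search` functions are hypothetical stand-ins (for any chat-model API and for a lookup against a frozen page archive like RetroSearch); the real DRB harness is more involved.

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-model API call."""
    return "FINAL ANSWER: (stub)"

def retro_search(query: str) -> str:
    """Hypothetical stand-in for a lookup against a static, pre-scraped web archive."""
    return "(stub page text)"

def react_agent(task_prompt: str, max_steps: int = 10) -> str:
    """Bare-bones Reason + Act loop: think, optionally search, observe, repeat."""
    transcript = [f"Task: {task_prompt}"]
    for _ in range(max_steps):
        # Reason: ask the model for its next thought/action given the transcript so far.
        step = llm("\n".join(transcript) + "\nThought:")
        transcript.append(f"Thought: {step}")

        if step.startswith("FINAL ANSWER:"):
            # Conclude: the model decided it has enough information.
            return step.removeprefix("FINAL ANSWER:").strip()

        if step.startswith("SEARCH:"):
            # Act and observe: run the search tool and feed the result back in.
            observation = retro_search(step.removeprefix("SEARCH:").strip())
            transcript.append(f"Observation: {observation}")

    return "No answer reached within the step budget."

print(react_agent("Is ChatGPT 10x more energy-intensive than Google Search?"))
```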
Which AI Agents Perform Best?
Among all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. While that might sound modest, it’s essential to understand the benchmark’s difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8—what researchers call the “noise ceiling.” In other words, even the best models today still fall short of well-informed, methodical human researchers. The leaderboard offers revealing insights, with o3 not only leading the pack but doing so with speed and consistency, showing strong performance across nearly all task types.
Where Do Agents Struggle?
One of the most frustrating aspects of working with AI agents, especially during long research or content creation sessions, is when they simply forget what you were doing. As the context window stretches, the model often begins to lose the thread: key details fade, goals get muddled, and suddenly, the responses feel disjointed or aimless. That kind of forgetfulness isn’t just anecdotal—it’s the most significant predictor of failure in the Deep Research Bench evaluation. Other recurring issues include models falling into repetitive tool use, showing poor query crafting, and falling victim to premature conclusions.
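One plausible mechanism behind that forgetting, sketched below, is naive context management: when a long transcript is trimmed to a token budget by dropping the oldest messages first, the original task statement is often the first thing to disappear. The message contents and the 4-characters-per-token estimate are made up for illustration.

```python
def truncate_context(messages: list[str], budget_tokens: int = 50) -> list[str]:
    """Keep only the newest messages that fit under a rough token budget."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = max(1, len(msg) // 4)     # crude estimate: ~4 characters per token
        if used + cost > budget_tokens:
            break                        # everything older, including the goal, falls off
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["GOAL: compile US software developer job trends, 2019-2023"]
history += [f"Observation {i}: ...long scraped page text..." for i in range(30)]

print(truncate_context(history)[0])  # no longer the GOAL line: the task statement was dropped
```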
What About Memory-Based Performance?
Deep Research Bench also evaluated what it calls “toolless” agents—language models operating without any access to external tools, such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they’ve previously learned during training. Interestingly, these toolless agents performed almost as well as full research agents on certain tasks, such as validating claims. However, on more demanding tasks that require piecing together multiple values from various sources or finding and evaluating diverse facts in context, these toolless models completely fell apart.
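A minimal way to picture the toolless setup is below: the same underlying model is queried with and without access to a search tool. The function names, prompt wording, and the fixed five-query loop are all assumptions made for illustration, not DRB’s actual harness.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for any LLM API call."""
    return "(stub answer)"

def search_archive(query: str) -> str:
    """Hypothetical stand-in for a lookup in a frozen web archive."""
    return "(stub page text)"

def answer(task_prompt: str, allow_search: bool) -> str:
    if not allow_search:
        # "Toolless": the model must answer from whatever it memorized during training.
        return call_model(f"Without searching the web, answer:\n{task_prompt}")

    # Tool-enabled: gather a few pages from the archive, then answer from those notes.
    notes = task_prompt
    for _ in range(5):
        query = call_model(f"Propose one web search query for:\n{notes}")
        notes += "\n" + search_archive(query)
    return call_model(f"Using the notes below, answer the task.\n{notes}")

print(answer("Is ChatGPT 10x more energy-intensive than Google Search?", allow_search=False))
```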
Conclusion
The Deep Research Bench report highlights the capabilities and limitations of current AI agents in performing deep research tasks. While the best models can outpace average humans on narrowly defined tasks, they still lag behind skilled generalist researchers, especially in planning strategically, adapting mid-process, and reasoning with nuance. The gap becomes especially obvious during long or complex sessions, where an agent gradually loses track of the task’s purpose, leading to a frustrating breakdown in coherence and utility. As LLMs continue to integrate into serious knowledge work, tools like Deep Research Bench will be essential for assessing not just what these systems know, but how well they actually work.