Google Opens Gemini Deep Research to Developers

Google announced this week that Gemini Deep Research—its autonomous research agent—is now accessible to developers through the new Interactions API. The agent uses Gemini 3 Pro as its reasoning core, iteratively formulating search queries, identifying knowledge gaps, and synthesizing findings into comprehensive reports. Google claims it achieves state-of-the-art results on several benchmarks, including 46.4% on Humanity's Last Exam and 66.1% on DeepSearchQA, a new benchmark Google is open-sourcing specifically for evaluating multi-step web research tasks.

The company also shared early customer testimonials: financial firms using it for due diligence, biotech companies accelerating drug discovery research, and market research teams compressing multi-day investigations into hours. These are the use cases where autonomous research could genuinely add value—if the technology actually works as described in production environments rather than curated benchmarks.

Let's examine what we know, what we don't, and what matters for anyone considering building on this API.

What Deep Research Actually Does

Gemini Deep Research operates by planning investigations iteratively. It formulates queries, reads results, identifies gaps in understanding, and searches again until it has sufficient information to generate a report. Google emphasizes this release features "vastly improved web search" that can navigate deep into websites for specific data, not just surface-level results.
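That loop maps onto a familiar agent pattern: plan, search, read, and repeat until there is enough to write up. Here is a minimal sketch of that cycle; run_search, find_gaps, and write_report are hypothetical stand-ins supplied by the caller, not actual Interactions API functions.

```python
# Sketch of the iterative research loop described above. The three callables
# are hypothetical stand-ins, not real Interactions API functions.
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    findings: list = field(default_factory=list)
    open_gaps: list = field(default_factory=list)

def deep_research(question, run_search, find_gaps, write_report, max_rounds=10):
    state = ResearchState(question=question, open_gaps=[question])
    for _ in range(max_rounds):
        if not state.open_gaps:
            break                                     # enough information gathered
        query = state.open_gaps.pop(0)                # formulate the next query
        state.findings.extend(run_search(query))      # read results
        state.open_gaps.extend(find_gaps(state))      # identify remaining gaps
    return write_report(state)                        # synthesize into a report
```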

The agent handles multiple input types—PDFs, CSVs, documents, and public web data. It supports large context windows, letting developers include extensive background information directly in prompts. Outputs are customizable through prompting (you can define structure, headers, data tables) and include granular citations for verification. It also supports structured JSON outputs for downstream application parsing.
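The announcement doesn't show what a request looks like, so the endpoint, payload fields, and response schema in the sketch below are assumptions made for illustration; check Google's Interactions API documentation for the real shape. The point is what "document inputs plus a structured JSON output schema with citations" plausibly looks like from the developer's side.

```python
# Hypothetical request to a Deep Research-style endpoint. The URL, payload
# fields, and schema are assumptions for illustration, not the documented API.
import os
import requests

ENDPOINT = "https://example.googleapis.com/v1/deep-research:run"  # placeholder URL

report_schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string"},
                    "body": {"type": "string"},
                    "citations": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
    },
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={
        "task": "Summarize recent regulatory guidance on AI in medical devices.",
        "inputs": [{"type": "pdf", "uri": "gs://my-bucket/background.pdf"}],  # PDFs, CSVs, docs
        "output": {"format": "json", "schema": report_schema},                # structured output
    },
    timeout=600,  # research runs take minutes, not seconds
)
report = response.json()
```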

That's a comprehensive feature set. Whether it translates to reliable performance on tasks that don't resemble benchmark questions is the part we can't evaluate until developers start building on it and reporting real-world results.

DeepSearchQA: A Benchmark Google Created for Itself

Google is open-sourcing DeepSearchQA alongside the API release—900 hand-crafted "causal chain" tasks across 17 fields where each step depends on prior analysis. Unlike traditional fact-based tests, it measures comprehensiveness, requiring agents to generate exhaustive answer sets rather than single correct answers.

This addresses a real limitation of existing benchmarks: they don't capture the complexity of multi-step research where you need to synthesize information from multiple sources, follow reasoning chains, and identify what's missing. DeepSearchQA is more realistic in that sense.

It's also a benchmark Google designed, and on which their own agent achieves the highest scores. That doesn't mean the results are invalid, but it does mean we're evaluating Google's agent on a test Google created specifically to measure what their agent does well. Independent benchmarks from third parties will provide better signal about whether Deep Research's capabilities generalize beyond the scenarios Google optimized for.

The Customer Testimonials: Early Signals, Not Proof

Google includes quotes from GV (Google Ventures) and Axiom Bio praising Deep Research's impact. GV's partner says it "shortened research cycles from days to hours without loss of fidelity or quality." Axiom Bio's co-founder says it "surfaces granular data and evidence at and beyond what previously only a human researcher could do."

These are strong endorsements. They're also from early adopters with direct Google relationships, tested in controlled scenarios where Google likely provided implementation support. That's standard practice for product launches, but it means these results may not reflect what happens when arbitrary developers integrate the API into arbitrary workflows without Google's assistance.

The real test will be whether Deep Research performs reliably for teams that don't have GV's resources or Axiom Bio's technical sophistication—and whether it handles edge cases, ambiguous queries, and domains outside its training distribution as effectively as it handles the structured research tasks these testimonials describe.

The 46.4% Score on Humanity's Last Exam

Google highlights that Deep Research achieves 46.4% accuracy on Humanity's Last Exam, positioning this as state-of-the-art. Humanity's Last Exam is designed to test questions that require expert-level reasoning—the kind that should be difficult for AI systems.

A 46.4% score means the agent is wrong more than half the time on questions specifically designed to be hard. That's consistent with where frontier models are on complex reasoning tasks: better than previous generations, still unreliable enough that you can't trust outputs without verification.

Google frames this positively, emphasizing it's the best score achieved so far. That's technically accurate. It also means Deep Research fails the majority of expert-level questions it attempts, which is important context for anyone considering whether to trust its research outputs on high-stakes decisions.

Inference Time Scaling: More Compute, Better Results

Google's technical report notes that Deep Research's performance improves significantly when allowed to perform more searches and reasoning steps—what they call "inference time scaling." Their pass@8 versus pass@1 comparison shows substantial accuracy gains when the agent explores multiple parallel trajectories for answer verification.
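The pass@k idea itself is easy to sketch, even though Google hasn't described its verification strategy in detail. In the hedged example below, run_trajectory stands in for one complete research attempt; the k attempts run in parallel and the most common answer is kept. This illustrates the scaling pattern, not Google's implementation.

```python
# Illustration of inference time scaling via parallel trajectories (pass@k style).
# run_trajectory is a placeholder for one complete research attempt; Google's
# actual answer-verification method is not described in the announcement.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def answer_with_scaling(question, run_trajectory, k=8):
    with ThreadPoolExecutor(max_workers=k) as pool:
        answers = list(pool.map(run_trajectory, [question] * k))
    best, _ = Counter(answers).most_common(1)[0]  # keep the answer most trajectories agree on
    return best
```

Each extra trajectory multiplies searches, tokens, and latency, which is the cost side of the tradeoff discussed below.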

This is both promising and concerning. Promising because it suggests the agent can self-correct and improve through iteration. Concerning because it means reliable performance requires more compute, longer latency, and higher cost per query—exactly the tradeoffs that make autonomous agents difficult to deploy at scale.

If Deep Research needs eight parallel attempts to reliably solve a problem that humans solve in one attempt, the economics only work for use cases where research time is extremely expensive or where accuracy improvements justify significantly higher compute costs. Those use cases exist, but they're narrower than the broad "autonomous research agent" framing suggests.

What Google Isn't Saying About Limitations

The announcement doesn't discuss failure modes, accuracy thresholds, or scenarios where Deep Research performs poorly. It doesn't specify how the agent handles conflicting sources, how it evaluates source credibility, or what happens when required information doesn't exist on the public web.

It also doesn't address the practical challenges of integrating autonomous agents into existing workflows: how to validate outputs, when to trust the agent versus performing manual research, how to handle cases where the agent confidently presents incorrect information.

These aren't hypothetical concerns. They're the implementation challenges every team building on AI agents encounters—and they're usually harder to solve than the technical integration itself.
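Some of that validation can at least be partially mechanized. Here is a hedged sketch of one possible guardrail: flag report sections that carry no citations or cite URLs that don't resolve. The field names mirror the hypothetical schema sketched earlier, not a documented response format, and a URL that resolves still says nothing about whether the page actually supports the claim.

```python
# One possible post-hoc audit of a structured research report: flag sections
# with missing or unreachable citations. Field names follow the hypothetical
# schema used earlier, not a documented Deep Research response format.
import requests

def audit_report(report):
    problems = []
    for section in report.get("sections", []):
        heading = section.get("heading", "?")
        citations = section.get("citations", [])
        if not citations:
            problems.append(f"Uncited section: {heading}")
            continue
        for url in citations:
            try:
                status = requests.head(url, timeout=10, allow_redirects=True).status_code
                if status >= 400:
                    problems.append(f"Dead citation in '{heading}': {url}")
            except requests.RequestException:
                problems.append(f"Unreachable citation in '{heading}': {url}")
    return problems  # a human still has to check that cited pages support the claims
```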

The Real Question: Who Should Build on This?

Deep Research makes sense for specific use cases: preliminary research that needs breadth over perfection, investigation tasks where synthesizing multiple sources manually is time-prohibitive, and scenarios where AI-generated drafts accelerate human review rather than replacing it entirely.

It makes less sense for tasks requiring 100% accuracy, domains where source credibility is contested, or workflows where incorrect AI outputs create more work than they save. The challenge is that you often don't know which category your use case falls into until you've already invested time building on the API.

Google's roadmap includes chart generation, Model Context Protocol support for custom data sources, and Vertex AI availability for enterprises. These are useful enhancements. They don't change the fundamental constraint that autonomous agents are as reliable as their training data, search results, and reasoning capabilities allow—and all three have documented limitations.

Adoption Will Tell Us What Benchmarks Can't

The value of Gemini Deep Research won't be determined by benchmark scores or launch testimonials. It'll be determined by whether developers build production systems on it, whether those systems perform reliably enough to justify the integration effort, and whether the cost-benefit calculus works for use cases beyond the early adopters Google featured.

That data doesn't exist yet. Google is making a capable research agent available to developers for the first time. What happens next depends on how well it performs outside controlled environments—and whether "state-of-the-art on benchmarks" translates to "reliable enough to trust with real work."

If you're evaluating AI agent capabilities for research workflows and need help understanding what accuracy scores actually mean in practice, Winsome's team can walk you through the implementation realities benchmarks don't capture.
