Perplexity's Deep Research Now Runs 20 AI Models at Once
Writing Team: Jun 19, 2026 8:00:00 AM
Research has always been the task that AI tools promise to fix and rarely do completely. You get a summary. You still have to go find the sources. You still have to check if they say what the AI claims. You still have to turn the output into something a colleague can actually use.
Perplexity is making a direct argument that this version is different. The benchmark data is worth examining.
Key Points
- Deep Research moved into Computer: Perplexity's research tool now runs inside Computer, its multi-model orchestration system that coordinates up to 20 frontier AI models in a single workflow.
- The benchmark gains are substantial: BrowseComp accuracy jumped from 40.7% to 83.8%. Humanity's Last Exam rose from 36.4% to 50.5%. These are not incremental improvements.
- Search as Code is the architectural shift: Instead of running a fixed pipeline, the system writes code that assembles the search dynamically, running thousands of retrieval steps in parallel tailored to each question.
- Outputs are work-ready: Reports, decks, and live dashboards come out the other end, not raw summaries. The system reads your internal files alongside the live web and cites every claim inline.
- The benchmarks are first-party: Perplexity published these numbers itself. Independent verification does not yet exist, which matters when evaluating the size of the gains claimed.
What Changed Under the Hood
The previous version of Deep Research ran a fixed pipeline: search, read, summarize, cite. The new version, running inside Perplexity Computer, starts by breaking a complex question into subtasks and routing each one to the model best suited for it. Legal reasoning goes to a legal model. Data analysis goes to a data model. Final synthesis goes to a writing model. Opus 4.6 serves as the core reasoning engine, with Gemini handling deep research subtasks.
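To make the routing idea concrete, here is a minimal sketch of how a question might be split into subtasks and dispatched by type. Everything in it is an assumption for illustration: the `Subtask` type, the `plan` and `route` functions, and the model labels are hypothetical placeholders, not Perplexity's actual configuration or API.

```python
from dataclasses import dataclass

# Hypothetical routing table. The labels are placeholders for illustration,
# not the models or names Perplexity actually uses.
ROUTES = {
    "legal": "legal-model",
    "data": "data-model",
    "synthesis": "writing-model",
}
DEFAULT_MODEL = "core-reasoning-model"


@dataclass(frozen=True)
class Subtask:
    kind: str    # category used to pick a model
    prompt: str  # instruction handed to that model


def route(subtask: Subtask) -> str:
    """Return the model label for a subtask, falling back to the default."""
    return ROUTES.get(subtask.kind, DEFAULT_MODEL)


def plan(question: str) -> list[Subtask]:
    """Toy decomposition: a real planner would derive subtasks from the question."""
    return [
        Subtask("legal", f"Identify the regulations relevant to: {question}"),
        Subtask("data", f"Collect and tabulate figures for: {question}"),
        Subtask("synthesis", f"Write a cited brief answering: {question}"),
    ]


# Pair each planned subtask with the model it would be sent to.
assignments = [(task.kind, route(task)) for task in plan("US vs EU data privacy law")]
```

The fallback to a default model matters in a design like this: a subtask type the router has never seen should still land somewhere sensible rather than fail.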
The orchestration layer is what Perplexity calls Search as Code. Rather than following the same retrieval sequence every time, the system writes code on the fly that assembles the search based on the specific question. That code runs in a sandbox, calls Perplexity's Agentic Search SDK, and executes thousands of retrieval steps in parallel. It can branch, compare sources, and refine its approach as it learns what the question actually requires. This is meaningfully different from a search tool that runs the same steps regardless of what you ask.
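The parallel part of that description can be sketched in a few lines. This is not the Agentic Search SDK, whose interface is not described here. The `retrieve` function below is a hypothetical stand-in that returns canned results, and `build_queries`, `run_search`, and the `max_parallel` cap are assumptions made for illustration. The point is only the shape: queries are generated from the question, then fanned out concurrently with a limit on how many run at once.

```python
import asyncio


async def retrieve(query: str) -> dict:
    """Hypothetical stand-in for one retrieval step; a real system would call a search API."""
    await asyncio.sleep(0)  # yield control to the event loop, simulating I/O
    return {"query": query, "hits": [f"result for {query}"]}


def build_queries(question: str, angles: list[str]) -> list[str]:
    """Assemble the search from the question instead of using a fixed sequence."""
    return [f"{question} {angle}" for angle in angles]


async def run_search(question: str, angles: list[str], max_parallel: int = 8) -> list[dict]:
    """Run all retrieval steps concurrently, capped at max_parallel in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(query: str) -> dict:
        async with semaphore:
            return await retrieve(query)

    # gather returns results in the same order as the queries were submitted.
    return await asyncio.gather(*(bounded(q) for q in build_queries(question, angles)))


results = asyncio.run(run_search("AI chip margins", ["2021", "2022", "2023"]))
```

A production system would add the branching and refinement the article mentions, for example by inspecting `results` and generating a second round of queries.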
The system also reads your internal files. Upload a PDF or spreadsheet and it cross-references your internal data against live web sources, census data, and databases including PitchBook and CB Insights.
The Numbers That Matter
The BrowseComp result is the one worth paying attention to. BrowseComp, developed by OpenAI, tests an agent's ability to find hard-to-find information by actively browsing many pages. Legacy Deep Research scored 40.7%. Deep Research in Computer scored 83.8%. That is not a refinement. It is a gain of 43.1 percentage points, slightly more than a doubling, on a benchmark specifically designed to test the thing this product claims to do.
Humanity's Last Exam, which covers expert-level questions across academic disciplines, moved from 36.4% to 50.5%. DeepSearchQA, a Google DeepMind benchmark where the legacy system already performed well at 81.9%, improved modestly to 85%.
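For readers who want to size the gains themselves, the short script below works through the arithmetic using only the scores quoted above. The `gains` helper is ours, written for illustration. It reports both the percentage-point change and the ratio of new score to old, since the two tell different stories when the starting point is already high.

```python
# First-party scores as quoted in this article: (legacy, Deep Research in Computer).
scores = {
    "BrowseComp": (40.7, 83.8),
    "Humanity's Last Exam": (36.4, 50.5),
    "DeepSearchQA": (81.9, 85.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, ratio of new score to old), both rounded."""
    return round(after - before, 1), round(after / before, 2)


for name, (before, after) in scores.items():
    points, ratio = gains(before, after)
    print(f"{name}: +{points} points, {ratio}x the legacy score")
```

On these figures BrowseComp gains 43.1 points (2.06x), Humanity's Last Exam gains 14.1 points (1.39x), and DeepSearchQA gains 3.1 points (1.04x).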
The caveat is real: these are Perplexity's own numbers. The company published them to accompany the product launch. Independent benchmarking has not yet validated these figures, and the history of AI benchmark announcements includes enough first-party optimism to warrant some patience before treating them as settled.
What the Outputs Look Like
The practical question for any team evaluating this is what comes out the other end. Perplexity describes the deliverables as reports, briefs, decks, and live spreadsheets, with the system writing inside the file rather than alongside it. It shows a preview before any change is applied, requiring human approval before it lands.
The starter use cases Perplexity ships with the product give a sense of the intended scope: comparing cash flow and profit margins across AI chip companies over five years, mapping US and European data privacy law differences into a comparison table, synthesizing clinical trial evidence on a specific drug class, and benchmarking AI models on reasoning, cost, and context length. These are not simple queries. They are the kind of research tasks that currently take a skilled analyst several hours.
Whether the outputs actually hold up to that standard in practice, across messy real-world questions rather than benchmark-optimized tasks, will be determined by user experience over the coming months.
What This Means for Marketing and Research Teams
For growth teams and content operations, the relevant shift here is the move from AI as a summarizer to AI as a research executor. If the BrowseComp gains reflect real-world performance, a tool that can navigate hundreds of sources, cross-reference internal data, route subtasks to specialized models, and return a cited, formatted deliverable would significantly change the economics of research-heavy work.
The competitive intelligence use case is obvious. So are the content research workflow, the market analysis brief, and the campaign planning process that currently involves someone spending a day pulling together data from six different sources. Our AI marketing services at Winsome are increasingly focused on exactly these workflow questions: where does AI-assisted research create durable value, and where does human judgment remain the irreplaceable part of the process?
The honest answer right now is that cited does not mean correct, and work-ready does not mean finished. Human review is still necessary. But if Perplexity's numbers hold, the amount of human review required just got smaller, and that changes the math on what a lean research operation can produce.
If you want help figuring out where tools like this fit into your actual workflow rather than the ideal one, talk to our team.

