Search as Code: What Perplexity's New Architecture Means

Key Points

  • Search is no longer a fixed pipeline: Perplexity's Search as Code (SaC) lets AI agents assemble custom retrieval workflows from composable primitives instead of calling a monolithic search endpoint.
  • The bottleneck was control, not intelligence: Frontier models could already reason well. The problem was they couldn't control how context was retrieved, only what query to submit.
  • SaC outperforms every tested system on 4 of 5 benchmarks: Including a 2.5x lead on WANDR, the most complex multi-step research benchmark, while dramatically cutting token costs.
  • This changes what "optimizing for search" means: When agents can run thousands of retrieval operations per task and filter by source class at the query level, traditional keyword strategy and domain authority assumptions no longer fully hold.
  • GEO and source selection are no longer passive: Agents in SaC architectures are encoding source-class rules directly into their retrieval logic. Your content either gets written into the query plan or it doesn't.

For years, search worked the same way whether the user was human or machine. You submit a query. The engine runs its pipeline. You get back a ranked list. The consumer adapts.

That contract is breaking. Perplexity just published the architecture that replaces it, and it's worth understanding at a mechanical level, not just conceptually. Because what they've built changes the unit of search from "a query and its results" to "a program that assembles the search process itself."

This isn't incremental. It's a rewrite of the relationship between retrieval and reasoning.

What Search as Code Is

The core idea is this: instead of exposing a single search endpoint that accepts a query and returns results, Perplexity has broken their search stack into atomic primitives and wrapped them in an SDK. Models running inside agent harnesses can now write Python code that assembles those primitives into a custom retrieval pipeline tailored to each specific task.

The result is that a model doesn't just ask search a question. It programs the search process.

That means a single task can involve hundreds or thousands of discrete retrieval operations, parallel fetches, deduplication, source filtering, custom ranking, intermediate state management, and LLM-assisted planning subroutines -- all within a single inference turn. Perplexity documents a CVE research task on which SaC cut token usage by 85% (from 288,700 tokens to 42,900) while achieving 100% accuracy; the tested non-Perplexity systems all scored under 25%.
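The 85% figure follows directly from the two token counts quoted above. A quick check, using only those two reported numbers:

```python
baseline_tokens = 288_700  # reported token usage before SaC
sac_tokens = 42_900        # reported token usage with SaC

saved = baseline_tokens - sac_tokens
reduction_pct = 100 * saved / baseline_tokens

print(f"tokens saved: {saved:,}")
print(f"reduction: {reduction_pct:.1f}%")
```

The result rounds to the 85% the article cites.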

That kind of efficiency isn't coming from a better search API. It's coming from the model controlling what the search API actually does.

Why the Old Model Was a Bottleneck

Traditional search pipelines have one lever: the query. Everything downstream of that -- how results are ranked, filtered, aggregated, and returned -- is fixed by the engine. For human users, this was never a problem. They couldn't micromanage the pipeline even if the option were offered.

For AI agents, it was a growing structural failure. Perplexity identifies three specific failure modes worth understanding.

Coarse context

A monolithic search endpoint optimized for recall floods the model with noise when it only needs one surgical fact. Token efficiency collapses. The model reasons over irrelevant material. Accuracy degrades.

Inability to leverage domain knowledge

The model may know, based on prior context or its training, that a specific source class, lexical/semantic blend, or result structure would serve this particular task better. A fixed query interface offers no way to act on that knowledge. The intelligence exists but the interface won't accept it.

Serial control flow and context pollution

Many research workflows aren't naturally linear. They need fan-out across query variants, parallel fetching, deduplication, and conditional continuation. Forcing these through repeated model turns adds latency, blows up cost, and fills the context window with intermediate noise the model then has to reason around.
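To make the non-linear shape concrete, here is a minimal sketch of fan-out, parallel fetching, deduplication, and conditional continuation running inside one program rather than across repeated model turns. Everything in it is an assumption for illustration: `search` is a hypothetical stand-in backed by an in-memory dictionary, not a Perplexity SDK call, and the URLs are placeholders.

```python
import asyncio

# Hypothetical in-memory "index" standing in for a real search backend.
FAKE_INDEX = {
    "sac architecture": ["https://a.example/1", "https://b.example/2"],
    "search as code sdk": ["https://b.example/2", "https://c.example/3"],
    "agentic retrieval": ["https://c.example/3", "https://d.example/4"],
}

async def search(query: str) -> list[str]:
    """Hypothetical search primitive: returns URLs for one query."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return FAKE_INDEX.get(query, [])

async def fan_out(queries: list[str]) -> list[str]:
    """Run all query variants concurrently, then deduplicate in order."""
    batches = await asyncio.gather(*(search(q) for q in queries))
    seen: set[str] = set()
    unique: list[str] = []
    for batch in batches:
        for url in batch:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique

async def research(min_results: int = 4) -> list[str]:
    results = await fan_out(["sac architecture", "search as code sdk"])
    # Conditional continuation: widen the search only if coverage is thin.
    if len(results) < min_results:
        extra = await fan_out(["agentic retrieval"])
        results += [u for u in extra if u not in results]
    return results

urls = asyncio.run(research())
print(urls)
```

Only the final deduplicated list would need to reach the model's context; the intermediate batches stay inside the program.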

SaC eliminates all three by moving the control boundary lower in the stack.

The Three-Layer Architecture

Perplexity's SaC implementation rests on three coupled layers.

Models serve as the control plane

They reason about the task, decompose it into retrieval subtasks, decide what pipeline each subtask requires, and generate Python code to implement it. The model isn't calling search. It's writing the search program.

Sandboxes execute that code deterministically

Perplexity settled on filesystem-based state management over a REPL approach after testing both. The key finding: requiring models to serialize state explicitly rather than implicitly produces better reliability on long trajectories. There's an insight buried here about declarative versus implicit state management that matters well beyond search.
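The difference between explicit and implicit state can be sketched in a few lines. This is an illustration of the general idea only, not Perplexity's sandbox: the file name, the `done`/`pending` layout, and the helper functions are all assumptions. Each step reads its entire state from a file and writes it back, so nothing depends on variables surviving in a live interpreter session.

```python
import json
import tempfile
from pathlib import Path

def save_state(path: Path, state: dict) -> None:
    """Write the full state to disk so the next step starts from a file."""
    path.write_text(json.dumps(state, indent=2, sort_keys=True))

def load_state(path: Path) -> dict:
    """Read state from disk, or start empty if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"done": [], "pending": []}

def run_step(path: Path) -> dict:
    """Process one pending item; the step carries no in-memory history."""
    state = load_state(path)
    if state["pending"]:
        state["done"].append(state["pending"].pop(0))
    save_state(path, state)
    return state

state_file = Path(tempfile.mkdtemp()) / "state.json"
save_state(state_file, {"done": [], "pending": ["vendor-a", "vendor-b"]})

run_step(state_file)
final_state = run_step(state_file)
print(final_state)
```

Because every step is a pure function of the file, a long trajectory can be resumed or inspected at any point by reading `state.json`.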

The Agentic Search SDK exposes atomized search primitives

This is not an API wrapped in a Python library. Perplexity rearchitected their search stack into modular components, then built the SDK to expose those components directly. High-level end-to-end pipelines are still available as a shorthand, but the model can bypass them entirely when the task calls for it.

The SDK itself was optimized through an autoresearch loop running for weeks. That loop proposed and validated changes to the SDK's structure against metrics including latency, codegen quality, and task performance. The resulting SDK has fewer than 2,000 tokens in its root SKILL.md -- constrained deliberately to prevent context bloat.
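The distinction between an end-to-end shorthand and composable primitives can be shown with a toy example. None of the names below come from Perplexity's SDK, whose actual interface is not described in this article; `retrieve`, `filter_by_host`, `rank`, and `search` are hypothetical functions over a made-up three-document corpus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    url: str
    text: str

# Hypothetical corpus; a real system would query a search index.
CORPUS = [
    Doc("https://vendor.example/advisory-1", "fix version 2.1 for the parser bug"),
    Doc("https://news.example/story", "roundup of parser bug coverage"),
    Doc("https://vendor.example/blog", "company picnic photos"),
]

def retrieve(term: str) -> list[Doc]:
    """Primitive: return every document containing the term."""
    return [d for d in CORPUS if term in d.text]

def filter_by_host(docs: list[Doc], host: str) -> list[Doc]:
    """Primitive: keep only documents served from one host."""
    return [d for d in docs if d.url.startswith(f"https://{host}/")]

def rank(docs: list[Doc], term: str) -> list[Doc]:
    """Primitive: order by how often the term appears."""
    return sorted(docs, key=lambda d: d.text.count(term), reverse=True)

def search(term: str) -> list[Doc]:
    """End-to-end shorthand: a fixed retrieve-then-rank pipeline."""
    return rank(retrieve(term), term)

default = search("parser bug")
custom = rank(filter_by_host(retrieve("parser bug"), "vendor.example"), "parser bug")

print([d.url for d in default])
print([d.url for d in custom])
```

The shorthand returns both matching pages; the custom composition inserts a filtering step the fixed pipeline never offered and keeps only the vendor page.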

What the CVE Case Study Actually Shows Search Professionals

The CVE research task is the most instructive concrete example in the paper, and it deserves more attention than a quick read gives it.

The task: identify and characterize over 200 high-severity CVEs from 2023-2025. Each record must cite the affected vendor's own advisory, name the product and fix version, and confirm the fix version is tied to that specific CVE.

Look at what the model's generated code actually does in Part 1. It doesn't just write queries. It encodes the source-class rule into the query structure itself: only vendor-owned advisory URLs qualify, and the exact-phrase constraints are engineered so that aggregator pages, NVD entries, MITRE records, CERT pages, and news stories are structurally excluded before results even come back.
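A rough sketch of what encoding a source-class rule into the query structure might look like. This is not the code from Perplexity's paper: the host names, the `site:` and quoted-phrase syntax, and the "fixed in" phrase are all assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical vendor advisory hosts; the article does not give the real list.
VENDOR_ADVISORY_HOSTS = {
    "security.vendor-a.example",
    "advisories.vendor-b.example",
}

def build_query(host: str, year: int) -> str:
    """Site-scoped query with exact-phrase constraints (assumed syntax)."""
    return f'site:{host} "CVE-{year}-" "fixed in"'

def is_vendor_owned(url: str) -> bool:
    """Accept only URLs whose host is on the vendor allowlist."""
    return urlparse(url).hostname in VENDOR_ADVISORY_HOSTS

queries = [
    build_query(host, year)
    for host in sorted(VENDOR_ADVISORY_HOSTS)
    for year in (2023, 2024, 2025)
]

# Placeholder result URLs, as if returned by a search call.
candidates = [
    "https://security.vendor-a.example/advisories/1",
    "https://aggregator.example/vuln/1",
    "https://news.example/story",
]
kept = [u for u in candidates if is_vendor_owned(u)]

print(queries[0])
print(kept)
```

The rule is applied twice: once when the query is built, so non-vendor pages are unlikely to come back at all, and once as a check on whatever does come back.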

This is not keyword matching. This is a retrieval strategy that bakes content sourcing logic into the retrieval layer.

In Part 2, the model uses an LLM as a planning subroutine. It summarizes which vendor-year pairs are returning insufficient coverage, asks a model to generate targeted refinements, validates each proposed query against structural rules (must be site-scoped, must mention CVE year), and only then executes the expanded set.
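The validation step described above can be sketched as a pair of structural checks. The two rules (site-scoped, mentions a CVE year) come from the article; the regular expressions, the year range, and the `proposed` list, which stands in for LLM-generated refinements, are assumptions.

```python
import re

# Rule 1: the query must be scoped to a site (assumed "site:" syntax).
SITE_SCOPE = re.compile(r"\bsite:[\w.-]+")
# Rule 2: the query must mention a CVE year in the task's range.
CVE_YEAR = re.compile(r"CVE-(2023|2024|2025)\b")

def is_valid_refinement(query: str) -> bool:
    """Accept a proposed query only if it passes both structural rules."""
    return bool(SITE_SCOPE.search(query)) and bool(CVE_YEAR.search(query))

# Stand-in for refinements proposed by a planning model.
proposed = [
    'site:security.vendor-a.example "CVE-2024" firmware update',
    '"CVE-2024" firmware update',                      # not site-scoped
    "site:security.vendor-a.example firmware update",  # no CVE year
]

accepted = [q for q in proposed if is_valid_refinement(q)]
print(accepted)
```

The point of the design is that the planning model's output is treated as untrusted input: only queries that pass deterministic checks are executed.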

In Part 3, structured schema extraction verifies that each record actually binds one CVE to one product to one fix version in vendor-authored text. Confidence thresholds. Deduplication by CVE. Structural rejection of aggregator URLs. All implemented in agent-generated code, not in the SDK.
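A hedged sketch of those three checks working together. The 0.75 confidence threshold is the figure the article cites for this step; the `Record` fields, the aggregator host list, and the sample records (with deliberately fake placeholder IDs) are assumptions, not data from the paper.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass(frozen=True)
class Record:
    cve_id: str
    product: str
    fix_version: str
    source_url: str
    confidence: float

AGGREGATOR_HOSTS = {"aggregator.example", "news.example"}  # hypothetical
MIN_CONFIDENCE = 0.75  # threshold cited in the article

def accept(record: Record) -> bool:
    """Require a complete CVE-product-fix binding from a non-aggregator URL."""
    complete = all([record.cve_id, record.product, record.fix_version])
    host = urlparse(record.source_url).hostname
    return (
        complete
        and record.confidence > MIN_CONFIDENCE
        and host not in AGGREGATOR_HOSTS
    )

def dedupe_by_cve(records: list[Record]) -> list[Record]:
    """Keep the highest-confidence record for each CVE ID."""
    best: dict[str, Record] = {}
    for r in records:
        if r.cve_id not in best or r.confidence > best[r.cve_id].confidence:
            best[r.cve_id] = r
    return list(best.values())

# Placeholder records; the IDs are intentionally not real CVE identifiers.
records = [
    Record("CVE-EXAMPLE-1", "Router OS", "1.2.3", "https://security.vendor-a.example/a1", 0.92),
    Record("CVE-EXAMPLE-1", "Router OS", "1.2.3", "https://security.vendor-a.example/a1-copy", 0.80),
    Record("CVE-EXAMPLE-2", "Mail Server", "", "https://security.vendor-a.example/a2", 0.95),
    Record("CVE-EXAMPLE-3", "VPN Client", "4.0", "https://aggregator.example/x", 0.99),
    Record("CVE-EXAMPLE-4", "NAS", "7.1", "https://security.vendor-a.example/a4", 0.60),
]

final = dedupe_by_cve([r for r in records if accept(r)])
print(final)
```

Note that the aggregator record is rejected despite having the highest confidence: source class is checked independently of extraction quality.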

The implication for SEOs: source classification is now happening at the query generation level. If your content is the type of source that gets excluded structurally -- too much aggregation, too little primary authorship, insufficient specificity -- you may not make it into the retrieval candidate set regardless of your ranking signals.

Benchmark Results and What They Mean

SaC leads every other tested system on four of five benchmarks, with a near-tie on HLE, where OpenAI Responses matches it almost exactly.

The WANDR gap is the most significant number in the paper. WANDR was designed specifically to test complex, multi-step horizontal research tasks -- the kind of knowledge-intensive professional work that most resembles real agentic use cases. SaC leads the next-best system by a factor of 2.5 on WANDR. That benchmark is also still far from saturated, even for SaC, which means the ceiling on what this architecture can do hasn't been reached.

On cost-performance, SaC's low-reasoning configuration outperforms at least two competing systems while being cheaper than all of them. Medium reasoning beats all non-SaC systems for under $1 per task on DSQA. That's not a research result. That's a commercial deployment profile.

The tested systems include OpenAI Responses API (GPT 5.5 high reasoning with web search and code interpreter), Anthropic Managed Agents (Opus 4.7 high reasoning), Exa Agent, and Parallel Tasks on ultra4x. SaC outperforms all of them on four of the five benchmarks. The fact that it does so using the same underlying model as OpenAI's evaluated configuration means the architecture is producing the performance gain, not the model.

What This Changes for Search Visibility

The immediate tactical implication is this: when an agent has fine-grained control over retrieval, traditional ranking signals operate differently.

Position still matters, because the SDK can still call high-level search pipelines that return ranked lists. But the model now has the ability to filter, re-rank, and structurally exclude results by criteria the standard pipeline never applied. A site can rank highly in conventional search and still be written out of an agentic retrieval workflow because the agent's generated code defines it as the wrong source class for that task.

GEO research (Aggarwal et al., KDD 2024) already showed that citation addition and statistics addition boost visibility in generative engines by up to 40%. Keyword stuffing, by contrast, hurt performance. Those findings hold here, but SaC adds another layer: it's not just about what your content contains, it's about what category of source your content is.

The CVE case study encodes this explicitly. Vendor-authored, directly relevant, specific, URL-verifiable. Content that fails those filters doesn't get retrieved. Not because it ranked poorly, but because the agent decided in advance it wasn't the right kind of source.

Content strategy implications that follow from this:

Source identity matters more than keyword coverage

If your content is the authoritative primary source on a topic -- not an aggregation, not a round-up, not a derivative piece -- it is more likely to survive structural source filtering in agentic retrieval workflows.

Specificity and verifiability are functional ranking signals

The schema extraction step in Part 3 requires confidence above 0.75, version numbers, vendor names, and explicit CVE binding. Content that provides extractable, structured, specific claims survives this filter. Content that speaks in generalities does not.

Entity clarity is retrieval infrastructure

If your brand, your claims, your data, and your authorship aren't clearly associated at the page and domain level, the model's source-class reasoning has less to work with. That gap shows up in exclusion, not just deprioritization.

The Bigger Picture for the Search Profession

Search as Code represents one end of a spectrum that's moving fast. On one end, a human types a query and evaluates a SERP. On the other end, an AI agent programs its own retrieval workflow, runs thousands of operations, and synthesizes a result without a human ever seeing a results page.

Most queries are still closer to the human end. But the high-value, knowledge-intensive tasks that professional search is most concerned with -- research, competitive intelligence, due diligence, complex question answering -- are moving toward the agentic end. That's exactly the territory SaC is built for, and where it most dramatically outperforms everything else.

The profession needs to shift the frame. Ranking is still real. Technical SEO still matters. But the question "will my content rank?" is increasingly incomplete. The full question is: will my content be retrieved, classified as the right kind of source, pass structural and confidence filters, survive deduplication, and make it into the context that the model actually reasons over?

Those are five different gates, not one.

Frequently Asked Questions

That was complex. This part doesn't get much better. Fair warning.

What is Search as Code?

Search as Code (SaC) is Perplexity's search architecture that exposes individual search stack components as programmable primitives in an SDK. Instead of submitting a query to a fixed search pipeline, AI agents generate Python code to assemble custom retrieval workflows tailored to each specific task.

How is SaC different from a regular search API?

A regular search API accepts a query and returns processed results. SaC exposes the internal components of the search pipeline -- retrieval, ranking, filtering, aggregation -- as separate callable functions. The model programs how those components are combined rather than just providing input to a fixed process.

Does traditional SEO still matter for agentic search?

Yes, but it's insufficient on its own. Traditional ranking signals still influence which URLs appear in candidate sets. But agentic systems can then apply structural filters, source-class rules, confidence thresholds, and deduplication logic that conventional ranking doesn't account for. Content quality, specificity, primary authorship, and entity clarity matter in ways that keyword optimization alone doesn't address.

What is WANDR and why does Perplexity's SaC score matter?

WANDR is a new benchmark designed to test multi-step horizontal research tasks similar to professional knowledge work. Perplexity's SaC achieves a score of 0.386 versus the next-best system at 0.152, a 2.5x lead. The benchmark is described as unsaturated, meaning no system has come close to solving it fully.

What can SEOs do to improve visibility in agentic retrieval workflows?

Prioritize being the primary source rather than an aggregator, include extractable specific claims with clear attribution, build entity clarity at the page and domain level, and think in terms of source classification, not just keyword match. Content that looks like the canonical answer to a question from a credible primary source is more likely to survive structural filtering in agentic pipelines.