Anthropic Just Gave AI Agents a Definition of Done

Anthropic released a new feature for its Managed Agents API called Outcomes — and it's a more meaningful shift in how AI agents work than the name suggests.

Here's the short version: you tell the agent what "done" looks like, give it a rubric, and it iterates against that rubric until the work passes — or until it runs out of attempts.

A separate grader model evaluates each iteration in an isolated context window, returns a per-criterion breakdown of what passed and what didn't, and hands that feedback back to the agent for the next pass.

The agent keeps working until the grader returns satisfied, the run hits max_iterations, or the whole thing breaks down with failed.

No more prompting and hoping. You define the finish line.
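
Stripped to its control flow, the pattern looks something like this (a plain-Python sketch of the loop, not Anthropic's implementation or SDK; `produce` and `grade` stand in for the agent and the grader model):

```python
# Illustrative control flow only -- not Anthropic's implementation or SDK.
def run_with_outcome(produce, grade, task, rubric, max_iterations=3):
    """Iterate until every rubric criterion passes or attempts run out.

    produce(task, rubric, feedback) -> draft output (str)
    grade(draft, rubric) -> {criterion: (passed: bool, note: str)}
    """
    feedback, draft = None, None
    for attempt in range(1, max_iterations + 1):
        draft = produce(task, rubric, feedback)   # agent pass
        report = grade(draft, rubric)             # isolated grader pass
        if all(passed for passed, _ in report.values()):
            return {"result": "satisfied", "iterations": attempt, "output": draft}
        # Only the failed criteria go back to the agent for the next pass.
        feedback = {c: note for c, (passed, note) in report.items() if not passed}
    return {"result": "max_iterations_reached", "iterations": max_iterations, "output": draft}
```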

How It Actually Works

The rubric is a markdown document.

You write scoring criteria for each item — specific, measurable, honest — and pass the rubric either inline or via the Files API for reuse across sessions.
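
As an illustration, a rubric for a financial model might look like the sample below; the criteria are invented for the example, not a template from Anthropic.

```markdown
# Outcome rubric: Q3 revenue model

## Criterion 1: Historical grounding
Pass if revenue projections reference at least five years of historical data.

## Criterion 2: Stated assumptions
Pass if WACC, growth rate, and terminal value assumptions are stated explicitly.

## Criterion 3: Reconciliation
Pass if the summary tab totals match the detail tabs exactly.
```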

The agent receives a user.define_outcome event and starts working immediately.

You don't need to send a separate message to kick it off.
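
Here's a rough guess at what that definition might carry; the event name and max_iterations come from the feature itself, while the surrounding field names are placeholders rather than a published schema:

```python
# Rough guess at the shape of an outcome definition. The event type and
# max_iterations are real; the other field names are placeholders, not a schema.
define_outcome = {
    "type": "user.define_outcome",
    "rubric": {
        # Option A: inline markdown (placeholder field name)
        "content": "# Outcome rubric\n\n## Criterion 1: Historical grounding\n...",
        # Option B: a Files API reference, for reuse across sessions (placeholder)
        # "file_id": "file_abc123",
    },
    "max_iterations": 3,  # default is 3; the hard cap is 20
}
```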

The grader runs in a separate context window by design.

That's not a small detail — it means the evaluator isn't influenced by the agent's own reasoning trail or the choices it made to get there. It's reading the output, not the process. The feedback it returns is specific: not "try harder," but "revenue projections are missing five years of historical data" or "WACC assumptions are not stated."
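
A useful mental model for that feedback is a per-criterion map rather than a single score; the structure below is assumed for illustration, not the API's actual response shape.

```python
# Assumed structure, for illustration only: one entry per rubric criterion.
grader_report = {
    "historical_grounding": {
        "passed": False,
        "feedback": "Revenue projections are missing five years of historical data.",
    },
    "stated_assumptions": {
        "passed": False,
        "feedback": "WACC assumptions are not stated.",
    },
    "reconciliation": {
        "passed": True,
        "feedback": "Summary tab reconciles with the detail tabs.",
    },
}
```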

Iteration events surface on the stream as span.outcome_evaluation_start, span.outcome_evaluation_ongoing, and span.outcome_evaluation_end.

The result field on that final event tells you exactly what happened:

satisfied, needs_revision, max_iterations_reached, failed, or interrupted.
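
On the consumer side, handling this can be as simple as switching on the event type and, on the final event, the result value. The event and result names below match the ones above; the payload shape is an assumption.

```python
# Sketch of consuming outcome-evaluation events from the stream.
def handle_event(event: dict) -> None:
    etype = event.get("type")
    if etype == "span.outcome_evaluation_start":
        print("grader pass started")
    elif etype == "span.outcome_evaluation_ongoing":
        print("grader pass in progress")
    elif etype == "span.outcome_evaluation_end":
        result = event.get("result")
        if result == "satisfied":
            print("done: rubric satisfied")
        elif result == "needs_revision":
            print("agent is revising against grader feedback")
        elif result == "max_iterations_reached":
            print("out of attempts -- escalate to a human")
        elif result in ("failed", "interrupted"):
            print(f"run ended early: {result}")
```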

Output files land in /mnt/session/outputs/ and are retrievable via the Files API scoped to the session.
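
Inside the session sandbox that's an ordinary directory, so listing the deliverables is a few lines (sketch below); from your own infrastructure, the same files come back through the session-scoped Files API.

```python
# Deliverables land under /mnt/session/outputs/ inside the session sandbox.
# Listing them like this only works in that sandbox; retrieve them externally
# through the session-scoped Files API.
from pathlib import Path

for artifact in sorted(Path("/mnt/session/outputs").iterdir()):
    print(artifact.name, artifact.stat().st_size, "bytes")
```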

Iterations default to 3, with a maximum of 20.

The Problem This Solves

Anyone who has run AI agents on real deliverables knows the failure mode: the agent produces something plausible-looking that misses the actual requirement, and you only find out when a human reviews it. The feedback loop is slow, the corrections are manual, and the "autonomous" part of autonomous agent turns out to mean "autonomously produces a first draft."

Outcomes move the quality gate inside the loop. The grader is doing what a reviewer does — checking the work against stated requirements — but doing it programmatically, per iteration, before the output ever reaches a human. That's not a replacement for human review on high-stakes work. It's a first-pass filter that raises the floor.

What It Means for Teams Building on Claude

If you're running agentic workflows for content production, financial modeling, research synthesis, or any task where "good enough" has a definition, this is infrastructure worth integrating. The rubric format is flexible enough to encode real quality standards — not just "write clearly" but specific, verifiable criteria.

The honest caveat: rubric quality determines everything. A vague rubric produces a vague grader. The teams that get the most out of this will be the ones that can articulate what done actually looks like — which, it turns out, is harder than it sounds and more valuable than most organizations realize.

Marketing and growth teams exploring AI-powered workflows should be paying attention here. Our team at Winsome Marketing helps organizations translate capabilities like this into real operational strategy — not just experiments. Let's talk.