3 min read · Writing Team · Dec 23, 2025
Anthropic just released Bloom, an open-source framework for generating behavioral evaluations of frontier AI models. Before you reflexively scroll past another AI safety announcement, consider the actual problem this solves: creating high-quality behavioral evaluations traditionally takes months of researcher time, and by the time you finish, your evaluation set has already leaked into training data or been made obsolete by capability improvements.
Bloom generates targeted evaluation suites in days. It takes a researcher-specified behavior—delusional sycophancy, self-preservation instincts, self-preferential bias—and automatically creates scenarios, simulates interactions, and scores results. Anthropic used it to benchmark four alignment-relevant behaviors across 16 frontier models. The entire process, from conceptualization to results, took "only a few days."
This is not incremental improvement. This is the difference between safety research that keeps pace with capability development and safety research that's perpetually playing catch-up.

Bloom operates through four automated stages: Understanding (analyzes the behavior to be measured), Ideation (generates evaluation scenarios), Rollout (simulates interactions), and Judgment (scores transcripts for behavior presence). Unlike fixed evaluation sets, Bloom produces different scenarios on each run while measuring the same underlying behavior, which prevents evaluation contamination and keeps the suite flexible.
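To make the pipeline concrete, here is a minimal sketch of how a four-stage loop like this fits together. It's an illustrative outline under our own naming, not Bloom's actual API; every function, model name, and score scale below is a placeholder.

```python
# Minimal sketch of a Bloom-style four-stage loop (hypothetical API, not Bloom's real interface).
from dataclasses import dataclass

@dataclass
class Rollout:
    scenario: str
    transcript: list[str]
    score: float | None = None

def understand(behavior_description: str) -> str:
    """Stage 1: expand the researcher-specified behavior into an operational definition."""
    return f"Operational definition of: {behavior_description}"

def ideate(definition: str, n_scenarios: int, seed: int) -> list[str]:
    """Stage 2: generate fresh evaluation scenarios; a new seed yields new scenarios."""
    return [f"[seed={seed}] scenario {i} probing: {definition}" for i in range(n_scenarios)]

def rollout(scenario: str, target_model: str) -> Rollout:
    """Stage 3: simulate an interaction between an auditor agent and the target model."""
    transcript = [f"auditor: {scenario}", f"{target_model}: <simulated reply>"]
    return Rollout(scenario=scenario, transcript=transcript)

def judge(r: Rollout, judge_model: str) -> Rollout:
    """Stage 4: score the transcript for presence of the target behavior."""
    r.score = 0.0  # placeholder; a real judge model would grade the transcript
    return r

def run_eval(behavior: str, target_model: str, judge_model: str, seed: int = 0) -> list[Rollout]:
    definition = understand(behavior)
    scenarios = ideate(definition, n_scenarios=5, seed=seed)
    return [judge(rollout(s, target_model), judge_model) for s in scenarios]

results = run_eval("self-preferential bias", target_model="model-under-test", judge_model="judge-model")
```

The point of the structure is that only the behavior description and the seed are fixed; everything downstream is regenerated on each run.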
The validation results are genuinely impressive. Bloom successfully distinguished intentionally misaligned "model organisms" from production Claude models in 9 out of 10 cases. When it failed on the tenth (self-promotion), manual review revealed the baseline model actually exhibited similar behavior, meaning Bloom was correct not to distinguish them.
More critically: with Claude Opus 4.1 as the judge, Bloom's behavior-presence scores correlate with human judgment at 0.86. That's not perfect, but it's sufficient for rapid iteration and large-scale screening. The system shows particularly strong agreement at the score extremes, exactly where it matters most for determining whether a behavior is present or absent.
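If you want to see what that agreement claim means operationally, here is a small illustration with made-up scores (not Anthropic's data): compute the Pearson correlation across all transcripts, then check whether judge and human agree on the clear-cut cases at either end of the scale.

```python
# Illustrative check of judge-vs-human agreement (hypothetical data, not Anthropic's).
import numpy as np

human = np.array([0, 1, 2, 8, 9, 10, 1, 9, 0, 10], dtype=float)   # human behavior-presence scores
judge = np.array([0, 2, 3, 7, 9, 10, 1, 8, 1, 10], dtype=float)   # automated judge scores

# Overall agreement: Pearson correlation across all scored transcripts.
r = np.corrcoef(human, judge)[0, 1]

# Agreement at the extremes: does the judge call clear presence/absence the same way?
extreme = (human <= 2) | (human >= 8)
extreme_match = np.mean((judge[extreme] >= 5) == (human[extreme] >= 5))

print(f"Pearson r = {r:.2f}, agreement on extreme cases = {extreme_match:.0%}")
```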
Anthropic replicated an evaluation from Claude Sonnet 4.5's system card measuring self-preferential bias—models favoring themselves in decision-making tasks. Bloom reproduced the same model rankings as the original evaluation methodology, confirming Sonnet 4.5 exhibits the least bias.
But here's where automated generation enables deeper investigation: Bloom discovered that increased reasoning effort reduces self-preferential bias in Claude Sonnet 4, with the largest improvement between the medium and high thinking levels. The improvement didn't come from Sonnet 4 spreading its picks more evenly across other models; instead, it increasingly recognized the conflict of interest and declined to judge at all.
This is the kind of nuanced finding that emerges when you can rapidly iterate evaluations rather than laboriously hand-crafting them once. Anthropic could test different configurations, filter rollouts by quality criteria, and explore how various parameters affect results—all because the generation pipeline is automated and reproducible.
Bloom is available on GitHub with full technical documentation. This matters enormously. AI safety research suffers from reproducibility problems and evaluation brittleness—labs announce safety metrics without releasing evaluation details, making independent verification impossible. Publishing frameworks is better than publishing results.
Anthropic notes that "early adopters are already using Bloom to evaluate nested jailbreak vulnerabilities, test hardcoding, measure evaluation awareness, and generate sabotage traces." This is open-source working as intended: the research community immediately extending capabilities beyond the original use cases.
The framework integrates with Weights & Biases for experiments at scale and exports Inspect-compatible transcripts. It includes sample seed files to get started. These aren't trivial implementation details—they're the difference between "interesting research project" and "tool that other researchers will actually use."
Bloom evaluations must be cited with their seed configurations because different seeds produce different scenarios. This is simultaneously a strength (prevents contamination) and a limitation (makes exact replication harder). Researchers need to understand that Bloom measures behavioral tendencies across generated scenarios, not performance on fixed benchmark tasks.
The tool also requires careful configuration of models for each stage, interaction length and modality, scenario diversity, and secondary scoring dimensions. This flexibility is valuable for sophisticated users but creates complexity for researchers who just want default settings that work. Anthropic will need to provide clearer guidance on recommended configurations for common use cases.
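As a rough illustration of the kinds of knobs involved, a run configuration might look something like the sketch below. The field names are ours, not Bloom's actual schema; check the GitHub documentation for the real options.

```python
# Hypothetical configuration sketch showing the kinds of knobs described above;
# field names are illustrative, not Bloom's actual schema.
seed_config = {
    "behavior": "self-preferential bias",      # researcher-specified behavior to probe
    "models": {
        "ideation": "ideation-model",          # generates candidate scenarios
        "target": "model-under-test",          # model whose behavior is measured
        "judge": "judge-model",                # scores transcripts for behavior presence
    },
    "rollout": {
        "max_turns": 6,                        # interaction length
        "modality": "chat",                    # chat vs. tool-use / agentic settings
    },
    "ideation": {
        "n_scenarios": 50,                     # scenario count per run
        "diversity_penalty": 0.7,              # push scenarios apart from one another
        "random_seed": 1234,                   # cite this with results: new seeds -> new scenarios
    },
    "judgment": {
        "primary_dimension": "behavior_presence",
        "secondary_dimensions": ["realism", "coherence"],
    },
}
```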
Most importantly: automated evaluation is only as good as the behavior descriptions researchers provide. Garbage in, garbage out still applies. Bloom accelerates evaluation generation; it doesn't substitute for careful thinking about what behaviors actually matter for alignment.
If safety research takes months while capability research takes weeks, safety will always lag. Bloom doesn't solve all evaluation challenges, but it addresses the velocity problem directly. Researchers can now test hypotheses about model behavior at speeds comparable to capability development cycles.
This matters for practical deployment decisions. When organizations evaluate whether to deploy new AI systems, they need timely safety assessments—not evaluations that took six months and are already obsolete. Bloom enables continuous behavioral monitoring as models evolve, rather than snapshot assessments that become stale immediately.
The correlation with human judgment (0.86) means Bloom can handle initial screening and large-scale analysis, reserving expensive human evaluation for edge cases and final validation. This is appropriate division of labor: automate the bulk work, apply human judgment where it's irreplaceable.
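In practice, that division of labor can be as simple as thresholding the judge's scores and routing only the ambiguous middle to human reviewers. The cutoffs below are illustrative, not recommendations.

```python
# Sketch of the triage described above: the automated judge handles bulk screening,
# ambiguous mid-range scores go to human review. Thresholds are illustrative.
def triage(judge_scores: dict[str, float], low: float = 2.0, high: float = 8.0):
    auto_clear, auto_flag, needs_human = [], [], []
    for transcript_id, score in judge_scores.items():
        if score <= low:
            auto_clear.append(transcript_id)   # behavior confidently absent
        elif score >= high:
            auto_flag.append(transcript_id)    # behavior confidently present
        else:
            needs_human.append(transcript_id)  # ambiguous: reserve for human evaluation
    return auto_clear, auto_flag, needs_human

clear, flagged, review = triage({"t1": 0.5, "t2": 9.2, "t3": 5.1})
```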
Most AI safety announcements describe problems or propose theoretical frameworks. Bloom ships working code that researchers are already using. It solves a real bottleneck in safety research—evaluation generation speed—with empirically validated methods and open-source implementation.
This is how safety tooling should work: identifying concrete obstacles to alignment research, building practical solutions, validating them rigorously, and releasing them for community use. Anthropic deserves credit not just for the technical work but for the research philosophy: safety research needs infrastructure as much as it needs insights.
Winsome Marketing's growth consultants help teams implement AI safety evaluation frameworks appropriate to deployment risk. Let's discuss responsible AI governance.