Supervised Reinforcement Learning: Teaching Small Models to Reason Step by Step
3 min read
Writing Team : Nov 18, 2025 7:00:01 AM
A research team has published a new paper introducing Supervised Reinforcement Learning (SRL), a training framework designed to help smaller open-source language models solve multi-step reasoning problems that previously stumped them. The work addresses a fundamental training gap that has limited the capabilities of models that can't afford to sample millions of solutions.
Current training methods for reasoning tasks fall into two camps, and both have significant limitations:
Supervised Fine-Tuning (SFT) trains models by showing them complete solution demonstrations and teaching them to imitate each token. The problem: models tend to memorize rigid patterns rather than learning flexible reasoning. They overfit to the exact sequence of steps in training examples and struggle when problems require slightly different approaches.
Reinforcement Learning with Verifiable Rewards (RLVR) trains models by letting them attempt solutions repeatedly and rewarding correct answers. The problem: when correct solutions are extremely rare—which they are for challenging problems and smaller models—the model never receives positive feedback. You can't learn from rewards you never get.
This creates a catch-22 for small open-source models. SFT makes them rigid. RLVR starves them of learning signals. Neither approach enables them to tackle genuinely difficult multi-step problems.
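To make the gap concrete, here is a minimal sketch of the two training signals. The function names and the exact-string answer check are illustrative, not taken from the paper or any particular codebase:

```python
# Conceptual sketch of the two training signals described above.
# Names and the exact-match check are illustrative assumptions.

from typing import List

def sft_loss(token_log_probs: List[float]) -> float:
    """SFT: average negative log-likelihood of every token in the expert
    demonstration. Dense signal, but it rewards exact imitation."""
    return -sum(token_log_probs) / len(token_log_probs)

def rlvr_reward(final_answer: str, reference_answer: str) -> float:
    """RLVR: one binary reward on the final answer. A model that never
    produces a correct answer always gets 0 and learns nothing."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

print(rlvr_reward("x = 5", "x = 4"))  # 0.0 -- a near-miss earns no signal at all
```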
Supervised Reinforcement Learning reformulates problem-solving as a sequence of discrete logical "actions" rather than raw text generation. The key innovation: the model generates an internal reasoning monologue before committing to each action, and receives partial credit based on how similar its actions are to expert demonstrations.
Here's the structure:
Step-wise action decomposition: Instead of treating a solution as one continuous text stream, SRL breaks it into logical steps. Each step is an "action" that advances toward the solution.
Internal reasoning: Before each action, the model generates explicit reasoning about what to do next and why. This forces deliberate thinking rather than pattern matching.
Smoother rewards: Rather than binary correct/incorrect feedback, SRL provides graduated rewards based on action similarity to expert demonstrations from the SFT dataset. Even when the final answer is wrong, the model receives credit for reasoning steps that matched expert logic.
Flexible guidance: Because rewards are based on action-level similarity rather than exact token matching, the model can explore alternative reasoning paths while still learning from expert examples.
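Here is a rough sketch of the step-wise, similarity-based reward just described. Using Python's difflib as the similarity measure is a stand-in assumption; the paper's actual metric may differ:

```python
import difflib
from typing import List

def step_reward(model_action: str, expert_action: str) -> float:
    """Graduated reward for one step: similarity between the model's action and
    the expert's action at the same position (SequenceMatcher is a stand-in)."""
    return difflib.SequenceMatcher(None, model_action, expert_action).ratio()

def trajectory_rewards(model_actions: List[str], expert_actions: List[str]) -> List[float]:
    """Score each step independently, so a rollout that ends in a wrong answer
    still earns credit for the intermediate actions that matched expert logic."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]

expert = ["Isolate the x term: 3x = 12", "Divide both sides by 3: x = 4"]
model  = ["Isolate the x term: 3x = 12", "Divide both sides by 3: x = 5"]
print(trajectory_rewards(model, expert))  # [1.0, ~0.97] despite the wrong final answer
```

Because the score is computed per step, even a completely failed rollout contributes a learning signal, which is exactly what RLVR's binary reward withholds.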
The researchers demonstrate that SRL enables small models to solve problems that neither SFT nor RLVR could crack on its own. More significantly, they found that initializing training with SRL and then refining with RLVR produces the strongest overall performance, combining the rich learning signals of expert demonstrations with the precision of verifiable rewards.
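As a sketch of that staged recipe: the ordering (SRL first, RLVR second) is the finding, while the split and epoch counts below are arbitrary placeholders rather than numbers from the paper:

```python
from typing import List

def objective_schedule(total_epochs: int, srl_fraction: float = 0.6) -> List[str]:
    """Which objective to optimize in each epoch: SRL warm-up first, then RLVR
    refinement. The 60/40 split is an arbitrary illustrative choice."""
    srl_epochs = round(total_epochs * srl_fraction)
    return ["SRL"] * srl_epochs + ["RLVR"] * (total_epochs - srl_epochs)

print(objective_schedule(5))  # ['SRL', 'SRL', 'SRL', 'RLVR', 'RLVR']
```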
The framework also generalizes beyond pure reasoning benchmarks to agentic software engineering tasks, suggesting the approach scales across different types of structured problem-solving.
Large frontier models from OpenAI, Anthropic, and Google can afford brute-force approaches. They sample millions of potential solutions until they find correct ones, then learn from those. Smaller open-source models can't do this—they have limited compute budgets and can't generate enough samples to stumble onto rare correct solutions.
SRL offers a path forward by making every training attempt informative. Even failed rollouts provide learning signals through partial credit for correct intermediate steps. This is particularly valuable for the open-source ecosystem, where most models operate with far fewer resources than frontier labs.
The paper describes extracting expert actions from existing SFT datasets, which means SRL doesn't require new data collection—it reuses training data more effectively. The action-similarity rewards are computed step-wise during training, providing continuous feedback rather than waiting for complete solutions.
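Here is a minimal sketch of that reuse, assuming each line of an expert demonstration counts as one action; the paper may segment steps differently, and the field names below are made up for illustration:

```python
from typing import Dict, List

def extract_actions(expert_solution: str) -> List[str]:
    """Treat each non-empty line of the demonstration as one logical action."""
    return [line.strip() for line in expert_solution.splitlines() if line.strip()]

def build_srl_examples(problem: str, expert_solution: str) -> List[Dict[str, object]]:
    """One training example per step: the problem plus the expert actions so far,
    with the next expert action as the target the model's action is scored against."""
    actions = extract_actions(expert_solution)
    return [
        {"context": problem, "prefix": actions[:i], "target_action": actions[i]}
        for i in range(len(actions))
    ]

examples = build_srl_examples(
    "Solve 3x + 2 = 14 for x.",
    "Subtract 2 from both sides: 3x = 12\nDivide both sides by 3: x = 4",
)
print(len(examples))  # 2 step-level examples recovered from one SFT demonstration
```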
The internal reasoning monologue serves dual purposes: it makes the model's thinking explicit for evaluation, and it forces a more deliberate problem-solving process that prevents shortcut learning.
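One plausible way to structure that monologue-then-action output is a tagged template like the one below; the tag names are hypothetical, since the post doesn't spell out the paper's exact format:

```python
# Hypothetical output format for reasoning-before-acting. The <think>/<action>
# tag names are placeholders to illustrate the pattern, not the paper's format.

STEP_TEMPLATE = (
    "<think>\n"
    "{reasoning}\n"   # the internal monologue: why this is the right next step
    "</think>\n"
    "<action>\n"
    "{action}\n"      # the committed step that gets similarity-scored
    "</action>"
)

print(STEP_TEMPLATE.format(
    reasoning="The equation is now 3x = 12, so isolating x means dividing by 3.",
    action="Divide both sides by 3: x = 4",
))
```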
SRL represents a pragmatic middle ground between pure imitation and pure reward-seeking. By combining the structure of supervised learning with the flexibility of reinforcement learning, it addresses practical limitations that have constrained smaller models.
The framework's success on both reasoning benchmarks and software engineering tasks suggests it may be useful for any domain where problems decompose into logical steps and expert demonstrations exist. This could include complex content generation, multi-step data analysis, structured decision-making, and workflow automation.
For teams building specialized models on limited compute budgets, SRL offers a training approach that doesn't depend on finding needles in haystacks: it teaches models to recognize the shape of the needle without having to search the entire haystack. That's a meaningful advance for the practical deployment of reasoning-capable AI outside frontier labs.
The research is available on arXiv (2510.25992) with full technical details and experimental results.