We've all been there. You prompt an LLM to write code, it spits out something that technically works, but it doesn't feel right. The variable names are generic. The structure violates your team's conventions. It passes the tests but fails the smell test. You iterate, reprompt, coax the model toward something you'd actually commit to production. That back-and-forth isn't a bug—it's the entire workflow. And until now, we've had no systematic way to measure whether models are getting better at it.
Enter "Vibe Checker," a research paper from a team including Ming Zhong, Xiang Zhou, and colleagues at Google and the University of Illinois. Published on arXiv October 8, 2025, this work introduces the first rigorous framework for evaluating what they call "vibe coding"—the iterative process of generating and refining code until it passes your vibe check. Not just functional correctness, but the full spectrum of human preference: readability, intent preservation, stylistic consistency, and adherence to non-functional instructions.
This is the evaluation methodology we've been waiting for. And the results reveal something crucial: even the strongest models are failing at the things that matter most to real developers.
The current standard for evaluating code generation models is pass@k, a metric that estimates the probability that at least one of k generated samples passes a given set of test cases. It's clean, deterministic, and fundamentally incomplete. Pass@k treats code as a black box. It doesn't care if your function is named foo, your logic is nested five layers deep, or your solution ignores explicit instructions about performance constraints.
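For reference, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly drawn samples is correct. A minimal Python sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (of which c pass the tests) is correct."""
    if n - c < k:
        return 1.0  # every possible draw of k samples contains a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 generations per problem, 140 of which pass the unit tests.
print(pass_at_k(n=200, c=140, k=1))   # 0.70
print(pass_at_k(n=200, c=140, k=10))  # ~1.0
```

Nothing in that number says anything about how the passing samples are written, which is exactly the blind spot.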
In real-world use, developers don't just want code that works. They want code that adheres to team conventions, respects architectural constraints, uses specified libraries, avoids deprecated methods, and reads like a human wrote it. These aren't edge cases; they're the norm. In the 2024 Stack Overflow Developer Survey, 76% of developers said they are using or planning to use AI tools in their development process, and much of the time those tools save on generation gets spent refining the output to meet style and maintainability standards.
The Vibe Checker research team hypothesizes that instruction following is the missing piece underlying the vibe check. They argue that human preference in code evaluation is a composite of functional correctness and adherence to non-functional instructions—the kind developers routinely specify but models routinely ignore.
To test this hypothesis, the team built VeriCode, a taxonomy of 30 verifiable code instructions, each paired with a deterministic verifier. These aren't subjective vibes; they're measurable constraints that represent real developer requirements, the kind a script can check with a yes-or-no answer.
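The paper ships its own verifiers, which aren't reproduced here. As a purely hypothetical illustration of what "deterministic" means in this context, here are two checks of the kind such a taxonomy could include, one for docstring presence and one for snake_case naming (both are our assumptions, not quotes from VeriCode):

```python
import ast
import re

def has_docstrings(code: str) -> bool:
    """Deterministic check: every function definition carries a docstring."""
    return all(
        ast.get_docstring(node) is not None
        for node in ast.walk(ast.parse(code))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )

def uses_snake_case(code: str) -> bool:
    """Deterministic check: every function name follows snake_case."""
    return all(
        re.fullmatch(r"[a-z_][a-z0-9_]*", node.name)
        for node in ast.walk(ast.parse(code))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )

snippet = '''
def total_sessions(rows):
    """Sum the sessions column from a list of export rows."""
    return sum(row["sessions"] for row in rows)
'''

print(has_docstrings(snippet), uses_snake_case(snippet))  # True True
```

Either check passes or it doesn't: no LLM judge, no arguing about taste.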
They then augmented established evaluation suites (HumanEval, MBPP, etc.) with these instructions, creating Vibe Checker—a testbed that assesses both functional correctness and instruction following simultaneously.
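The paper's exact task format isn't reproduced here, but as a rough mental model (the field names below are assumptions, not the actual schema), an augmented item pairs an ordinary benchmark task with a handful of verifiable instructions:

```python
# Rough mental model of an augmented benchmark item; field names are
# illustrative assumptions, not the paper's actual schema.
augmented_task = {
    "prompt": "Write a function that deduplicates a list while preserving order.",
    "tests": [
        ("dedupe([1, 2, 2, 3, 1])", [1, 2, 3]),
    ],
    "instructions": [
        "Add a docstring to every function.",
        "Keep the solution under 15 lines.",
        "Do not import any third-party libraries.",
    ],
}

# Evaluation then has two axes instead of one: did the tests pass, and did
# each deterministic verifier confirm its instruction was followed?
```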
This is important. For the first time, we can quantify whether a model not only solves the problem but solves it the way you asked.
The team evaluated 31 leading LLMs using Vibe Checker, including GPT-4, Claude 3.5 Sonnet, Gemini 1.5 Pro, and various open-source models. The findings are striking: even the strongest models stumble when asked to satisfy several non-functional instructions at once, functional correctness tends to slip as instructions pile up, and a composite of correctness and instruction following tracks human preference far better than pass rate alone.
Put simply: a solution that's 90% correct and follows all your instructions beats a 100% correct solution that ignores half your requirements. Because in production, you'll spend more time fixing the latter.
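To make that comparison concrete, here is a toy composite score; the equal weights are an arbitrary assumption for illustration, not the paper's metric:

```python
def composite_score(tests_passed: int, tests_total: int,
                    followed: int, instructions_total: int,
                    w_correct: float = 0.5, w_follow: float = 0.5) -> float:
    """Blend functional correctness and instruction adherence into one number.
    The weights here are illustrative; a real suite would calibrate them
    against human preference data."""
    correctness = tests_passed / tests_total
    following = followed / instructions_total
    return w_correct * correctness + w_follow * following

# A fully correct solution that ignores half the instructions...
print(composite_score(10, 10, 2, 4))  # 0.75
# ...scores below a 90%-correct one that follows every instruction.
print(composite_score(9, 10, 4, 4))   # 0.95
```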
If you're building marketing automation, personalization engines, or content generation pipelines with LLM-generated code, this research has immediate implications:
You're already doing vibe coding, whether you call it that or not. Every time you prompt an LLM to "write a Python script to aggregate GA4 data but use pandas not numpy and keep it under 50 lines," you're specifying non-functional requirements. Vibe Checker suggests most models are ignoring at least one of those constraints.
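You can turn those asks into checks on your own side, too. Here is a minimal sketch against that prompt; the generated snippet and its GA4 column names are made up for illustration:

```python
import ast

def imported_modules(code: str) -> set[str]:
    """Top-level module names imported anywhere in the snippet."""
    mods = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def passes_prompt_constraints(code: str) -> bool:
    """The three asks from the prompt: use pandas, skip numpy, stay under 50 lines."""
    mods = imported_modules(code)
    return (
        "pandas" in mods
        and "numpy" not in mods
        and len(code.strip().splitlines()) < 50
    )

generated = '''
import pandas as pd

def daily_sessions(path: str) -> pd.DataFrame:
    """Aggregate a GA4 export CSV into total sessions per day."""
    df = pd.read_csv(path, parse_dates=["date"])
    return df.groupby("date", as_index=False)["sessions"].sum()
'''

print(passes_prompt_constraints(generated))  # True
```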
Current benchmarks don't reflect your workflow. If you're evaluating models based on pass@k alone, you're optimizing for the wrong thing. The code that scores highest on functional correctness may be the hardest to maintain, debug, or integrate into your existing stack.
Better benchmarks drive better models. Models underperform on instruction following partly because they're not trained or evaluated on it. Vibe Checker provides a concrete framework for model developers to target. We should expect rapid improvement here; Google's own models will likely prioritize these metrics in future releases.
This may explain why Claude feels different. Anthropic's models have consistently felt more aligned with developer intent, even when functional correctness is comparable to competitors. Vibe Checker offers a possible explanation: Claude may be better at implicit instruction following, even if it hasn't been explicitly trained on VeriCode's taxonomy. Anecdotal evidence from developer communities on Reddit and 𝕏 supports this: users frequently cite Claude's ability to "understand what I meant" as a key differentiator.
The Vibe Checker paper doesn't just diagnose the problem—it provides a roadmap. The 30-instruction taxonomy is open and extensible. Model developers can now train explicitly on instruction-following objectives. Evaluation suites can adopt composite scoring that weights both correctness and constraint adherence.
For marketing teams using AI code generation, this means the models you rely on should get measurably better at respecting the constraints you specify, and you now have a concrete yardstick for evaluating them against your own workflow instead of pass@k alone.
The research team's conclusion is optimistic: "Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding." They're right. This is the kind of research that changes how models are built.
What makes Vibe Checker genuinely important isn't just the taxonomy or the benchmark—it's the underlying insight that human preference is measurable and should be the primary optimization target. We've spent years chasing functional correctness as a proxy for utility, when what users actually want is alignment with intent.
This applies far beyond code generation. In marketing content, creative workflows, and strategic analysis, the gap between "technically correct" and "feels right" is where real value lives. Vibe Checker proves we can quantify that gap. And once you can measure something, you can improve it.
The vibe check isn't vibes anymore. It's a testable hypothesis, a concrete benchmark, and a clearer path toward models that do what we actually want—not just what we technically asked for.
If you're building marketing systems with AI-generated code and need a team that understands both the technical constraints and the human preferences that matter, we're here. Let's build something that passes the vibe check.