
New Study Measures What AI Coding Studies Ignore


We finally have research that asks the right question about AI coding tools.

Dave Farley, co-author of Continuous Delivery and host of the Modern Software Engineering channel, recently published findings from a pre-registered controlled experiment that measured something most AI productivity studies completely ignore: what happens when the next developer has to maintain AI-generated code.

This matters because maintenance accounts for 50-80% of total software ownership costs over a system's lifetime, which at the upper end of that range is three to four times the cost of initial development. Yet most AI coding studies stop at "did the developer finish faster?" That's measuring typing speed, not engineering impact.

The Study That Actually Simulated Real Work

This wasn't undergraduate students completing toy assignments. The research involved 151 participants, 95% of whom were professional software developers—a rarity in academic studies that typically rely on student populations because they're easier to recruit.

The experiment used a two-phase design that mirrors actual software development reality:

Phase One: Developers added features to buggy, unpleasant Java web application code. Some used AI assistants (GitHub Copilot, Cursor, Claude Code, ChatGPT). Others worked without AI.

Phase Two: A different set of developers was randomly assigned the code produced in Phase One and asked to evolve it, without knowing whether it was originally written with AI assistance or not. Crucially, no AI assistance was allowed in Phase Two.

This design isolates the key variable: how easy is AI-generated code for someone else to change later? That's the actual test of code health and maintainability.
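
To make the design concrete, here is a minimal Python sketch of that blinded random assignment. The names (Artifact, assign_phase_two, the repo IDs) are illustrative assumptions, not the study's actual tooling:

```python
import random
from dataclasses import dataclass

@dataclass
class Artifact:
    """A Phase One codebase handed off for Phase Two evolution."""
    repo_id: str
    ai_assisted: bool  # recorded for the analysis, never shown to Phase Two developers

def assign_phase_two(artifacts, developers, seed=42):
    """Randomly pair each Phase Two developer with one Phase One artifact.

    The pairing is blinded: developers see only the repo, not the
    ai_assisted flag, and they work without AI tools.
    """
    rng = random.Random(seed)
    shuffled = list(artifacts)
    rng.shuffle(shuffled)
    return {dev: art.repo_id for dev, art in zip(developers, shuffled)}

# Three blinded assignments (made-up data)
pool = [Artifact("repo-a", True), Artifact("repo-b", False), Artifact("repo-c", True)]
print(assign_phase_two(pool, ["dev-1", "dev-2", "dev-3"]))
```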

What They Measured (And Why It Matters)

The researchers didn't guess—they measured multiple dimensions of maintainability:

  • Time: How long the next developer took to evolve the code
  • Objective code quality: Using CodeScene's code health metric
  • Test coverage: Actual percentage of code under test
  • Perceived productivity: Using the SPACE framework

This multi-dimensional approach acknowledges that maintainability isn't a single magic number. Anyone claiming otherwise should be treated with suspicion.
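
As a rough sketch of what a single observation in such a dataset might look like, here is a hypothetical schema; the field names and values below are assumptions for illustration, not the study's actual data format:

```python
from dataclasses import dataclass

@dataclass
class MaintainabilityObservation:
    """One Phase Two data point across the four measured dimensions."""
    evolution_time_minutes: float  # how long the evolution task took
    code_health_score: float       # CodeScene code health (roughly a 1-10 scale)
    test_coverage_pct: float       # percentage of code under test
    space_productivity: float      # self-reported rating from the SPACE framework

# A single made-up observation
obs = MaintainabilityObservation(
    evolution_time_minutes=95.0,
    code_health_score=8.2,
    test_coverage_pct=61.5,
    space_productivity=4.0,
)
```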

The Findings That Challenge Both Hype and Fear

The headline result: There was no significant difference in maintenance cost between AI-generated and human-generated code.

From a downstream perspective, code written with AI assistance was neither harder nor easier to change, and neither worse nor better in quality. AI didn't break anything. Given the fear-mongering around "AI slop," that's a significant finding, and one that appears to be new to this research.

The expected result: AI users in Phase One were approximately 30% faster to reach a solution. Habitual AI users were closer to 55% faster. Yes, AI speeds up initial development. That's no longer controversial.

The interesting result: When experienced developers who already knew what they were doing used AI habitually, their code showed a small but measurable improvement in maintainability later on.

One explanation: AI tends to produce boring, idiomatic, unsurprising code. And boring code is maintainable code. Surprise is usually the enemy of maintainability.
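
A contrived illustration of that point (not from the study): both functions below compute the same word counts, but the boring version is the one the next developer will thank you for.

```python
import itertools
from collections import defaultdict

def word_counts_clever(posts):
    # Dense comprehension: correct, but surprising to the next reader.
    return {author: sum(len(text.split()) for _, text in group)
            for author, group in itertools.groupby(
                sorted(posts, key=lambda p: p[0]), key=lambda p: p[0])}

def word_counts_boring(posts):
    # Plain loop: idiomatic, unsurprising, and easy to change later.
    counts = defaultdict(int)
    for author, text in posts:
        counts[author] += len(text.split())
    return dict(counts)

posts = [("ana", "hello world"), ("bo", "hi"), ("ana", "more words here")]
assert word_counts_clever(posts) == word_counts_boring(posts) == {"ana": 5, "bo": 1}
```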


The Critical Caveat: AI Amplifies, It Doesn't Replace

What's absolutely clear from the research: AI does not automatically improve code quality. Developer skill matters more than AI usage.

As Farley notes, "AI code assistance acts as a kind of amplifier. If you're already doing the right things, AI will amplify the impact of those things. If you're already doing the wrong things, AI will help you to dig a deeper hole faster."

This aligns with recent DORA research on AI impact: tools amplify capability; they don't replace it.

Jason Gorman's breakdown of "doing the right things" in AI-assisted coding includes:

  • Working in small batches, solving one problem at a time
  • Iterating rapidly with continuous testing, code review, refactoring, and integration
  • Architecting highly modular designs that localize the blast radius for changes
  • Organizing around end-to-end outcomes instead of role or technology specialisms
  • Working with high autonomy, making timely decisions instead of escalating everything

In other words: fundamental software engineering discipline still matters—perhaps more than ever.

The Long-Term Risks Nobody's Measuring

The study authors highlight two slippery slopes toward disaster:

Code bloat: When generating code becomes almost free, teams generate far too much of it. Volume alone drives complexity, and AI makes it easier than ever to drown in your own codebase.

Cognitive debt: If developers stop thinking deeply about the code they create, understanding erodes, skills atrophy, and innovation slows. This long-term risk doesn't show up in sprint metrics.

What Marketing and Growth Teams Should Learn

If you're building marketing technology systems, internal tools, or automation platforms, this research offers practical guidance:

AI coding tools improve short-term productivity without damaging maintainability—when used by people who already understand good engineering practices. They don't remove the need for good design, decomposition skills, or hard thinking about problem-solving.

The real technical skill isn't typing speed. It's decomposition—breaking problems into small pieces that AI assistants can handle well, then guiding them toward solutions you're actually happy with.
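
Here is a hypothetical sketch of that decomposition skill; the task and function names are invented for illustration. Rather than asking an assistant for one monolithic importer, break the job into small, independently testable pieces and compose them:

```python
# Hypothetical example: importing a CSV of leads into a marketing system.
# Three small pieces are easier to prompt for, review, and test than one
# monolithic "import_leads" routine.

import csv
from typing import Iterable

def parse_rows(path: str) -> list[dict]:
    """Read raw rows from a CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def normalize(row: dict) -> dict:
    """Trim whitespace and lowercase email addresses."""
    return {
        "name": row.get("name", "").strip(),
        "email": row.get("email", "").strip().lower(),
    }

def deduplicate(rows: Iterable[dict]) -> list[dict]:
    """Keep the first occurrence of each email."""
    seen, result = set(), []
    for row in rows:
        if row["email"] and row["email"] not in seen:
            seen.add(row["email"])
            result.append(row)
    return result

def import_leads(path: str) -> list[dict]:
    """Compose the small pieces into the full workflow."""
    return deduplicate(normalize(r) for r in parse_rows(path))
```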

Need help building AI-assisted development practices that prioritize long-term maintainability? Winsome's growth experts help teams implement AI tools strategically—not recklessly.


Study Methodology, Approach, and Key Findings

Source: Dave Farley, Modern Software Engineering channel (February 2025)

Methodology

  • Participants: 151 total, 95% professional software developers (not students)
  • Design: Pre-registered controlled experiment with two phases
  • Technology: Java web application with realistic complexity
  • Phase One: Developers add features to buggy code; some use AI assistants (GitHub Copilot, Cursor, Claude Code, ChatGPT), others don't
  • Phase Two: Different developers randomly assigned Phase One code to evolve; no knowledge of whether AI-assisted; no AI tools allowed
  • Control: Variables measured rather than assumed

Measurements

  • Time to complete evolution tasks
  • Objective code quality (CodeScene's code health metric)
  • Test coverage percentages
  • Perceived productivity (SPACE framework)
  • Multi-dimensional approach acknowledging maintainability isn't a single metric

Key Findings

  • No significant difference in maintenance cost between AI-generated and human-generated code
  • No quality difference downstream—code neither harder nor easier to change
  • 30% speed increase for AI users in initial development (Phase One)
  • 55% speed increase for habitual AI users in initial development
  • Small measurable improvement in maintainability when experienced developers used AI habitually
  • Developer skill matters more than AI usage in determining code quality
  • AI acts as an amplifier: it strengthens whichever practices are already present, good or bad
  • No evidence of hidden costs from AI-assisted development in maintenance phase

Identified Risks

  • Code bloat: Nearly-free code generation encourages over-production and complexity
  • Cognitive debt: Reduced thinking leads to eroded understanding and atrophied skills over time
  • Long-term risks don't appear in short-term sprint metrics

Critical Conclusion

AI assistants improve short-term productivity without damaging maintainability—but only when used by developers who already practice good engineering discipline, decomposition, and thoughtful problem-solving.
