TOKYO, JAPAN - FEBRUARY 3: OpenAI CEO Sam Altman speaks during a talk session with SoftBank Group CEO Masayoshi Son at an event titled "Transforming Business through AI" in Tokyo, Japan, on February 03, 2025. SoftBank and OpenAI announced that they have agreed a partnership to set up a joint venture for artificial intelligence services in Japan today. (Photo by Tomohiro Ohsumi/Getty Images)

Coding agents have cleared nearly every software benchmark that existed two years ago, and venture capital has responded accordingly. But a new MIT study across more than 100,000 developers shows the productivity gap that benchmarks cannot see: AI agents boosted the volume of code written by roughly 180%, while the amount of code that actually shipped to production rose by only about 30%. The gap between writing and shipping is where the real AI investment story is.

Venture capital has funneled billions into AI coding tools since Cognition's Devin launched in early 2024, solving just 13% of tasks on SWE-Bench, a standard software benchmark. Eighteen months later the best agents score in the high eighties on the same test, a pace of improvement that has convinced many investors that software engineering is a solved market. Sarah Guo, founder of Conviction, argued this week that the investor community has drawn precisely the wrong lesson from that trajectory.

"Nearly everyone drew the same wrong lesson: the model ate software engineering," Guo wrote. "But as the model swallowed the part of software engineering you can best measure, we're relearning what many teams knew: engineering has always resisted measurement, and the most measurable parts may not be the only important ones."

The MIT data explains why: code generation is verifiable at near-zero cost. A compiler either accepts the output or it does not; a test suite passes or it fails. When verification is free, models can be trained against the check millions of times until they beat it.
What cannot be verified cheaply is whether a given change is the right one for a specific production system, such as a decade-old codebase with undocumented dependencies and a deploy pipeline no one will own. That correctness cannot be read off a leaderboard. It can only be confirmed by running the system long enough under real load, a clock that no model capability improvement can shorten.

Noam Brown, who led development of OpenAI's reasoning models, framed the constraint: the only reliable way to evaluate an agent across a one-year time horizon may be to run it for a year. Investors pricing AI application companies on benchmark progress are measuring the part of software work that is already becoming a commodity, not the part that retains pricing power.

Guo maps the economics in terms that should resonate with anyone who has sat through a SaaS pitch. A token spent answering a generic query is worth almost nothing because any model can supply the answer. A token spent reasoning over a specific company's private data is worth substantially more, because it delivers the output that company actually needs rather than a plausible approximation. The delta between those two token prices is where durable margin lives, and it is not a function of model capability. It is a function of data access, trust, and the accumulated cost of institutional integration.

That integration cost is also a moat. Sierra AI charges only when its agent fully resolves a customer issue, and nothing when the problem escalates to a human. That pricing structure is only sustainable for a company that has already earned the right to define what resolution means inside a specific client's workflow. Cognition offers a performance guarantee on Devin for the same structural reason: outcome-based pricing requires enough system access to verify the outcome. Both pricing models are harder to replicate than the underlying model capability they run on.

The same dynamic surfaces in the legal vertical.
Harvey AI has published its own benchmark for legal work, effectively writing the definition of acceptable AI output for law firms that already use the product. The authority to set that standard came from adoption, not from training. A foundation lab cannot acquire that standing by releasing a better model, because the standing exists inside the profession, not inside the weights.

The fear that foundation labs will eventually undercut the application layer by building first-party products has become a standard objection in venture pitches. Guo addresses it directly, and the competitive structure of the market supports her position. The foundation model layer currently looks like a multi-way contest among OpenAI, Anthropic, Google, and a cohort of international challengers. ChatGPT held its lead in consumer chat through two years of genuine competition, and it is now losing share to Gemini, driven by Google's distribution advantages in Android and Search, not by a capability edge. Anthropic, widely regarded as running the most capable model at the moment, built its revenue base in enterprise and coding rather than consumer chat, suggesting that model quality alone does not translate to user acquisition even in the flagship application.

For investors, the framework that emerges is a simple filter before it is a 2x2. Ask whether a company's value proposition depends on correctness that can only be verified inside private data, and whether that private environment requires access that takes years and institutional trust to obtain. Companies that satisfy both conditions are competing in what Guo calls the untrainable corner: territory where a smarter model is irrelevant because the bottleneck is permission, not intelligence. That corner is smaller than the broader AI application market, it is harder to enter, and the value it accumulates does not move when the next benchmark drops.
The most cited benchmark score of any given week is, as Guo puts it, a map of territory about to become worthless.
AI Coding Agents Write 180% More Code But Ship Only 30% More Software