State-of-the-art LLMs can now solve a majority of scoped coding problems, from function implementation to file-level refactoring. But there’s still an unquantified gap between that coding capability and the ability to manage software engineering projects fully autonomously. Real-world software engineering is a long-horizon activity that requires planning, persistent state management, and recovery from failure. Even for an API such as Stripe’s, which is built for ease of use, shipping an integration end to end involves plenty of cross-domain “glue” work: handling new APIs, testing frontends, and migrating databases.

We wanted to answer this question: can agents autonomously build complete Stripe integrations? For businesses running on Stripe, a mostly correct integration is a failure; payments require 100% accuracy. What matters is not just an agent’s ability to generate code, but its capacity to verify, test, and validate that code with the rigor of a human engineer.

To evaluate this, we set out to answer a few related questions:
How well do models understand the Stripe API?
Can agents author correct code across the backend and frontend components of a Stripe integration?






