Code review is a fantastic mechanism for catching bugs and sharing knowledge, but it is also one of the most reliable ways to bottleneck an engineering team. A merge request sits in a queue, a reviewer eventually context-switches to read the diff, they leave a handful of nitpicks about variable naming, the author responds, and the cycle repeats. Across our internal projects, the median wait time for a first review was often measured in hours.

When we first started experimenting with AI code review, we took the path that most other people probably take: we evaluated a handful of existing AI code review tools. Many of them worked well, and several offered a good amount of customisation and configurability. The recurring theme, though, was that none of them offered enough flexibility for an organisation the size of Cloudflare.

So we jumped to the next most obvious path: grab a git diff, shove it into a half-baked prompt, and ask a large language model to find bugs. The results were exactly as noisy as you might expect: a flood of vague suggestions, hallucinated syntax errors, and helpful advice to "consider adding error handling" on functions that already had it. We realised quickly that a naive summarisation approach wasn't going to give us the results we wanted, especially on complex codebases.

Instead of building a monolithic code review agent from scratch, we decided to build a CI-native orchestration system around OpenCode, an open-source coding agent. Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents. Rather than relying on one model with a massive, generic prompt, we launch up to seven specialised reviewers covering security, performance, code quality, documentation, release management, and compliance with our internal Engineering Codex. These specialists are managed by a coordinator agent that deduplicates their findings, judges the actual severity of the issues, and posts a single structured review comment.

We've been running this system internally across tens of thousands of merge requests. It approves clean code, flags real bugs with impressive accuracy, and actively blocks merges when it finds genuine, serious problems or security vulnerabilities. This is just one of the many ways we're improving our engineering resiliency as part of Code Orange: Fail Small.

This post is a deep dive into how we built it, the architecture we landed on, and the specific engineering problems you run into when you put LLMs in the critical path of your CI/CD pipeline, and, more critically, in the way of engineers trying to ship code.
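Before getting into the details, here is a minimal sketch of the overall shape of the system: specialist reviewers fanned out in parallel over a diff, with a coordinator that merges and deduplicates their findings and fails the CI job only on genuine blockers. All names, types, and helpers (`runReviewer`, `postReview`) below are illustrative assumptions, not our actual implementation.

```typescript
// Illustrative sketch only: the helpers and types here are hypothetical.

type Severity = "info" | "warning" | "blocker";

interface Finding {
  reviewer: string; // which specialist produced it
  file: string;
  line: number;
  message: string;
  severity: Severity;
}

// A subset of the specialist personas, each with its own narrow prompt.
const SPECIALISTS = [
  "security",
  "performance",
  "code-quality",
  "documentation",
  "release-management",
  "codex-compliance",
] as const;

// Hypothetical helper: run one specialist agent against the MR diff.
declare function runReviewer(name: string, diff: string): Promise<Finding[]>;
// Hypothetical helper: post one structured comment on the merge request.
declare function postReview(mrId: number, findings: Finding[]): Promise<void>;

function rank(s: Severity): number {
  return ["info", "warning", "blocker"].indexOf(s);
}

async function reviewMergeRequest(mrId: number, diff: string): Promise<void> {
  // Fan out: launch every specialist in parallel against the same diff.
  const results = await Promise.all(
    SPECIALISTS.map((name) => runReviewer(name, diff))
  );

  // Coordinator: deduplicate overlapping findings by location and message,
  // keeping the most severe version of each duplicate.
  const seen = new Map<string, Finding>();
  for (const finding of results.flat()) {
    const key = `${finding.file}:${finding.line}:${finding.message}`;
    const existing = seen.get(key);
    if (!existing || rank(finding.severity) > rank(existing.severity)) {
      seen.set(key, finding);
    }
  }

  // Post a single structured review comment, and block the merge
  // only when a genuine blocker survives severity judging.
  const findings = [...seen.values()];
  await postReview(mrId, findings);
  if (findings.some((f) => f.severity === "blocker")) {
    process.exitCode = 1; // fail the CI job so the MR cannot merge
  }
}
```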