You've been asked to stand up a brand-new database cluster: a full replica of production, running real traffic, so you can validate a new release before it touches actual data. You're looking at the next day and a half, and it’s lookin’ stacked: provisioning and configuring dozens of nodes, joining them to the cluster one at a time, validating replication, wiring up dual-write pipelines, and babysitting the whole thing because any mistake on the ninth step means starting the whole thing over from scratch. While grinding through the whole process, you start to daydream: what if this whole ordeal took less than two hours?

We found ourselves in exactly this situation. This is the story of how we got ourselves into this mess and how we made our way out of it.

The Perl Script of Reckoning

The Persistence Infrastructure team at Discord manages all kinds of database clusters, including Elasticsearch, Postgres, and ScyllaDB. Each of these databases has its own challenges, and we operate each at a pretty large scale, so there’s a lot on the plate of our 7-person team! ScyllaDB is the distributed database that stores messages, channels, servers, and most of Discord’s user data, so naturally it’s our service with the largest scope: dozens of clusters, with hundreds of database nodes in total.

That ratio of engineers to database scale sounds somewhat manageable until you consider what managing all that infrastructure actually looks like: it’s rolling restarts after config changes and expanding clusters as traffic grows. It’s upgrading operating systems across hundreds of nodes without taking anything offline, and standing up entirely new clusters to validate new ScyllaDB releases before they touch production.
None of these are fire-and-forget when you have siloed tools: they demand careful sequencing, validation, and sustained attention throughout.

For years, we automated these operations the way many teams do when traffic scales dramatically: incrementally, under pressure, and without a long-term strategy for where the tooling was headed. A Python script here, a bash script there… Our tools got the job done, but they were fragile and required significant institutional knowledge to operate safely. These scripts might’ve been considered our toolset’s final form if the operational demand had stayed constant. Unfortunately (but also fortunately), it did not, so we decided to build something more principled: the Scylla Control Plane, or SCP!

Shadow Clusters: the Final Boss

ScyllaDB upgrades are high-stakes. At Discord's scale, we regularly encounter edge cases that simply don't appear in smaller deployments. They’re the kind of bugs that only surface under the kind of load we run, and sometimes they only show up once every node in the cluster has been upgraded.

As we’ve operated these clusters over the years, our data layer (in particular, our data services as mentioned in a past engineering blog) has unlocked all kinds of powerful tooling. One such tool is the shadow cluster: a short-lived, full replica cluster that receives the same reads and writes as our production cluster. If the shadow cluster misbehaves under real load, we catch it before it touches production data. This setup has been so valuable in catching issues that we consider it standard practice before changing anything about our production cluster that may have big implications (OS, hardware, Scylla version, etc.).

Establishing a new shadow cluster manually is labor intensive, involving provisioning nodes, configuring them, joining them to the cluster, validating replication, establishing dual-write pipelines, and eventually tearing everything down.
Repeat all that work for every Scylla cluster we run, and the complexity really starts to compound.

Since we were aiming to upgrade our Scylla version in a safe manner, we badly needed automation that actually worked across all our clusters, so we set out to redesign SCP with all our prior pain and experience behind us.

Lessons from the Wreckage

Before writing a single line of SCP, we aligned on what had gone wrong and what we actually needed.

The old scripts failed in three major ways:

- They were unsafe: they were easy to run in the wrong order or against the wrong nodes, and they had no precondition checks.
- They were unrecoverable: any failure between steps 7 and 12 meant starting over.
- They were hard to extend: adding a new operation often meant copying and modifying an existing script rather than composing existing pieces.

For SCP, we had four goals:

- An extensible task framework: Adding a new operation should be straightforward. Define the task's inputs, implement its logic, and it should work everywhere the framework works. New authors shouldn't need to understand the orchestration internals.
- Configurable parallelism: Some operations are safe to run on multiple nodes simultaneously, while others aren't. The framework should make it easy to express constraints like "never run this on nodes in different availability zones at the same time."
- Safety by default: Tasks should declare their preconditions. Transient failures should be retried automatically. State should be persisted so that an interrupted job can be resumed without redoing completed work.
- Incremental delivery: Ship something usable, run it on real clusters, and adjust based on what we learn.

Turns out, the last goal was the key to getting this off the ground. A framework that no one uses because it's too complex to onboard is worthless!
Building SCP incrementally let us catch usability problems early while they were still easy (and cheap) to fix, pushing us to keep investing in the tool instead of trying to build something huge and complicated right from the get-go.

How SCP Works

SCP is built around a few layered concepts: tasks, workflows, and jobs.

Tasks

A task is a single unit of work, such as "drain this node," "check the repair status," or "run a cleanup." Tasks come in two flavors: node tasks operate on a single node, while cluster tasks coordinate across an entire cluster (which includes running individual node tasks across many nodes in the cluster).

Between tasks, we often need to wait for the cluster to reach a desired state before it's OK to proceed. So, we establish conditions: a special type of task that blocks execution until a criterion is satisfied. A condition verifies whether it’s safe to proceed by polling Scylla's API or Prometheus metrics until either the check passes or it times out and surfaces an error.

After restarting a Scylla node, for example, you often need to wait for compactions to settle before considering the node back in a normal state. If you move too quickly, you risk cascading pressure across the entire cluster. Without an explicit condition check, you'd either hardcode a sleep (too short and you cause problems; too long and a rolling restart across 30 nodes takes all day) or accept that your operation might fail unpredictably. Conditions make the wait explicit, observable, and tunable.

In Rust, tasks are defined using a trait that requires three things:

- A name() method, describing what the task is doing.
- A preconditions() method that lists conditions that must be true before the task runs.
- An execute() method that does all the work.

struct Drain;