Tracing Discord's Elixir Systems (Without Melting Everything)

Engineering & Developers | Nick Krichevsky | March 4, 2026

At Discord, we want the experience of chatting with your friends, reacting to a message, or posting artisanal farm-to-channel memes to feel instantaneous. We’ve managed to achieve this at scale by leveraging Elixir’s powerful concurrency mechanisms to run each Discord server (which we call a “guild” internally) fully independently from one another. Sometimes though, things go wrong, and a guild can’t keep up with its user activity. When this happens, the guild will feel laggy or possibly experience a complete outage. If the system degrades beyond the point it can self-heal, an on-call engineer has to intervene. Afterwards, they turn to our observability tools to understand the cause and how to stop it from recurring.

Our on-call engineer’s investigation begins by looking at metrics and logs. We have a wide array of instrumentation, including measurements of how frequently we process each user action type and how long processing takes. These often provide useful hints about bursty activity, like a flurry of hype and reactions on that sweet new game that just got shadow-dropped, but even if we find an inciting event, it’s tricky to gauge what the experience was for users. Think of it like your car’s dashboard: it can tell you what the engine temperature is, but not the consequences of it running hot.

If that doesn’t yield results, the on-call engineer turns to our custom-built tool called “guild timings.” Every time a guild processes an action, it records how much of the current minute has been spent on each action type to an in-memory store. This data is much more detailed than our metrics, but it’s emitted at such a high volume that we can’t feasibly store it all. As such, this data is rotated frequently for all but our largest guilds.
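To make the bookkeeping behind guild timings concrete, here is a minimal sketch in Python. It is not Discord's implementation: the `GuildTimings` class, its retention window, and the action type names used with it are all hypothetical, chosen only to illustrate per-minute accounting in an in-memory store that rotates old data out.

```python
import time
from collections import defaultdict


class GuildTimings:
    """Hypothetical in-memory store: seconds spent per action type, per minute."""

    def __init__(self, max_minutes=5, clock=time.time):
        self._max_minutes = max_minutes  # assumed retention window, in minutes
        self._clock = clock              # injectable so the sketch is testable
        self._buckets = {}               # minute index -> {action_type: seconds}

    def record(self, action_type, seconds_spent):
        """Add processing time for one action to the current minute's bucket."""
        minute = int(self._clock() // 60)
        bucket = self._buckets.setdefault(minute, defaultdict(float))
        bucket[action_type] += seconds_spent
        self._rotate(minute)

    def _rotate(self, current_minute):
        """Drop buckets that have fallen out of the retention window."""
        cutoff = current_minute - self._max_minutes + 1
        for minute in [m for m in self._buckets if m < cutoff]:
            del self._buckets[minute]

    def share_of_minute(self, minute, action_type):
        """Fraction of the given minute that was spent on one action type."""
        return self._buckets.get(minute, {}).get(action_type, 0.0) / 60.0
```

Once a minute's bucket has been rotated out, `share_of_minute` returns 0.0 for it, which mirrors the problem of data disappearing before anyone can look at it.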
Even if we retrieve the data in time, it still won’t give us a good picture of the end-to-end experience, as it doesn’t capture downstream effects.

Other teams at Discord have derived enormous value from utilizing distributed tracing (a.k.a. Application Performance Monitoring), which allows them to see how long the constituent parts of an operation took. Adding tracing to our Elixir stack took a bit of work, though. Most tracing tools work by passing information about the operation via metadata layers like HTTP headers, but Elixir’s built-in communication tools don’t have an equivalent layer out of the box. So… we had to build our own. Despite the fact that we were changing how our services communicate with one another, we managed to integrate it without downtime.

Setting the Stage: Elixir at Discord, and Tracing

Elixir, and How it Powers Discord

Whenever you do something on Discord, your action turns into a “message” in our Elixir stack, which is then forwarded to connected clients. Elixir programs consist of lightweight processes (scheduled by the runtime, not the OS) that communicate via message passing, making concurrent programming a breeze. This model allows us to trivially distribute our programs across many nodes, too.

Processes are the building blocks of Discord’s architecture. Every guild runs as an Elixir process that uses message passing to fan out actions to all connected “sessions”; each session is itself an Elixir process that forwards actions to clients. Therefore, when we talk about a user action flowing through the Elixir stack, we’re actually talking about the act of messages being passed between processes. To capture an end-to-end performance story, we needed a solution that allows us to follow a particular message’s path throughout the system, which is precisely what tracing provides us.
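One generic way to follow a message across process boundaries, when the transport has no header layer, is to wrap each message in an envelope that carries trace metadata alongside the payload. The sketch below illustrates that idea in Python, with queues standing in for process mailboxes. It is an assumption made for illustration, not Discord's design: `TraceContext`, `Envelope`, `Session`, and `fan_out` are hypothetical names.

```python
import queue
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str        # shared by every span in one trace
    parent_span_id: str  # span that was active when the message was sent


@dataclass(frozen=True)
class Envelope:
    payload: dict
    trace: Optional[TraceContext]  # None means the sender was not tracing


class Session:
    """Stand-in for a session process: a mailbox plus the spans it recorded."""

    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()
        self.spans = []

    def handle_next(self):
        """Take one envelope and record a span linked to the sender's span."""
        envelope = self.mailbox.get_nowait()
        span = {"name": self.name + ".forward", "span_id": uuid.uuid4().hex,
                "trace_id": None, "parent_id": None}
        if envelope.trace is not None:
            span["trace_id"] = envelope.trace.trace_id
            span["parent_id"] = envelope.trace.parent_span_id
        self.spans.append(span)
        return envelope.payload


def fan_out(payload, sessions, trace_id):
    """Guild-side fan-out: attach the same trace context to every message."""
    guild_span_id = uuid.uuid4().hex
    context = TraceContext(trace_id=trace_id, parent_span_id=guild_span_id)
    for session in sessions:
        session.mailbox.put(Envelope(payload=payload, trace=context))
    return guild_span_id
```

Because every receiver copies the `trace_id` and parent span ID out of the envelope, the spans recorded by separate sessions can later be stitched into a single tree rooted at the sender's span.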
A Gentle Introduction to Tracing

When a user sends a Discord message, their client sends it over HTTP to our API service, which records it in the database and sends it to the Elixir stack via gRPC. As the API service processes the message, its tracing library times each step and records the measurements in a way that lets us visualize how each step affected overall execution. Each region of code timed by a tracing library is called a span. Every time a span starts, the library links it to the currently active one (if any), building a tree of nested spans. This tree forms a trace, which we can use to see a timeline of events during execution.

Below is part of a trace from someone sending a message on Discord. As discord_api processed the message, it recorded a span called message_common.dispatch_message that took 1.69 ms. It then sent the message to discord-guilds, which spent 357 μs fanning it out to discord-sessions. From there, each session recorded its own span while forwarding the message. All these spans linked together form our complete trace!

Figure: A trace from Discord’s Python API, with the spans generated from broadcasting (“dispatching”) a user’s message. Note: discord-guilds represents the actual guilds service, while discord_guilds is an RPC client in the API service.

While the final structure can look complex, spans are quite simple to record in code. If you’ve ever written code like this, you’ve effectively created spans!

def dispatch_message():
