Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant

At Spotify, data problems used to follow a specific pattern. You'd look for the relevant dashboard, and there wasn't one. You'd message the relevant data expert on Slack and wait until they had time to help. But with thousands of teams moving fast, the demand for data insights had quietly outpaced what any individual expert could handle alone.

To solve this problem, we started developing an AI data assistant. But with over 70,000 datasets at Spotify, amounting to petabytes of data, no single individual can claim knowledge of everything, and simply putting every schema into an LLM doesn’t work at this scale.

Figure 1: Spotify's data platform processes 1.4 trillion data points daily across 70,000+ datasets.

For one, context windows are limited. Even a million tokens is not enough to hold a whole data warehouse. Secondly, schemas do not convey all the information. A column typed INT64 says nothing about values below 100 being legacy test data, about how those rows differ from real data in their definitions, or about what “active user” means. Give a model that many tables, and it will confidently select the wrong one.

We needed something in between: a layer that captures what actually matters about a slice of the warehouse, owned by the people who understand the domain.
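To make both points concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Spotify's implementation: the 300-tokens-per-schema figure is a guess chosen only for the arithmetic, and `DomainContext`, `ColumnNote`, the `user_id` column, and the dataset and team names are all hypothetical. The first part checks whether every schema could fit in a million-token window; the second shows one possible shape for a context-layer entry that carries the knowledge a bare schema leaves out.

```python
from dataclasses import dataclass, field

DATASET_COUNT = 70_000             # figure quoted in the article
CONTEXT_WINDOW_TOKENS = 1_000_000  # "a million tokens"
ASSUMED_TOKENS_PER_SCHEMA = 300    # assumption for illustration; real schemas vary


def schemas_fit_in_window(dataset_count: int, tokens_per_schema: int,
                          window_tokens: int) -> bool:
    """Return True if all schemas would fit into the context window at once."""
    return dataset_count * tokens_per_schema <= window_tokens


@dataclass
class ColumnNote:
    """Knowledge about a column that its type alone does not carry."""
    column: str
    dtype: str
    note: str


@dataclass
class DomainContext:
    """Hypothetical context-layer entry for one slice of the warehouse."""
    domain: str
    owner_team: str
    datasets: list[str]
    definitions: dict[str, str] = field(default_factory=dict)
    column_notes: list[ColumnNote] = field(default_factory=list)

    def render(self) -> str:
        """Render the entry as plain text that could be placed in a prompt."""
        lines = [f"Domain: {self.domain} (owner: {self.owner_team})"]
        lines.append("Datasets: " + ", ".join(self.datasets))
        for term, meaning in self.definitions.items():
            lines.append(f"Definition of {term!r}: {meaning}")
        for n in self.column_notes:
            lines.append(f"Column {n.column} ({n.dtype}): {n.note}")
        return "\n".join(lines)


# All names below are placeholders, not real Spotify datasets or teams.
ctx = DomainContext(
    domain="user_engagement",
    owner_team="engagement-data",
    datasets=["daily_user_activity"],
    definitions={"active user": "<definition supplied by the owning team>"},
    column_notes=[
        ColumnNote("user_id", "INT64", "values below 100 are legacy test data"),
    ],
)

fits = schemas_fit_in_window(
    DATASET_COUNT, ASSUMED_TOKENS_PER_SCHEMA, CONTEXT_WINDOW_TOKENS
)
print("All schemas fit in one window:", fits)
print(ctx.render())
```

Under the assumed schema size, the full set of schemas would need about 21 million tokens, so it does not fit. A rendered entry for a single domain, by contrast, stays small and carries exactly the caveats and definitions that the owning team knows and the schema omits.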

Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant | Spotify Engineering
