At Spotify, data problems used to follow a specific pattern. You'd look for the relevant dashboard and find there wasn't one. You'd message the corresponding data expert on Slack, then wait until they had time to help. With thousands of teams moving fast, the demand for data insights had quietly outpaced what any individual expert could handle alone.

To solve this problem, we started developing an AI data assistant. But with over 70,000 datasets at Spotify, amounting to petabytes of data, no single individual can claim knowledge of everything, and simply putting all the schemas into an LLM doesn’t work at this scale.
Figure 1: Spotify's data platform processes 1.4 trillion data points daily across 70,000+ datasets.
For one, context windows are limited. Even a million tokens is not enough to hold a whole data warehouse. Second, schemas do not convey all the information. A column typed INT64 says nothing about the fact that values below 100 are legacy test data, how those rows differ from actual data in their definitions, or what is meant by “active user.” Hand a model that many tables to choose from, and it will confidently select the wrong one.

We needed something in between: a layer that captures what actually matters about a slice of the warehouse, owned by the people who own and understand the domain.
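To put the context-window limit in numbers, here is a rough back-of-envelope sketch. The dataset count and the million-token window come from the text above; the average schema size is a purely illustrative assumption, not a measured Spotify figure.

```python
# Back-of-envelope check: could every schema fit into a single prompt?

DATASETS = 70_000           # dataset count quoted in the article
CONTEXT_WINDOW = 1_000_000  # tokens; the "million tokens" case from the article
TOKENS_PER_SCHEMA = 500     # ASSUMPTION: illustrative average, not a real measurement


def token_budget_per_dataset(context_window: int, datasets: int) -> float:
    """Tokens available per dataset if the window were split evenly."""
    return context_window / datasets


def windows_needed(datasets: int, tokens_per_schema: int, context_window: int) -> int:
    """Number of full context windows required to hold every schema."""
    total_tokens = datasets * tokens_per_schema
    return -(-total_tokens // context_window)  # ceiling division


if __name__ == "__main__":
    budget = token_budget_per_dataset(CONTEXT_WINDOW, DATASETS)
    needed = windows_needed(DATASETS, TOKENS_PER_SCHEMA, CONTEXT_WINDOW)
    print(f"Even split: about {budget:.1f} tokens per dataset")
    print(f"Windows needed at {TOKENS_PER_SCHEMA} tokens per schema: {needed}")
```

Splitting a million tokens evenly across 70,000 datasets leaves roughly 14 tokens each, which is barely enough for a table name, let alone column descriptions. Changing the assumed schema size shifts the second number, but not the conclusion.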












