Authors: John Martin and Adam Bellemare
Shopify’s data warehouse has gone through many iterations since the company's founding in 2004. Since then, the data warehouse has evolved and grown into a data lake, compromised of multiple storage mechanisms, systems, and consumers. Along the way, Shopify has moved to the cloud and adopted open-source big data tools such as Apache Spark, Presto, and dbt.At its conception, Shopify’s data warehouse was implemented specifically for internal analytics. As it has grown, the platform has become the e-commerce backbone for many of our one million+ merchants. Many of these merchants require insight and access to the vast amount of data generated by their business on the Shopify platform.Problems emerged as our company grew. The data lake couldn’t satisfy the ever-growing demand for new and enhanced merchant analytics. These limitations were in part due to the relatively slow query times of our OLAP databases, the limited support for serving structured datasets from our data lake, and the slow batch nature of our ingest pipelines. Our business requirements saw us needing faster ingest and semi-structured data with low-latency query response times, which led to us building the dedicated Merchant Analytics Platform. This brings us to today’s data platform architecture:















