Photo by Sean Gallup/Getty Images
The era of free AI training data is over. Reddit charges millions for API access. The New York Times sued. Publishers are blocking scrapers. Even if AI companies could still vacuum up the public internet, they're running into a bigger problem: they need different kinds of data entirely for the next leap in abilities.
Large language models were built by scraping text and images from the web. But as AI systems move beyond chatbots, they need training data that was never publicly available in the first place. Data that's locked away, or scattered, or doesn't even exist yet.
New markets are emerging to unlock these sources. Here are three.
Most people think of personal data as Social Security numbers and health records. But nearly everything you do online generates data that platforms collect and use: your Spotify listening history, your email patterns, the documents you write in Google Docs, your conversations with ChatGPT.











