Quick answer: Twitch has no public API for VOD chat replay. To build a dataset for a Twitch toxicity classifier, you page through the internal VideoCommentsByOffsetOrCursor GraphQL query at scale, the same one the web player uses. The Devil Scrapes Twitch VOD Chat Archive Actor does that for $0.001 per message (~$1.05 per 1,000) and returns the structured fields (message_fragments, badges, is_subscriber) that make classifier features useful.
If you maintain a mod-bot (StreamElements, Nightbot, Streamlabs, or custom), or you are an ML engineer building a Twitch-native toxicity model, your training data problem is the same: you need chat messages from real VODs that can be labeled at scale, with enough context per row to build signal-rich features. This post walks through the full pipeline: pulling the data, loading it into pandas, training a baseline TF-IDF + logistic-regression classifier, and sketching the upgrade path to a transformer.
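To make "enough context per row" concrete, here is a minimal Python sketch that flattens one chat row into classifier features. The field names message_fragments, badges, and is_subscriber are the ones named in the quick answer; the shape of each field (fragments as dicts with a "text" key and an optional "emoticon" key, badges as dicts with an "id" key) is an assumption made for illustration, so check it against a real export before relying on it.

```python
from typing import Any


def row_to_features(row: dict[str, Any]) -> dict[str, Any]:
    """Flatten one chat row into a flat feature dict.

    Assumed (hypothetical) row shape:
      message_fragments: list of {"text": str, "emoticon": dict | None}
      badges:            list of {"id": str}
      is_subscriber:     bool
    """
    fragments = row.get("message_fragments") or []
    badges = row.get("badges") or []

    # Keep only plain-text fragments so emote codes do not pollute the text.
    text = "".join(
        f.get("text", "") for f in fragments if not f.get("emoticon")
    ).strip()
    emote_count = sum(1 for f in fragments if f.get("emoticon"))

    # Share of alphabetic characters that are uppercase (a crude "shouting" signal).
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    badge_ids = {b.get("id") for b in badges}
    return {
        "text": text,
        "emote_count": emote_count,
        "caps_ratio": caps_ratio,
        "is_subscriber": bool(row.get("is_subscriber", False)),
        "is_moderator": "moderator" in badge_ids,
    }


# Invented example row, not real chat data.
example_row = {
    "message_fragments": [
        {"text": "YOU ARE bad ", "emoticon": None},
        {"text": "Kappa", "emoticon": {"id": "25"}},
    ],
    "badges": [{"id": "moderator"}],
    "is_subscriber": False,
}
features = row_to_features(example_row)
```

Separating text fragments from emote fragments keeps the text column clean for TF-IDF while still letting emote usage and badge status enter the model as their own features.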
Does Twitch have an API for chat training data? 🔎
Not in any useful sense. Twitch's official APIs expose live chat (over IRC and EventSub, plus the Helix chat endpoints), but none of them has an endpoint for VOD chat replay, the historical timestamped record of a past broadcast. That data exists (you can watch it in the VOD player), but the only programmatic surface for it is the internal VideoCommentsByOffsetOrCursor persisted GraphQL query.
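For orientation, a persisted GraphQL query is usually sent as an operation name plus a hash instead of the full query text. The sketch below builds such a request body in Python using the common Apollo-style persisted-query envelope. Only the operation name comes from this post; the variable names (videoID, contentOffsetSeconds, cursor) and the hash placeholder are assumptions for illustration, not documented Twitch values, and the endpoint is internal and can change without notice.

```python
import json
from typing import Any, Optional

OPERATION_NAME = "VideoCommentsByOffsetOrCursor"
# Placeholder only: the real hash is not documented and is not reproduced here.
PLACEHOLDER_HASH = "<sha256-of-persisted-query>"


def build_payload(
    video_id: str,
    offset_seconds: int = 0,
    cursor: Optional[str] = None,
    sha256_hash: str = PLACEHOLDER_HASH,
) -> dict[str, Any]:
    """Build one request body: start from an offset, then follow cursors.

    Variable names are hypothetical; verify them against the web player's
    own requests before use.
    """
    variables: dict[str, Any] = {"videoID": video_id}
    if cursor is not None:
        # Later pages: continue from the cursor returned by the previous page.
        variables["cursor"] = cursor
    else:
        # First page: start at a time offset into the VOD.
        variables["contentOffsetSeconds"] = offset_seconds
    return {
        "operationName": OPERATION_NAME,
        "variables": variables,
        "extensions": {
            "persistedQuery": {"version": 1, "sha256Hash": sha256_hash}
        },
    }


first_page = build_payload("123456789")
next_page = build_payload("123456789", cursor="opaque-cursor-from-response")
body = json.dumps(first_page)  # what would be sent as the POST body
```

The offset-or-cursor split mirrors the operation's name: one request seeds the walk at a timestamp, and every later request passes back the cursor from the previous response until no further page is returned.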