Posted August 14, 2023 by zeke

You know when you’re using ChatGPT or Vercel’s AI playground and it returns an animated response, rendered word by word? That’s not just a dramatic visual effect to make it look like there’s a robot typing on the other side of the conversation. That’s actually the language model generating tokens one at a time, and streaming them back to you while it’s running.

Replicate already provides ways for you to receive incremental updates as your predictions are running, through polling and webhooks. But those aren’t always the most efficient ways to get updates from a running model: polling means asking the API for status over and over, and webhooks need a publicly reachable endpoint to receive each update. When you’re building something like a chat app, what you really need is a live-updating event stream.

Replicate’s API now supports server-sent event streams for language models. This lets you update your app live, as the model is running. In this post we’ll show you how to consume streaming responses from language models on Replicate.
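To make "server-sent events" a little more concrete, here is a minimal sketch of how a client might turn the raw lines of an event stream into events. It follows the general SSE text format (`event:` and `data:` fields, with a blank line ending each event) in simplified form. The event names `output` and `done` and the sample payloads are made up for illustration, not taken from Replicate’s API, and `parse_sse` is a hypothetical helper rather than part of any client library.

```python
from typing import Iterable, Iterator, NamedTuple


class ServerSentEvent(NamedTuple):
    event: str
    data: str


def parse_sse(lines: Iterable[str]) -> Iterator[ServerSentEvent]:
    """Group raw SSE lines into events; a blank line ends each event.

    Simplified: ignores the `id` and `retry` fields of the SSE format.
    """
    event_type = "message"  # default event type in the SSE format
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch whatever has been collected so far.
            if data_parts:
                yield ServerSentEvent(event_type, "\n".join(data_parts))
            event_type, data_parts = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # only one leading space is stripped
        if field == "event":
            event_type = value
        elif field == "data":
            data_parts.append(value)


# Hypothetical stream contents, invented for this example.
sample_stream = [
    "event: output\n",
    "data: Hello\n",
    "\n",
    ": keep-alive\n",
    "event: output\n",
    "data:  world\n",
    "\n",
    "event: done\n",
    "data: {}\n",
    "\n",
]

# Concatenate the text carried by the (hypothetical) "output" events.
text = "".join(e.data for e in parse_sse(sample_stream) if e.event == "output")
print(text)
```

In a real app the lines would come from an open HTTP response instead of a list, and each event would be appended to the UI as it arrives rather than joined at the end.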

How streaming works