LLM Gateways: Routing, Fallbacks, And Semantic Caching

Friday, 19 June 2026

2,360 words, ~11 min read

Here's a line of code that's quietly running in production at a surprising number of companies:

const response = await openai.chat.completions.create({ model: "gpt-4o", messages });

It looks harmless. It's also why your AI bill is whatever it is this month, why your app goes down the moment OpenAI has a bad afternoon, and why the same question typed by ten thousand users costs you ten thousand inference calls. That one line hardcodes a vendor, a model, a pricing tier, and a single point of failure all at once.
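To make the problem concrete, here is a minimal sketch of the indirection a gateway puts in front of that line, using plain TypeScript only. The call site talks to a gateway object instead of a vendor SDK. The gateway tries an ordered list of providers and moves on when one fails, and it memoizes answers so a repeated question is served without another inference call. Every name here (`Message`, `Provider`, `Gateway`, `servedBy`) is hypothetical, and the providers in the demo are stubs rather than real SDK clients. The cache is an in-memory exact-match cache keyed on whitespace-normalized messages. It is not semantic caching, which would also match paraphrased questions by embedding similarity.

```typescript
// Hypothetical gateway sketch: names and shapes are illustrative, not a real library API.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Anything that can turn messages into text; in practice, a thin wrapper around one vendor SDK.
interface Provider {
  name: string;
  model: string;
  complete(messages: Message[]): Promise<string>;
}

class Gateway {
  // In-memory exact-match cache: no eviction, no TTL, not shared across processes.
  private readonly cache = new Map<string, string>();

  constructor(private readonly providers: Provider[]) {
    if (providers.length === 0) throw new Error("Gateway needs at least one provider");
  }

  // Cache key: roles plus trimmed, whitespace-collapsed content, so differences in
  // spacing hit the same entry. Paraphrased questions will still miss.
  private cacheKey(messages: Message[]): string {
    return JSON.stringify(messages.map((m) => [m.role, m.content.trim().replace(/\s+/g, " ")]));
  }

  async complete(messages: Message[]): Promise<{ text: string; servedBy: string }> {
    const key = this.cacheKey(messages);
    const cached = this.cache.get(key);
    if (cached !== undefined) return { text: cached, servedBy: "cache" };

    const errors: string[] = [];
    // Fallback: try providers in priority order, moving on after any error.
    for (const provider of this.providers) {
      try {
        const text = await provider.complete(messages);
        this.cache.set(key, text);
        return { text, servedBy: `${provider.name}/${provider.model}` };
      } catch (err) {
        errors.push(`${provider.name}: ${err instanceof Error ? err.message : String(err)}`);
      }
    }
    throw new Error(`All providers failed (${errors.join("; ")})`);
  }
}

// Usage with stub providers standing in for real SDK calls (hypothetical).
const gateway = new Gateway([
  { name: "primary", model: "model-a", complete: async () => { throw new Error("503 Service Unavailable"); } },
  { name: "backup", model: "model-b", complete: async (m) => `answer to: ${m[m.length - 1]?.content ?? ""}` },
]);

void (async () => {
  const question: Message[] = [{ role: "user", content: "What is an LLM gateway?" }];
  console.log((await gateway.complete(question)).servedBy); // backup/model-b (primary threw)
  console.log((await gateway.complete(question)).servedBy); // cache
})();
```

In a real deployment each `complete` would wrap a vendor SDK call, and the cache would need eviction, a TTL, and probably a shared store. The sketch only shows where those decisions live once the vendor and model are no longer hardcoded at the call site.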
