Semantic caching the VLM step in our product-photo pipeline

TL;DR: We put Bifrost in front of the VLM step that captions and rewrites prompts for our product-photo diffusion pipeline. Semantic caching cut that bill by ~62% in three weeks. The diffusion side, where the GPUs live, was never the cost we should have been worrying about.

The bill that surprised us

Our pipeline at Photoroom (described in paraphrase; the figures here are not exact internal numbers) does three things per product image. A vision-language model reads the input and produces structured captions. A second LLM call rewrites the user's prompt into a form the diffusion model handles well. Then SDXL with our internal LoRAs does the actual generation on our own A100s.

The diffusion step is what we obsess over: it is the part we benchmark and profile every sprint. So when we looked at the Q1 numbers, the surprise was that Claude and Gemini Vision together cost more than the GPU lease for the same workload. The VLM and prompt-rewrite layer was 58% of total inference spend.

The cause was mundane: we had been calling the providers directly from a Python service with no caching. Same product image, same user request, and we paid for the response again.
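To make the idea concrete, here is a minimal sketch of what a semantic cache in front of that call can look like. It is not Bifrost's implementation or API, and it is not our production code. The names `SemanticCache`, `get_or_call`, `toy_embed`, and the 0.95 threshold are all assumptions made for illustration, and `toy_embed` is a bag-of-words stand-in for a real embedding model.

```python
import hashlib
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


class SemanticCache:
    """Reuse a provider response when the image is byte-identical and the
    request text is close enough in embedding space."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold  # assumed value; tune on real traffic
        self.entries: List[Tuple[str, Vector, str]] = []
        self.hits = 0
        self.misses = 0

    def get_or_call(
        self,
        image_bytes: bytes,
        request: str,
        call_provider: Callable[[bytes, str], str],
    ) -> str:
        image_key = hashlib.sha256(image_bytes).hexdigest()
        query = self.embed(request)

        # Linear scan for the best match on the same image.
        best_response, best_score = None, self.threshold
        for key, vec, response in self.entries:
            if key != image_key:
                continue
            score = cosine(query, vec)
            if score >= best_score:
                best_response, best_score = response, score

        if best_response is not None:
            self.hits += 1
            return best_response

        # Cache miss: pay for the provider call once and remember the result.
        self.misses += 1
        response = call_provider(image_bytes, request)
        self.entries.append((image_key, query, response))
        return response
```

The sketch keys on an exact hash of the image and applies similarity only to the request text, which is the conservative choice: a reworded request for the same photo can hit the cache, but a different photo never can. A linear scan is fine for illustration; a real deployment would use a vector index and an eviction policy.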
