TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.
Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory.
The cloud bill was not lovely.
They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours, and the GPU sat idle for about 70% of the run because most of the time was spent on cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.
I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.













