Stop paying for idle GPUs in your CI: batching LLM eval jobs

TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.

Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory.

The cloud bill was not lovely.

They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours, and the GPU sat idle for about 70% of the run because most of the time was spent on cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.
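
To put a rough number on that idle time, here's a back-of-the-envelope sketch. The hourly rate is a made-up placeholder (swap in your own instance pricing), and the function name is mine, not something from the team's tooling. The only figure taken from the story above is the roughly 70% idle fraction.

```python
def effective_cost_per_busy_hour(hourly_rate: float, idle_fraction: float) -> float:
    """Cost of one hour of actual GPU work when part of each billed hour is idle."""
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    # Only (1 - idle_fraction) of each billed hour does useful work.
    return hourly_rate / (1.0 - idle_fraction)


# Hypothetical rate, for illustration only; not a real price quote.
HOURLY_RATE = 1.00
# The roughly 70% idle time described above.
IDLE_FRACTION = 0.70

cost = effective_cost_per_busy_hour(HOURLY_RATE, IDLE_FRACTION)
multiplier = cost / HOURLY_RATE
print(f"Paying {multiplier:.2f}x the list rate per hour of useful GPU work")
```

So at 70% idle you're effectively paying a bit over three times the sticker price for every hour the GPU is actually doing inference, whatever your real rate is.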

I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.
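
As a taste of the windowed idea from the TL;DR, here's a minimal sketch of grouping eval jobs into fixed time windows so that one GPU run (and one model load) serves several PRs, with smoke tests ordered ahead of full regression runs inside each window. Everything named here (`EvalJob`, `group_into_windows`, the 600-second window) is an assumption for illustration, not the queue we actually run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalJob:
    pr: int              # PR number that requested the eval
    kind: str            # "smoke" or "full" (hypothetical labels)
    submitted_at: float  # seconds since an arbitrary start time


def group_into_windows(jobs, window_seconds):
    """Bucket jobs by fixed time window; smoke jobs come first within a bucket."""
    windows = defaultdict(list)
    for job in jobs:
        windows[int(job.submitted_at // window_seconds)].append(job)
    batches = []
    for index in sorted(windows):
        # False sorts before True, so smoke jobs lead; ties break on submit time.
        batch = sorted(
            windows[index],
            key=lambda j: (j.kind != "smoke", j.submitted_at),
        )
        batches.append(batch)
    return batches


jobs = [
    EvalJob(pr=1, kind="full", submitted_at=10),
    EvalJob(pr=2, kind="smoke", submitted_at=50),
    EvalJob(pr=3, kind="smoke", submitted_at=700),
]
batches = group_into_windows(jobs, window_seconds=600)
for batch in batches:
    print([(job.pr, job.kind) for job in batch])
```

The first two jobs land in the same window and would share a single GPU run, with the smoke test for PR 2 going ahead of the full run for PR 1. The obvious cost is latency: a job submitted early in a window waits for the window to close.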
