There are benchmarks for the code an LLM writes: HumanEval, MBPP, SWE-bench, LiveCodeBench. There are no benchmarks for the specifications an LLM writes. The upstream half of agentic software delivery has been flying blind, and the spec is what your downstream coding agent has to interpret.

I went looking for one and didn't find one. So I'm proposing one, and to demonstrate it I gave thirteen LLMs the same real codebase (Excalidraw) and asked each of them to produce a specification tree. Six of those thirteen ran locally on a laptop, via LM Studio and Ollama, and one of them landed within 12% of the frontier-cloud baseline. Then I had Claude Opus walk through every other model's output and judge it.

The numbers surprised me. So did how well the local half held up.

The metric: driftless implementability

A spec compiles to nothing. It is reviewed by the customer, the PM, the QA lead — not by a compiler. A bad function fails its test. A bad spec fails a meeting.