I Gave 13 LLMs the Same Codebase and Asked for a Specification. Six Ran on My Laptop.

There are benchmarks for the code an LLM writes: HumanEval, MBPP, SWE-Bench, LiveCodeBench. There are no benchmarks for the specifications an LLM writes. The upstream half of agentic software delivery has been flying blind, and the spec is what your downstream coding agent has to interpret.

I went looking for one and found none. So I am proposing one, and to demonstrate it I gave thirteen LLMs the same real codebase (excalidraw) and asked each of them to produce a specification tree. Six of those thirteen ran locally on a laptop, via LM Studio and Ollama, and one of them landed within 12% of the frontier-cloud baseline. Then I had Claude Opus walk through every other model's output and judge it.

The numbers surprised me. So did how well the local half held up.

The metric: driftless implementability

A spec compiles to nothing. It is reviewed by the customer, the PM, the QA lead — not by a compiler. A bad function fails its test. A bad spec fails a meeting.

Related reading

We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.

The Best LLMs for Agentic Coding in 2026 (Real-World, Not Just Benchmarks)

Are LLMs Truly Solving Software Problems — or Are Agents Doing It?

An open source LLM eval tool with two independent quality signals

The Best Open Source LLMs for Coding Right Now (June 2026)

Making LLM Calls Reliable: Retry, Semaphore, Cache, and Batch
