We pre-registered, ran, and verified the macro ablation: information per joule, measured

Maker disclosure: I build Macrokit (Apache-2.0, fully open). This is the data, not a pitch — links and the raw runs at the end.

The multi-model benchmark answered one question: can off-the-shelf local models do real GitHub-maintainer work? (Yes: four of them scored 74–82.5% on a pre-registered 100-task corpus.) It didn't answer the more interesting question: why is moving the reasoning to design time the efficient move, not just a trick? So we ran a direct test, the macro ablation.

Pre-registered and frozen. We committed the whole protocol (the two conditions, the trajectory→intent decode rule, the metric, and the prediction) before running a single MACRO-OFF trial. The git timestamp on bench/MACRO_ABLATION_PREREGISTRATION.md is the audit trail, and there have been no post-hoc edits. Both conditions use the same committed 100-task corpus, the same router and tool-calling machinery, and temperature 0; the only thing that changes is the tool set:

MACRO-OFF (reason it live) — the model is given low-level primitives only and must compose the multi-step workflow itself at runtime.

MACRO-ON (the macro) — the workflow is encoded once at design time; at runtime the model only perceives intent and dispatches it in a single routing call.
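
To make the contrast concrete, here is a minimal sketch of the two tool sets. Everything in it is hypothetical and for illustration only: the primitive names, the `triage_issue` macro, and the assumption that MACRO-OFF costs one model call per primitive step are mine, not Macrokit's actual API or the benchmark harness.

```python
from typing import Callable, Dict, List

# Hypothetical low-level primitives. Names are illustrative, not Macrokit's API.
def fetch_issue(state: dict) -> None:
    state["log"].append("fetch_issue")

def add_label(state: dict) -> None:
    state["log"].append("add_label")

def post_comment(state: dict) -> None:
    state["log"].append("post_comment")

PRIMITIVES: Dict[str, Callable[[dict], None]] = {
    "fetch_issue": fetch_issue,
    "add_label": add_label,
    "post_comment": post_comment,
}

# MACRO-ON: the workflow is written down once, at design time.
MACROS: Dict[str, List[str]] = {
    "triage_issue": ["fetch_issue", "add_label", "post_comment"],
}

def run_macro_off(plan: List[str]) -> dict:
    """MACRO-OFF: the model composes the workflow itself at runtime.

    Assumption: each primitive step costs one model call.
    """
    state = {"log": [], "model_calls": 0}
    for step in plan:
        state["model_calls"] += 1  # the model decides this step live
        PRIMITIVES[step](state)
    return state

def run_macro_on(intent: str) -> dict:
    """MACRO-ON: one routing call picks the intent; the macro does the rest."""
    state = {"log": [], "model_calls": 1}  # the single routing call
    for step in MACROS[intent]:
        PRIMITIVES[step](state)  # no model involved in these steps
    return state

off = run_macro_off(["fetch_issue", "add_label", "post_comment"])
on = run_macro_on("triage_issue")
```

In this toy version both conditions perform the same three actions, but MACRO-OFF spends three model calls to get there and MACRO-ON spends one. The real ablation measures that difference on the committed corpus rather than assuming it.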
