Back to Articles

A few weeks ago we released physics-intern, an autonomous agent for physics research. You gave it a problem in plain language (like "derive the Hawking temperature from the Euclidean path integral") and it ran the whole thing on its own: first, analyzing the question and decomposing the problem into pieces, then dispatching derivations to specialised sub-agents, writing and running verification code, finally critiquing its own results, and handing back a finished answer.

Nine roles with different instructions were orchestrated into a fixed pipeline, and it could run in one go, with no human in the loop.

That rigid design was deliberate, and it was there for a good reason: we built it to be measured. We wanted hard evidence that the structure we were betting on (divide the research problem into pieces to work each in a fresh context, cross-check and criticize, etc.) actually buys you something on difficult physics.

The way you get that evidence is to run on a benchmark like CritPt, and obviously such a benchmark cannot have a human in the loop. So our framework had to be fully autonomous. Ultimately it wasn't the goal, but it was the price of the experiment.