PhysicsIntern: from an Autonomous Benchmark-runner to a Research Sidekick

A few weeks ago we released physics-intern, an autonomous agent for physics research. You gave it a problem in plain language (like "derive the Hawking temperature from the Euclidean path integral") and it ran the whole thing on its own: it analyzed the question, decomposed the problem into pieces, dispatched derivations to specialised sub-agents, wrote and ran verification code, critiqued its own results, and handed back a finished answer.

Nine roles, each with its own instructions, were orchestrated into a fixed pipeline that ran end to end in one go, with no human in the loop.
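To make the shape of such a fixed pipeline concrete, here is a minimal sketch. It is not the physics-intern code: the `Role` and `Pipeline` classes, the `make_stub` helper, and the five stage names are hypothetical, chosen only to mirror the steps described above, and each role is a plain string-transforming stub where a real system would call a language model with role-specific instructions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Role:
    # Hypothetical: a role is a name plus a text-to-text function.
    name: str
    run: Callable[[str], str]


@dataclass
class Pipeline:
    roles: List[Role]
    trace: List[str] = field(default_factory=list)

    def __call__(self, problem: str) -> str:
        state = problem
        for role in self.roles:
            # Fixed order, no branching, no human checkpoint between stages.
            state = role.run(state)
            self.trace.append(role.name)
        return state


def make_stub(tag: str) -> Callable[[str], str]:
    # Stand-in for a model call: appends the stage name to the text.
    return lambda text: f"{text} -> {tag}"


# Illustrative stage names only; the actual nine roles are not listed here.
stages = ["analyze", "decompose", "derive", "verify", "critique"]
pipeline = Pipeline([Role(name, make_stub(name)) for name in stages])

answer = pipeline("derive the Hawking temperature")
print(answer)
print(pipeline.trace)
```

The point of the sketch is the control flow: once the problem goes in, every stage runs in a predetermined order and the caller only sees the final answer, which is what makes such a design easy to run unattended on a benchmark.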

That rigid design was deliberate: we built it to be measured. We wanted hard evidence that the structure we were betting on (dividing the research problem into pieces, working each piece in a fresh context, cross-checking and criticizing the results, and so on) actually buys you something on difficult physics.

The way to get that evidence is to run on a benchmark like CritPt, and such a benchmark cannot have a human in the loop, so our framework had to be fully autonomous. Autonomy was not the goal in itself; it was the price of the experiment.
