Every other week there's a new GPT-vs-Claude-vs-Gemini benchmark on coding or math or reasoning. None of them tell you whether the model can actually make a decision under uncertainty, where the answer isn't in the training data and the result shows up two weeks later in a P&L.
So I built a different kind of eval. Seven frontier LLMs, $100,000 of paper capital each, identical tools, identical prompts, identical data. Every Monday they pick stocks. The market grades them.
The project is 1rok. Live leaderboard: investingbench.vercel.app. The clock started January 20, 2026.
The contestants
GPT-5.5 (OpenAI)












