GPT and Claude failed Bridgewater's finance tests because the right answers were never public

Hedge fund Bridgewater and Thinking Machines Lab say a fine-tuned open-weight model outperforms the strongest AI models at evaluating financial documents, at a fraction of the cost. The numbers come from their own internal evaluation.

Friday, July 3, 2026

Investors get buried in news, analysis, corporate filings, and emails every day. According to a report from Bridgewater's AIA Labs and Thinking Machines Lab, the startup founded by former OpenAI CTO Mira Murati, reading isn't the real work. The real work is the constant stream of small, repeated judgment calls about what actually matters. That's the triage the researchers wanted to automate.

They defined six tasks drawn from an investor's daily routine. One example: deciding whether a financial article is relevant to an executive. Another: judging whether a central bank document signals the direction of future rate changes. Investors make these calls almost effortlessly, yet they can barely put their reasoning into words. The report gives a telling example: a headline about Trump's claim to Greenland gets flagged as irrelevant, while Trump's threat of new China tariffs is highly relevant, even though both touch on geopolitics and finance.

Frontier models failed in the authors' tests. Variants of Gemini, Claude, and GPT hit only about 50 percent accuracy with a basic prompt. Expert-written instructions and a three-tier rating system ("relevant and interesting," "relevant but uninteresting," "irrelevant") pushed accuracy into the mid-70s. That still fell short of the 80 percent threshold the authors set for trustworthy deployment.
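The report's exact scoring procedure isn't spelled out here, but the basic arithmetic of the benchmark is easy to illustrate. The following is a minimal sketch, assuming plain exact-match accuracy against expert labels on the three-tier scale, checked against the authors' 80 percent deployment bar. The names `Rating`, `accuracy`, and `ready_for_deployment` are hypothetical, and the labels are toy data, not figures from the report.

```python
from enum import Enum


class Rating(Enum):
    """Three-tier scale described in the report."""
    RELEVANT_INTERESTING = "relevant and interesting"
    RELEVANT_UNINTERESTING = "relevant but uninteresting"
    IRRELEVANT = "irrelevant"


# The authors' bar for trustworthy deployment.
DEPLOYMENT_THRESHOLD = 0.80


def accuracy(predicted: list[Rating], expert: list[Rating]) -> float:
    """Share of items where the model's rating exactly matches the expert's (assumed metric)."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert label lists must have the same length")
    if not expert:
        raise ValueError("need at least one labeled item")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


def ready_for_deployment(predicted: list[Rating], expert: list[Rating],
                         threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """True if accuracy clears the deployment threshold."""
    return accuracy(predicted, expert) >= threshold


if __name__ == "__main__":
    R = Rating
    # Toy labels: the model confuses one "relevant but uninteresting" item.
    expert = [R.IRRELEVANT, R.RELEVANT_INTERESTING, R.RELEVANT_UNINTERESTING, R.RELEVANT_INTERESTING]
    model = [R.IRRELEVANT, R.RELEVANT_INTERESTING, R.RELEVANT_INTERESTING, R.RELEVANT_INTERESTING]
    print(accuracy(model, expert))              # 0.75
    print(ready_for_deployment(model, expert))  # False
```

Under this assumed metric, a model in the mid-70s gets roughly three out of four calls right, which is still below the bar the authors set before trusting it with real triage.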
