NYU Stern School of Business professor Srikanth Jagabathula is co-author of Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III — reputedly the first paper to show that general-purpose AI models can pass the finance industry’s toughest exam.
Srikanth, can you talk me through the hypothesis you were looking to test?
Large language models have shown immense capabilities across a wide range of domains, and those capabilities have improved by leaps and bounds over the last several years. So we started by thinking about how LLMs perform in specialised, high-stakes domains. Finance, like any specialised domain, has concepts and terminology that are highly specific to it.
So when we take a large language model that is trained across a wide variety of data sources, the question is whether these models can work well in finance out of the box. That was the key question we wanted to answer. It was a valuable opportunity to create a benchmark, evaluate the LLMs, and understand how far their capabilities have come.
A good benchmark needs to have certain qualities. It needs to be representative of the skill set required in that particular domain. It needs to be widely regarded by people in the community as the right benchmark, so that good performance on it is believed to translate into real-world performance. For financial advising, the CFA is the gold standard.







