LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)

A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the core implementation from an NLP research paper when given the paper, a partially masked codebase, and explicit instructions?

This is harder than it sounds, and the benchmark's design is careful enough to be worth understanding in detail.

Sources: arXiv 2506.17335 | ACL Anthology | GitHub

What the benchmark actually tests

LMR-BENCH contains 28 reproduction tasks drawn from 23 NLP papers published in ACL, EMNLP, NAACL, and AAAI over the past five years. Each task follows the same structure:
