A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the core implementation from an NLP research paper when given the paper, a partially masked codebase, and explicit instructions?
This is harder than it sounds, and the benchmark's design is worth understanding in detail.
Sources: arXiv 2506.17335 | ACL Anthology | GitHub
What the benchmark actually tests
LMR-BENCH contains 28 reproduction tasks drawn from 23 NLP papers published at ACL, EMNLP, NAACL, and AAAI over the past five years. Each task follows the same structure:














