SWE-bench for LLMs

Large language models (LLMs) have evolved from text generators into systems that can reason, write code, and act autonomously within complex workflows. SWE-bench is a benchmark that evaluates LLMs on real-world software engineering tasks drawn from GitHub [1]. Given a codebase and a GitHub issue, a model must produce a patch that resolves the problem described. Each instance pairs an issue with tests taken from the corresponding pull request: one or more tests that fail before the fix and pass after it (Fail-to-Pass), along with additional tests that must keep passing so that no unrelated behavior is broken (Pass-to-Pass). Because the tasks come from actual GitHub issues and their associated fixes, the evaluation is grounded in how software evolves in practice.
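To make the pass criterion concrete, here is a minimal Python sketch of how a single instance might be judged once its tests have been run. The `Instance` class, its field names, and the `is_resolved` helper are hypothetical names chosen for illustration and are not the benchmark's actual schema or API; the sketch only assumes that test outcomes are available as a mapping from test name to pass/fail.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical, simplified view of one benchmark instance."""
    issue_text: str
    fail_to_pass: list[str]   # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]   # tests that pass before the fix and must keep passing


def is_resolved(instance: Instance, test_results: dict[str, bool]) -> bool:
    """Return True only if every Fail-to-Pass and Pass-to-Pass test passed.

    A test missing from test_results is treated as a failure (an assumption
    made here so that skipped or crashed tests never count as success).
    """
    required = instance.fail_to_pass + instance.pass_to_pass
    return all(test_results.get(name, False) for name in required)


example = Instance(
    issue_text="Parser crashes on empty input",
    fail_to_pass=["test_empty_input"],
    pass_to_pass=["test_basic_parse", "test_unicode"],
)

# The patch fixes the issue and keeps existing behavior intact.
good_run = {"test_empty_input": True, "test_basic_parse": True, "test_unicode": True}

# The patch fixes the issue but breaks an unrelated test.
regressed_run = {"test_empty_input": True, "test_basic_parse": False, "test_unicode": True}

print(is_resolved(example, good_run))
print(is_resolved(example, regressed_run))
```

Under this reading, a patch that fixes the reported problem but causes a regression elsewhere does not count as a resolution, which is the purpose of the Pass-to-Pass tests.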

Figure 1. SWE-bench evaluation pipeline

The SWE-bench evaluation pipeline has two main stages: generation and evaluation. In the generation stage, the model is prompted with an issue description and relevant repository context, and it produces a candidate patch. In the evaluation stage, this patch is tested inside a controlled Docker environment: the repository is cloned, the model's patch is applied to the targeted file(s), and the tests are run to check whether the issue is resolved (the Fail-to-Pass tests now pass) without breaking other functionality (the Pass-to-Pass tests still pass).
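The two stages described above can be sketched as a small driver loop. This is an illustrative outline only, not the benchmark's actual harness: `run_pipeline`, the task dictionary keys, and both stub functions are hypothetical, and the evaluation stub stands in for the real work of cloning the repository, applying the patch, and running tests inside a container.

```python
from typing import Callable

# Hypothetical signatures for the two stages.
GenerateFn = Callable[[str, str], str]    # (issue, repository context) -> patch text
EvaluateFn = Callable[[dict, str], bool]  # (task, patch) -> resolved?


def run_pipeline(tasks: list[dict], generate: GenerateFn, evaluate: EvaluateFn) -> dict[str, bool]:
    """Run generation, then evaluation, for each task and record the outcome."""
    results: dict[str, bool] = {}
    for task in tasks:
        # Stage 1: the model proposes a candidate patch.
        patch = generate(task["issue"], task["context"])
        # Stage 2: the patch is checked in an isolated environment.
        results[task["id"]] = evaluate(task, patch)
    return results


# Toy stand-ins so the sketch runs without a model or a container.
def stub_generate(issue: str, context: str) -> str:
    # A real implementation would prompt an LLM here.
    return "" if "unclear" in issue else "--- a/parser.py\n+++ b/parser.py\n"


def stub_evaluate(task: dict, patch: str) -> bool:
    # A real implementation would clone the repository, apply the patch,
    # and run the tests in a sandbox. Here an empty patch simply counts
    # as unresolved.
    return bool(patch.strip())


tasks = [
    {"id": "task-1", "issue": "Crash on empty input", "context": "parser.py"},
    {"id": "task-2", "issue": "unclear report", "context": "utils.py"},
]

results = run_pipeline(tasks, stub_generate, stub_evaluate)
print(results)
```

Passing the two stages in as functions keeps generation and evaluation decoupled, so the same evaluation step can be reused to compare different models or prompting strategies.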