Are LLMs Truly Solving Software Problems — or Are Agents Doing It?

SWE-bench for LLMs

Large language models (LLMs) have evolved from text generators into systems that can reason, write code, and act autonomously within complex workflows. SWE-bench is a benchmark that evaluates LLMs on real-world software engineering tasks drawn from GitHub [1]. Given a codebase and a GitHub issue, a model must produce a patch that resolves the problem described. Each instance pairs the issue with one or more tests that fail before the corresponding pull request and pass after it (Fail-to-Pass), along with additional tests that must keep passing so that no unrelated behavior is broken (Pass-to-Pass). Because the tasks come from actual GitHub issues and their associated fixes, the evaluation is grounded in how software evolves in practice.
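To make the Fail-to-Pass and Pass-to-Pass criteria concrete, here is a minimal Python sketch of the pass condition. The `TaskInstance` class, its field names, and the `is_resolved` helper are hypothetical simplifications for illustration, not the benchmark's actual schema or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskInstance:
    """Hypothetical, simplified view of one benchmark instance."""
    issue_text: str
    fail_to_pass: List[str]                                 # tests the fix must turn green
    pass_to_pass: List[str] = field(default_factory=list)   # tests that must stay green


def is_resolved(instance: TaskInstance, test_results: Dict[str, bool]) -> bool:
    """Return True only if every listed test passed after the patch.

    `test_results` maps a test name to True (passed) or False (failed).
    A test missing from the results is treated as a failure.
    """
    fixed = all(test_results.get(t, False) for t in instance.fail_to_pass)
    unbroken = all(test_results.get(t, False) for t in instance.pass_to_pass)
    return fixed and unbroken


example = TaskInstance(
    issue_text="Crash on empty input",
    fail_to_pass=["test_empty_input"],
    pass_to_pass=["test_normal_input"],
)

# The patch fixes the bug but breaks existing behavior, so it does not count.
print(is_resolved(example, {"test_empty_input": True, "test_normal_input": False}))
```

The point of the sketch is that fixing the reported bug is not enough on its own: a patch that turns the Fail-to-Pass tests green while breaking a Pass-to-Pass test is still counted as unresolved.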

Figure 1. SWE-bench evaluation pipeline

The SWE-bench evaluation pipeline has two main stages: generation and evaluation. In the generation stage, the model is prompted with an issue description and relevant repository context, and it produces a candidate patch. In the evaluation stage, this patch is tested inside a controlled Docker environment, where the repository is cloned, the model’s patch is applied to the targeted file(s), and the full test suite is run to check whether the issue is resolved without breaking other functionality.
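The two stages can be sketched as a small Python function. This is an illustrative outline under stated assumptions, not the benchmark's real harness: `run_instance`, `generate_patch`, and `evaluate_patch` are hypothetical names, and the stand-in functions below replace the actual model call and the Docker-based test run.

```python
from typing import Callable, Dict, Tuple


def run_instance(
    issue: str,
    repo_context: str,
    generate_patch: Callable[[str, str], str],
    evaluate_patch: Callable[[str], Dict[str, bool]],
) -> Tuple[str, bool]:
    """Run one instance through generation, then evaluation."""
    # Stage 1 (generation): the model sees the issue and repository context
    # and returns a candidate patch as text.
    patch = generate_patch(issue, repo_context)

    # Stage 2 (evaluation): the patch is applied and the tests are run.
    # In the real pipeline this happens inside a Docker container; here the
    # callable simply returns a mapping of test name -> passed.
    test_results = evaluate_patch(patch)

    # An empty result set (for example, the patch failed to apply) is
    # treated as unresolved.
    resolved = bool(test_results) and all(test_results.values())
    return patch, resolved


# Stand-ins for the model and the test harness (assumptions for this demo).
def fake_model(issue: str, repo_context: str) -> str:
    return "--- a/app.py\n+++ b/app.py\n"


def fake_harness(patch: str) -> Dict[str, bool]:
    return {"test_bug_is_fixed": True, "test_existing_feature": True}


patch, resolved = run_instance(
    "Crash on empty input", "def f(): ...", fake_model, fake_harness
)
print(resolved)  # prints True
```

Passing the model and the harness in as callables keeps the two stages separate, which mirrors the pipeline's design: the same evaluation step can score patches from any model or agent.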
