SWE-bench Multilingual: A Comprehensive Guide to the Multi-Language Programming Benchmark

Overview

SWE-bench, a widely used benchmark for evaluating the coding capabilities of large language models, has long focused primarily on Python. SWE-bench (Software Engineering Benchmark) was originally released by a research team at Princeton University in 2023. Its core idea is to extract resolved Issues and their corresponding Pull Requests from real GitHub repositories and turn them into automated evaluation tasks. Unlike traditional code generation benchmarks such as HumanEval and MBPP, which only test function-level code generation, SWE-bench requires models to understand the full codebase context, locate the relevant files, and generate a patch that passes the tests, which is much closer to the daily workflow of real software engineers.

Now, the release of SWE-bench Multilingual extends the evaluation scope to 9 mainstream programming languages, providing a more comprehensive AI programming capability assessment framework that better reflects real-world development scenarios.

This benchmark contains 300 carefully curated tasks sourced from Pull Requests across 42 real GitHub repositories, covering C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust (JavaScript and TypeScript are reported together as JS/TS in the table below). It aims to answer a critical question: can large language models truly solve real software engineering problems in multilingual environments?


Benchmark Design and Language Distribution

Task Sources and Composition

All tasks in SWE-bench Multilingual come from real GitHub Pull Requests, spanning multiple domains including web frameworks, data processing tools, core utility libraries, and general-purpose libraries. Pull Requests (PRs) are the standard process for code review and merging in modern software development. Choosing PRs as the task source means each task has a clear problem description (Issue), a verified solution (merged code changes), and automated tests (to verify whether the fix is correct). These three elements form a natural evaluation loop: the model must generate a code patch based on the problem description, and that patch must pass the test cases added in the original PR to be considered successful. Each task undergoes rigorous verification to ensure clear problem definitions and unambiguous testing criteria.
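
The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the `TaskInstance` fields and the `is_resolved` helper are hypothetical names, and test outcomes are passed in as a plain dictionary rather than produced by running a real test suite.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical container for one task; field names are illustrative."""
    problem_statement: str   # text of the Issue shown to the model
    reference_patch: str     # merged change from the original PR
    tests_added_by_pr: list[str] = field(default_factory=list)


def is_resolved(task: TaskInstance, test_results: dict[str, bool]) -> bool:
    """A task counts as resolved only if every test added by the PR passes.

    `test_results` maps test name -> passed, as observed after applying the
    model's patch. A test missing from the results is treated as a failure.
    """
    if not task.tests_added_by_pr:
        return False  # nothing to verify against, so do not claim success
    return all(test_results.get(name, False) for name in task.tests_added_by_pr)


# Made-up example task and outcomes, for illustration only.
task = TaskInstance(
    problem_statement="Parser crashes on empty input",
    reference_patch="(diff omitted)",
    tests_added_by_pr=["test_empty_input", "test_whitespace_only"],
)
all_pass = is_resolved(task, {"test_empty_input": True, "test_whitespace_only": True})
one_missing = is_resolved(task, {"test_empty_input": True})
print(all_pass, one_missing)
```

The all-or-nothing check mirrors the rule stated above: a patch that fixes only part of the problem does not count as a success.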

The task distribution across languages is as follows:

Language	Number of Tasks
Ruby	44
Java	43
JS/TS	43
PHP	43
Rust	43
Go	42
C	30
C++	12

As shown, the task distribution is relatively balanced across six of the eight language groups, with 42 to 44 tasks each. The exceptions are C, with 30 tasks, and C++, with only 12. The small C++ count is closely related to the unique challenges C++ projects face in automated test construction: complex build systems (CMake, Bazel, etc.), extensive compilation dependencies, platform-specific configurations, and long compilation times all make environment setup harder. In addition, C++'s template metaprogramming, multiple inheritance, and complex memory management make deterministic test verification more challenging, which helps explain why relatively few C++ tasks are available when building standardized evaluation environments.
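
As a quick sanity check on the table, the short Python sketch below sums the per-language counts and computes each language group's share of the benchmark. The counts are copied from the table; the variable and function names are only illustrative.

```python
# Task counts copied from the table above.
TASKS_PER_LANGUAGE = {
    "Ruby": 44, "Java": 43, "JS/TS": 43, "PHP": 43,
    "Rust": 43, "Go": 42, "C": 30, "C++": 12,
}

total = sum(TASKS_PER_LANGUAGE.values())


def share(language: str) -> float:
    """Percentage of all tasks that belong to `language`."""
    return 100.0 * TASKS_PER_LANGUAGE[language] / total


print(f"total tasks: {total}")
for language, count in TASKS_PER_LANGUAGE.items():
    print(f"{language:<6}{count:>4}{share(language):>7.1f}%")
```

The counts add up to the 300 tasks stated earlier. With only 12 C++ tasks, a single task moves the C++ resolution rate by roughly 8.3 percentage points, so per-language figures for C++ should be read with more caution than those for the larger groups.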

Evaluation Methodology

SWE-bench Multilingual employs a standardized evaluation environment to ensure fair comparison across different language models. The core metric is % Resolved (resolution rate)—the percentage of instances successfully solved by the model out of the total 300 instances.
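
The metric itself is simple arithmetic, sketched below in Python. The function name and the instance identifiers are hypothetical; the only facts taken from the text are the definition of the metric and the total of 300 instances.

```python
def percent_resolved(resolved_ids: set[str], total_instances: int = 300) -> float:
    """Return % Resolved: resolved instances as a percentage of all instances.

    The denominator is the full benchmark size, so any instance that is not
    in `resolved_ids` counts as unresolved.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if len(resolved_ids) > total_instances:
        raise ValueError("more resolved instances than total instances")
    return 100.0 * len(resolved_ids) / total_instances


# Hypothetical run that resolved 150 of the 300 instances.
resolved = {f"instance-{i}" for i in range(150)}
print(f"{percent_resolved(resolved):.1f}% resolved")
```

For the made-up run above, this prints `50.0% resolved`.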

Evaluation dimensions include:

Resolution rate by repository: Reveals performance differences across project types
Resolution rate by language: Provides a clear view of the model's multilingual capability distribution
Cost vs. resolution rate: Measures the balance between efficiency and effectiveness
Step limit vs. resolution rate: Evaluates the model's reasoning efficiency
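
The dimensions listed above are all aggregations over per-instance results. The Python sketch below shows one way to compute them, under stated assumptions: `RunRecord` and its fields are hypothetical, the sample records are made up, and filtering finished runs by steps used is only an approximation of re-running an agent with a lower step limit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-instance result; all field names are illustrative."""
    instance_id: str
    language: str
    repo: str
    resolved: bool
    cost_usd: float
    steps_used: int


def resolution_rate_by(records: list[RunRecord], key: str) -> dict[str, float]:
    """Percent resolved within each group; `key` is 'language' or 'repo'."""
    totals: dict[str, int] = defaultdict(int)
    solved: dict[str, int] = defaultdict(int)
    for record in records:
        group = getattr(record, key)
        totals[group] += 1
        solved[group] += int(record.resolved)
    return {group: 100.0 * solved[group] / totals[group] for group in totals}


def resolved_within_step_limit(records: list[RunRecord], step_limit: int) -> float:
    """Percent of all instances resolved using at most `step_limit` steps."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if r.resolved and r.steps_used <= step_limit)
    return 100.0 * ok / len(records)


def cost_per_resolved(records: list[RunRecord]) -> float:
    """Total spend divided by the number of resolved instances."""
    resolved_count = sum(1 for r in records if r.resolved)
    if resolved_count == 0:
        raise ValueError("no resolved instances")
    return sum(r.cost_usd for r in records) / resolved_count


# Made-up records, for illustration only.
records = [
    RunRecord("a", "Rust", "example/rs-lib", True, 0.40, 12),
    RunRecord("b", "Rust", "example/rs-lib", False, 0.90, 50),
    RunRecord("c", "Go", "example/go-svc", True, 0.20, 8),
    RunRecord("d", "Go", "example/go-svc", True, 0.30, 30),
]
by_language = resolution_rate_by(records, "language")
by_repo = resolution_rate_by(records, "repo")
within_20_steps = resolved_within_step_limit(records, 20)
print(by_language, by_repo, within_20_steps)
```

Keeping the raw per-instance records and deriving every view from them means the by-language and by-repository breakdowns always agree with the overall resolution rate.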

Relationship with the SWE-bench Family

SWE-bench has evolved into a complete benchmark family, with each subset serving a different focus:

SWE-bench Full: The complete dataset containing 2,294 instances
SWE-bench Verified: A subset of 500 instances validated by human reviewers
SWE-bench Lite: 300 selected instances designed to reduce evaluation costs
SWE-bench Multimodal: 517 problems involving visual elements
SWE-bench Multilingual: 300 tasks spanning 9 programming languages

The unique value of the Multilingual version lies in breaking the previous benchmarks' sole dependence on Python, more accurately reflecting the multilingual nature of modern software development. In real enterprise projects, a system often uses multiple languages simultaneously—for example, Go for backend microservices, TypeScript for the frontend, Rust for performance-sensitive modules, and C for low-level drivers—making multilingual evaluation capability crucial for measuring the practical utility of AI programming assistants.

Significance for the AI Programming Field

Revealing True Capability Boundaries

In reality, software engineers typically need to master multiple programming languages. The emergence of SWE-bench Multilingual enables more accurate assessment of AI programming assistants' actual capabilities in multilingual scenarios. A model that excels at Python doesn't necessarily handle Rust's ownership system or Go's concurrency patterns equally well.

Rust's ownership system is its most distinctive language feature: a compile-time borrow checker guarantees memory safety without garbage collection. To produce a valid patch, a model must precisely track variable lifetimes, the rules for mutable and immutable references, and ownership transfer semantics. Generating code that satisfies these rules is considerably harder than generating syntactically correct Python, because the compiler rejects any code that violates them.

Similarly, Go has built-in goroutines and channels as concurrency primitives. Its CSP (Communicating Sequential Processes) concurrency model requires correct handling of goroutine lifecycle management, channel directionality and buffering strategies, and race condition avoidance. When processing Go concurrent code, AI models need not only syntactic correctness but also the ability to reason about program behavior during concurrent execution, involving deep understanding of timing, synchronization, and resource sharing.

Driving Balanced Model Development

Through language-specific evaluation results, researchers and developers can clearly identify a model's weaknesses in specific languages, enabling targeted improvements to training data and strategies. This is essential for building truly general-purpose AI programming assistants. In current LLM training data, Python code typically accounts for a much higher proportion than other languages (likely related to the large number of Python repositories on GitHub and Python's widespread use in education and data science), and this data imbalance is a plausible cause of capability disparities across languages.

Standardized Evaluation Framework

This benchmark supports multiple evaluation perspectives, including comparisons between open-source and closed-source models, comparisons across different Agent frameworks (such as mini-SWE-agent), and cost-effectiveness analysis, which gives the industry a unified measurement standard. In SWE-bench evaluation, Agent frameworks serve as the interaction layer between models and code repositories. Rather than generating code in a single pass, Agents simulate a developer's workflow: browsing file structures, searching relevant code, reading documentation, editing files, running tests, and iteratively revising based on feedback. Agent frameworks differ in tool design, prompting strategies, and interaction loops, and these design choices significantly affect final task resolution rates.
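
The interaction loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the design of mini-SWE-agent or any other real framework: the `run_agent` function, the `(tool, argument)` action format, the tool names, and the file path are all hypothetical, and a fixed script stands in for the model.

```python
from typing import Callable

Action = tuple[str, str]  # (tool name, argument); a hypothetical action format


def run_agent(
    policy: Callable[[list[str]], Action],
    tools: dict[str, Callable[[str], str]],
    step_limit: int,
) -> tuple[bool, list[str]]:
    """Run an observe-act loop until the policy submits or steps run out.

    Returns (submitted, history), where history holds one line per tool call.
    """
    history: list[str] = []
    for _ in range(step_limit):
        name, arg = policy(history)      # the "model" picks the next action
        if name == "submit":
            return True, history
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append(f"{name}({arg!r}) -> {observation}")  # feedback for next step
    return False, history                # step limit reached without submitting


# A fixed script stands in for the model; a real policy would call an LLM.
SCRIPT: list[Action] = [
    ("search", "parse_header"),
    ("read", "src/parser.c"),
    ("edit", "src/parser.c"),
    ("run_tests", ""),
    ("submit", ""),
]


def scripted_policy(history: list[str]) -> Action:
    return SCRIPT[len(history)]


# Stub tools that return canned observations instead of touching a repository.
tools = {
    "search": lambda query: "src/parser.c",
    "read": lambda path: "(file contents)",
    "edit": lambda path: "patch applied",
    "run_tests": lambda _: "all tests passed",
}

submitted, trace = run_agent(scripted_policy, tools, step_limit=10)
cut_short, _ = run_agent(scripted_policy, tools, step_limit=3)
print(submitted, len(trace), cut_short)
```

With a step limit of 10 the scripted run submits after four tool calls; with a limit of 3 it is cut off before it can submit, which is the kind of effect the step limit vs. resolution rate comparison is meant to capture.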

Summary and Outlook

SWE-bench Multilingual fills the gap in multilingual AI programming evaluation. Its 300 tasks from real projects provide a rigorous and practical assessment framework. As more models participate in evaluation, we will gain deeper insights into AI's multilingual programming capabilities.

For developers, checking how models perform in the languages they use most will help in selecting suitable AI programming tools. For researchers, this benchmark highlights challenges that remain in multilingual code generation. If more languages (such as Kotlin, Swift, and Scala) and more task types (such as cross-language interoperability and multi-repository collaboration) are added in the future, SWE-bench Multilingual could become a standard reference for evaluating AI programming capabilities.