mini-SWE-agent Roulette Mode: Why Randomly Switching Between LLMs Actually Performs Better

A Counterintuitive Discovery: Randomly Switching Models Outperforms Any Single Model

What happens if your AI Agent randomly selects a different large language model at each reasoning step? Intuitively, this sounds like a chaotic strategy. But the latest experiment from the SWE-agent team shows that letting mini-SWE-agent randomly switch between GPT-5 and Claude Sonnet 4 actually scores higher on SWE-bench than either model running alone.

This experiment, dubbed "Roulette Mode," involves an extremely simple core change—just replacing model.query(history) with random.choice([model1, model2]).query(history), and nothing else. The prompt stays exactly the same, the architecture stays exactly the same, yet it achieves a 1+1>2 effect.

rss source: [mini-SWE-agent] Roulette mode!

What is mini-SWE-agent: A Minimalist Agent in Under 100 Lines of Code

To understand the significance of this experiment, you first need to understand mini-SWE-agent's design philosophy. Unlike the feature-rich swe-agent, mini is a minimalist software engineering Agent whose core Agent class is under 100 lines of code, yet achieves competitive scores on the SWE-bench leaderboard.

Three core design principles of mini-SWE-agent:

The only tool is bash: It doesn't use any LM tool-calling interfaces, meaning it can work with any model. When running in a sandbox environment, it doesn't even need any additional packages installed.
Completely linear history: Each step simply appends to the message list. The trajectory is identical to the messages passed to the LM, making it ideal for debugging and fine-tuning.
Stateless execution: Each action is executed independently via subprocess.run, rather than maintaining a stateful shell session. This makes sandbox execution and scaling extremely simple—just replace subprocess.run with docker exec.

It's precisely this minimalist design that makes implementing Roulette Mode exceptionally clean. No complex routing logic, no model selection strategy—just pure random switching.

Roulette Mode Results: Synergistic Gains from GPT-5 and Sonnet 4

Full SWE-bench Test: The Combination Beats Individual Models

On the full 500-instance SWE-bench verified test, the random combination of GPT-5 and Sonnet 4 achieved a higher score than either model running independently. The team also tried an alternating mode (rather than random switching), solving 333 out of 500 instances (66.6%), which also surpassed both models' individual performance.

Small-Scale Multi-Model Combination Comparison

On 50 randomly sampled instances, the team tested more combinations:

Model Combination	Score (50 instances)
GPT-5 + Sonnet 4	39
GPT-5 + Sonnet 4 + Gemini 2.5 Pro	33
GPT-5 + Gemini 2.5 Pro	31
GPT-5 + GPT-5-mini	31
GPT-5 mini + GPT-5 nano	20

Compared to single-model baselines:

Single Model	Score (50 instances)
Sonnet 4	33
GPT-5	32
GPT-5-mini	32
Gemini 2.5 Pro	29
GPT-5-nano	16

A key finding: only the GPT-5 and Sonnet 4 combination achieved performance beyond either individual model. These two models happen to be in a closely matched competitive state. When the performance gap between combined models is large (e.g., GPT-5 and Gemini 2.5 Pro), the combined score and cost simply fall between the two models, producing no synergistic gain.

It's worth noting that the 50-instance sample size has statistical power limitations. The team candidly pointed out that the randomly sampled subset happened to skew easier, with all scores slightly higher than those on the full 500-instance set.

Cost Analysis: What's the Cost-Effectiveness of Roulette Mode?

Relationship Between Performance and Inference Steps

Experiments show that Roulette Mode's performance gains reach diminishing marginal returns at around 50 steps. Interestingly, this curve more closely resembles GPT-5's behavior pattern rather than Sonnet 4's slow-climbing characteristic. The team speculates this is because either model can decide to submit results and end the run, so the overall behavior more closely resembles the model that "submits earlier."

Approximately 30 Cents Per Instance

At the maximum performance point, the average cost per instance is approximately 30 cents, essentially falling between the individual costs of the two models. Cost grows in an S-curve pattern, echoing the diminishing marginal returns in performance improvement.

You might not have noticed, but cost analysis for Agent-based systems is inherently tricky—most overhead is spent on instances that can't be solved, so average cost is highly dependent on runtime limits (such as step count limits).

Why Does Randomly Switching LLMs Improve Performance?

While the team hasn't provided a definitive theoretical explanation, the phenomenon can be understood from several angles:

Diversity Hypothesis: Different models have their own strengths and blind spots across different types of programming tasks. Random switching effectively introduces "cognitive diversity," reducing the probability of a single model consistently making errors on specific types of problems. This is similar to the core idea behind Ensemble Methods in machine learning—combining multiple weakly correlated predictors often outperforms any single predictor.

Breaking Mental Fixation: When one model gets stuck in a particular direction, switching to another model may bring an entirely different problem-solving approach, acting as an implicit "restart" mechanism.

Performance Parity is a Prerequisite: This effect only appears between models of comparable performance, suggesting that both models need to be "good enough" to complement each other rather than drag each other down.

How to Use Roulette Mode and Future Outlook

The team has provided a convenient way to use it—just switch to the swebench_roulette configuration when running mini-extra:

mini-extra swebench \\
  --subset verified \\
  --split test \\
  --shuffle \\
  -o roulette-sonnet4-gpt5 \\
  --workers 20 \\
  -c swebench_roulette

The deeper implication is this: in Agent system design, perhaps we shouldn't obsess over finding the "best single model," but instead think about how to achieve system-level performance that surpasses individuals through model combinations. Currently, GPT-5 and Sonnet 4 happen to be at a performance "sweet spot" close enough that even this simple random strategy produces positive results.

Future directions worth exploring include: smarter model routing strategies (rather than pure random), selecting different models for different task phases, and validating the generalizability of this finding across more Agent scenarios. But at the very least, this experiment tells us—sometimes, the simplest approach is the best approach.