mini-SWE-agent Roulette Mode: Why Randomly Switching Between LLMs Actually Performs Better

Randomly switching between GPT-5 and Sonnet 4 in mini-SWE-agent outperforms either model alone on SWE-bench.
The SWE-agent team discovered that mini-SWE-agent's "Roulette Mode"—randomly switching between GPT-5 and Claude Sonnet 4 at each step—scores higher on SWE-bench than either model individually. This simple random.choice() swap requires no prompt or architecture changes. The effect only works between performance-matched models, suggesting a diversity hypothesis similar to ensemble methods in ML.
A Counterintuitive Discovery: Randomly Switching Models Outperforms Any Single Model
What happens if your AI Agent randomly selects a different large language model at each reasoning step? Intuitively, this sounds like a chaotic strategy. But the latest experiment from the SWE-agent team shows that letting mini-SWE-agent randomly switch between GPT-5 and Claude Sonnet 4 actually scores higher on SWE-bench than either model running alone.
This experiment, dubbed "Roulette Mode," involves an extremely simple core change—just replacing model.query(history) with random.choice([model1, model2]).query(history), and nothing else. The prompt stays exactly the same, the architecture stays exactly the same, yet it achieves a 1+1>2 effect.
![rss source: [mini-SWE-agent] Roulette mode!](/media/screenshots/source/10610_0.png)
What is mini-SWE-agent: A Minimalist Agent in Under 100 Lines of Code
To understand the significance of this experiment, you first need to understand mini-SWE-agent's design philosophy. Unlike the feature-rich swe-agent, mini is a minimalist software engineering Agent whose core Agent class is under 100 lines of code, yet achieves competitive scores on the SWE-bench leaderboard.
Three core design principles of mini-SWE-agent:
- The only tool is bash: It doesn't use any LM tool-calling interfaces, meaning it can work with any model. When running in a sandbox environment, it doesn't even need any additional packages installed.
- Completely linear history: Each step simply appends to the message list. The trajectory is identical to the messages passed to the LM, making it ideal for debugging and fine-tuning.
- Stateless execution: Each action is executed independently via
subprocess.run, rather than maintaining a stateful shell session. This makes sandbox execution and scaling extremely simple—just replacesubprocess.runwithdocker exec.
It's precisely this minimalist design that makes implementing Roulette Mode exceptionally clean. No complex routing logic, no model selection strategy—just pure random switching.
Roulette Mode Results: Synergistic Gains from GPT-5 and Sonnet 4
Full SWE-bench Test: The Combination Beats Individual Models
On the full 500-instance SWE-bench verified test, the random combination of GPT-5 and Sonnet 4 achieved a higher score than either model running independently. The team also tried an alternating mode (rather than random switching), solving 333 out of 500 instances (66.6%), which also surpassed both models' individual performance.
Small-Scale Multi-Model Combination Comparison
On 50 randomly sampled instances, the team tested more combinations:
| Model Combination | Score (50 instances) |
|---|---|
| GPT-5 + Sonnet 4 | 39 |
| GPT-5 + Sonnet 4 + Gemini 2.5 Pro | 33 |
| GPT-5 + Gemini 2.5 Pro | 31 |
| GPT-5 + GPT-5-mini | 31 |
| GPT-5 mini + GPT-5 nano | 20 |
Compared to single-model baselines:
| Single Model | Score (50 instances) |
|---|---|
| Sonnet 4 | 33 |
| GPT-5 | 32 |
| GPT-5-mini | 32 |
| Gemini 2.5 Pro | 29 |
| GPT-5-nano | 16 |
A key finding: only the GPT-5 and Sonnet 4 combination achieved performance beyond either individual model. These two models happen to be in a closely matched competitive state. When the performance gap between combined models is large (e.g., GPT-5 and Gemini 2.5 Pro), the combined score and cost simply fall between the two models, producing no synergistic gain.
It's worth noting that the 50-instance sample size has statistical power limitations. The team candidly pointed out that the randomly sampled subset happened to skew easier, with all scores slightly higher than those on the full 500-instance set.
Cost Analysis: What's the Cost-Effectiveness of Roulette Mode?
Relationship Between Performance and Inference Steps
Experiments show that Roulette Mode's performance gains reach diminishing marginal returns at around 50 steps. Interestingly, this curve more closely resembles GPT-5's behavior pattern rather than Sonnet 4's slow-climbing characteristic. The team speculates this is because either model can decide to submit results and end the run, so the overall behavior more closely resembles the model that "submits earlier."
Approximately 30 Cents Per Instance
At the maximum performance point, the average cost per instance is approximately 30 cents, essentially falling between the individual costs of the two models. Cost grows in an S-curve pattern, echoing the diminishing marginal returns in performance improvement.
You might not have noticed, but cost analysis for Agent-based systems is inherently tricky—most overhead is spent on instances that can't be solved, so average cost is highly dependent on runtime limits (such as step count limits).
Why Does Randomly Switching LLMs Improve Performance?
While the team hasn't provided a definitive theoretical explanation, the phenomenon can be understood from several angles:
Diversity Hypothesis: Different models have their own strengths and blind spots across different types of programming tasks. Random switching effectively introduces "cognitive diversity," reducing the probability of a single model consistently making errors on specific types of problems. This is similar to the core idea behind Ensemble Methods in machine learning—combining multiple weakly correlated predictors often outperforms any single predictor.
Breaking Mental Fixation: When one model gets stuck in a particular direction, switching to another model may bring an entirely different problem-solving approach, acting as an implicit "restart" mechanism.
Performance Parity is a Prerequisite: This effect only appears between models of comparable performance, suggesting that both models need to be "good enough" to complement each other rather than drag each other down.
How to Use Roulette Mode and Future Outlook
The team has provided a convenient way to use it—just switch to the swebench_roulette configuration when running mini-extra:
mini-extra swebench \\
--subset verified \\
--split test \\
--shuffle \\
-o roulette-sonnet4-gpt5 \\
--workers 20 \\
-c swebench_roulette
The deeper implication is this: in Agent system design, perhaps we shouldn't obsess over finding the "best single model," but instead think about how to achieve system-level performance that surpasses individuals through model combinations. Currently, GPT-5 and Sonnet 4 happen to be at a performance "sweet spot" close enough that even this simple random strategy produces positive results.
Future directions worth exploring include: smarter model routing strategies (rather than pure random), selecting different models for different task phases, and validating the generalizability of this finding across more Agent scenarios. But at the very least, this experiment tells us—sometimes, the simplest approach is the best approach.
Related articles

The Clotilda: Underwater Archaeological Discovery of America's Last Slave Ship
The Clotilda, America's last slave ship, was discovered by underwater archaeologists in Alabama nearly 160 years after sinking. Learn about the search, key evidence, and other slave trade shipwreck discoveries.

Sakana AI in Practice: Reshaping Banking Lending Operations with AI Agents — Technology and Strategy
Deep dive into how Sakana AI applies AI Agents to banking lending operations, covering end-to-end support from information gathering to approval document generation, plus technical challenges and human-AI collaboration design.

Instagram Enters the Living Room: Long-Form Video, Series, and Live Streaming Challenge Netflix
Instagram is building a TV app with long-form video, episodic series, and live streaming to challenge Netflix. Deep analysis of its living room strategy and industry impact.