Evaluating AI Agents on Challenging Long-Horizon SWE Tasks
| Model | % Resolved | (+/-) | Link |
|---|---|---|---|
| SWE-Agent + claude-4-5-Sonnet | 43.72 | | 🔗 |
| SWE-Agent + claude-4-Sonnet | 42.70 | | 🔗 |
| SWE-Agent + claude-4-5-haiku | 39.45 | | 🔗 |
| SWE-Agent + gpt-5-2025-08-07 (High) | 36.30 | | 🔗 |
| SWE-Agent + glm-4.5 | 35.52 | | 🔗 |
| SWE-Agent + kimi-k2-instruct | 27.67 | | 🔗 |
| SWE-Agent + gpt-oss-120b | 16.20 | | 🔗 |
Note that these results are initial runs and subject to change, pending an official announcement from Scale. Models are run with uncapped cost and a turn limit of 250.

The (+/-) column shows the 95% confidence interval (margin of error) for each score, calculated using binomial proportion statistics (total problems: 730).
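For reference, a margin of error of this kind can be reproduced with a standard binomial normal-approximation (Wald) interval. The snippet below is a minimal sketch under that assumption, using n = 730 from the note above; the leaderboard's exact method may differ (e.g., a Wilson or Clopper-Pearson interval).

```python
import math

def binomial_margin_of_error(pct_resolved: float, n: int = 730, z: float = 1.96) -> float:
    """95% margin of error for a resolved rate, via the normal (Wald) approximation.

    `pct_resolved` is a percentage (e.g. 43.72) and the result is in percentage
    points; n = 730 is the total problem count quoted in the note above.
    """
    p = pct_resolved / 100.0
    return z * math.sqrt(p * (1.0 - p) / n) * 100.0

# Example: the top leaderboard entry, 43.72% of 730 problems resolved
print(f"+/- {binomial_margin_of_error(43.72):.2f} percentage points")  # roughly +/- 3.60
```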
            
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds on the best practices of SWE-Bench but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set, with open access to problems sourced from 11 repositories; a held-out set of 12 repositories; and a commercial set of 18 proprietary repositories from early-stage startups with which we have formal partnership agreements. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models under a unified scaffold, we observe that their performance on SWE-Bench Pro remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories to more clearly characterize the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous, professional-level software engineering agents.
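As an illustration of how the headline metric is computed, Pass@1 here is simply the fraction of problems whose patch passes the task's verification on a single attempt. The sketch below assumes a hypothetical per-problem results format (one record per problem with a boolean `resolved` field); it is not the official evaluation harness or schema.

```python
from typing import Iterable, Mapping

def pass_at_1(results: Iterable[Mapping[str, bool]]) -> float:
    """Fraction of problems resolved on a single attempt, as a percentage.

    Assumes a hypothetical per-problem record with a boolean `resolved` field;
    this is an illustrative schema, not the official evaluation output.
    """
    records = list(results)
    if not records:
        return 0.0
    resolved = sum(1 for r in records if r["resolved"])
    return 100.0 * resolved / len(records)

# Toy example: 2 of 3 problems resolved -> Pass@1 = 66.67
print(f"Pass@1 = {pass_at_1([{'resolved': True}, {'resolved': False}, {'resolved': True}]):.2f}%")
```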
If you found SWE-Bench Pro helpful for your work, please cite it as follows: