Anthropic has released an upgraded version of its Claude 3.5 Sonnet language model that achieves a state-of-the-art score of 49% on SWE-bench Verified, a challenging benchmark of an AI model's ability to solve real-world software engineering tasks. This article details how Anthropic built an "agent" system around Claude 3.5 Sonnet, consisting of a simple prompt and two general-purpose tools, to reach this score. The authors also discuss the challenges they faced in using SWE-bench Verified: the long duration and high token cost of complex tasks, the need to resolve system issues in grading, and the difficulty of evaluating a model that cannot view files saved to the filesystem. They conclude by expressing confidence that developers building with the new Claude 3.5 Sonnet will find ways to improve SWE-bench scores even further.

rzhenguniq

Unsupervised

[English] Claude 3.5 Sonnet achieves a 49% success rate on SWE-bench