Anthropic's Claude 4 Beats o3 on Code Benchmarks and Ships New Features

·May 22, 2025 07:43 PM

Claude 4 models are meaningfully better on code benchmarks than o3, which had been the best reasoning model so far. Anthropic is also bundling web search, a Python execution sandbox, and a files API 🤯 As far as dev-ex for agents goes, I think Anthropic has pulled ahead now!

1 comment

John G.
·
Just playing with the models today, 4-sonnet does very well with some multi-file code challenges as well. I'm loving it so far!
✅1
