Extra Spicy Morning Paper Review: New Benchmark Challenges LRMs
👋 This morning's paper read will be served Extra Spicy by Dylan C. and Parth S. 🔥 A new paper from researchers at Apple challenges today’s evaluation methods and introduces a new benchmark: synthetic puzzles with controllable complexity and clean logic. Their findings? LRMs show surprising failure modes, including a complete collapse on high-complexity tasks and a decline in reasoning effort as problems get harder. BUT THEN someone at Anthropic published a response, aptly titled The Illusion of the Illusion of Thinking, which argues that those findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Here's a direct link to join at 10:00am PT 👉 https://arize.zoom.us/j/89593430181
