In tomorrow's paper reading we'll be covering what's next for AI benchmarking. Srilakshmi C. & John G. will kick off the discussion by diving into "Humanity's Last Exam" (HLE), which was designed to be the final closed-ended academic benchmark of its kind, with broad subject coverage. The dataset consists of 2,700 challenging questions spanning more than a hundred subjects. They'll also cover some recent AI news. Sign up here if you're not already on the list: https://arize.com/resource/community-papers-reading/
Thanks to all who joined the Paper Reading on Benchmarks this morning! The recording will follow soon, but in the meantime I wanted to follow up on a question that was asked about benchmarks for AI safety. The o1 system card has some examples of safety benchmarking: a whole suite of benchmarks run on various datasets. I recommend checking it out if you're curious about how model safety is measured!
I actually just got the recording up here: https://youtu.be/m03ZMxbWmq0?si=393KsTrl1bh41IgY
