Thanks to all who joined the Paper Reading on Benchmarks this morning!
Recording to follow soon, but in the meantime I wanted to follow up on a question that was asked about benchmarks for AI safety.
The o1 system card has some examples of "safety" benchmarking. There is a whole suite of benchmarks run on various datasets. I recommend checking it out if you are curious about how model safety is measured!