Anyone else listen to Nathan Lambert and Dylan Patel break down the R1 paper? One takeaway I found interesting: reasoning models seem to perform much better when preference learning (RLHF) is applied as post-training on specific tasks (around the 2:57 mark). If that holds, there's probably a lot of gain to be had from fine-tuning reasoning models for specific tasks. Curious if others are exploring fine-tuned reasoning models?
https://youtu.be/_1f-o0nqpEI?si=9DeLaIRw6654Gmx4
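
For anyone wondering what task-specific preference tuning could look like in practice, here's a rough sketch of the DPO loss in plain PyTorch. To be clear, this is just one common preference-learning recipe, not necessarily the method discussed in the episode, and the `dpo_loss` helper and the log-prob numbers below are made-up placeholders for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: push the fine-tuned policy to prefer the chosen
    completion over the rejected one, measured relative to a frozen
    reference model so the policy doesn't drift too far from it."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy summed per-sequence log-probs for 4 preference pairs (placeholders).
policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
policy_rejected = torch.tensor([-14.0, -10.5, -15.0, -13.2])
ref_chosen = torch.tensor([-13.0, -10.0, -14.8, -11.5])
ref_rejected = torch.tensor([-13.5, -10.2, -15.2, -12.9])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The appeal for task-specific work is that you only need pairs of preferred vs. rejected outputs on your own task, no separate reward model, so it's a fairly cheap way to test whether preference post-training actually moves the needle for a fine-tuned reasoning model.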