Challenges of LLMs in Estimating Probabilities and Continuity
The research above on LLM score Evals really struck a chord, with over 125k Twitter views. Anecdotally, I've always had trouble getting LLMs to estimate probabilities. I think that was one of the drivers for the testing: a feeling that LLMs really struggle to output continuous numbers. They struggle so much that it feels like a very open problem, similar to the problems around arithmetic. I'm seeing a lot of papers whose results rest on Evals with numeric scores. I don't think this nullifies all of those results, but someone is definitely going to apply score-based Evals without knowing the limitations, and we will see incorrect conclusions. In production deployments I would keep it simple: a handful of classes instead of a continuous numeric range, at least for now.
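
To make the "handful of classes" suggestion concrete, here is a minimal Python sketch of what that could look like. Everything named here is an assumption for illustration only: the label set `LABELS`, the prompt wording, and the helpers `build_prompt` and `parse_label` are hypothetical and not taken from the research above. The idea is just to ask the model for one label from a short fixed list and to reject anything else, including a numeric score.

```python
from typing import Optional

# Hypothetical label set; the post only recommends "a handful of classes".
LABELS = ("poor", "fair", "good", "excellent")


def build_prompt(answer: str) -> str:
    """Build a grading prompt that asks for exactly one label, not a number."""
    options = ", ".join(LABELS)
    return (
        "Rate the quality of the following answer.\n"
        f"Reply with exactly one of: {options}.\n\n"
        f"Answer:\n{answer}\n"
    )


def parse_label(response: str) -> Optional[str]:
    """Return the label if the response is one of LABELS, otherwise None."""
    # Trim whitespace and common trailing punctuation or quotes, then lowercase.
    cleaned = response.strip().strip(".!\"'").lower()
    return cleaned if cleaned in LABELS else None


if __name__ == "__main__":
    print(build_prompt("The capital of France is Paris."))
    print(parse_label(" Good. "))  # a valid label after cleanup
    print(parse_label("0.73"))     # a continuous score is rejected
```

Returning `None` for anything off-list is a deliberate choice in this sketch: an unparseable reply then gets retried or logged instead of being silently coerced into a score.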
