We are at the Anyscale office tomorrow (Tuesday) and will be presenting a ton of research around retrieval and haystack testing. The latest results, which went out today, show that generation on top of retrieval can have widely varying results depending on how much simple math you are doing in the retrieval step and whether your prompt template shows its work. https://x.com/aparnadhinak/status/1757073620612923785?s=20
Awesome stuff as always! It looks like Claude is more verbose while GPT-4 is more succinct and needed some CoT guidance. I've been curious whether any of this is also affected by the widely reported "laziness" of GPT-4, i.e. the model being less verbose, leading to less thorough responses and thus worse actual performance in real use cases. It's been fairly hard to measure and quantify the actual effect of everyone saying the model has been acting "lazy" with its responses. But I was looking at the code you shared and noticed you used "gpt-4-1106-preview" instead of "gpt-4-0125-preview", which OpenAI released and claimed reduces cases of "laziness", so I'm curious whether it performs any better on the same test. My rough intuition is that there's a lot more to it than just less verbose responses leading to worse basic CoT, since people are having it flat-out refuse requests and such. Since the code is so simple, I might just rerun the test for fun on other models, but I was wondering if you also tested the 0125 preview and noticed anything different, or if there was a specific reason for going with the older 1106 preview aside from having run the test in the past?
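If anyone else wants to rerun it, a minimal sketch of the comparison could look like the below. The two model names are the ones from this thread; the prompt, helper names, and settings are just placeholders I made up, not the actual test harness:

```python
# Sketch: send the identical prompt to both GPT-4 previews and compare
# the responses. Requires `pip install openai` and OPENAI_API_KEY set
# to actually run the API calls.
import os

# Model names discussed in the thread.
MODELS = ["gpt-4-1106-preview", "gpt-4-0125-preview"]

def build_request(model: str, prompt: str) -> dict:
    """Build an identical chat-completion payload per model so the only
    variable in the comparison is the model itself."""
    return {
        "model": model,
        "temperature": 0,  # keep runs as comparable as possible
        "messages": [{"role": "user", "content": prompt}],
    }

def run_comparison(prompt: str) -> dict:
    """Return {model: response text} for each model in MODELS."""
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    out = {}
    for model in MODELS:
        resp = client.chat.completions.create(**build_request(model, prompt))
        out[model] = resp.choices[0].message.content
    return out

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    # Placeholder prompt; swap in the actual haystack/retrieval test prompt.
    results = run_comparison("What is 17 * 23? Show your work.")
    for model, answer in results.items():
        print(f"--- {model} ({len(answer)} chars) ---")
        print(answer)
```

Response length per model is a crude but quick proxy for the verbosity/"laziness" question; you'd still want to score correctness separately.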
