Enhancements in Random Number Quiz for Improved Evals Performance
What we did:
We created a needle that is a question about a random number, newly drawn for every generation. This removes the risk of caching: the model has to get the number right each time.
We added a random city to the question, so the question changes as well.
The length of the random number, in digits, is selectable; we started with 7 digits.
We moved the Evals over to @ArizePhoenix for significant speed improvements. The GPT-4 test now runs in minutes vs the original 3 days.
We leveraged rails in @ArizePhoenix Evals, which search for the random number string in the output.
We added a negative case, unanswerable, for when the model doesn’t retrieve the result.
We also run a separate test for the negative case, showing how well the model knows it can’t retrieve the data.
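
To make the steps above concrete, here is a minimal sketch in plain Python of how a per-generation needle and a string-match check could look. It is not the Phoenix Evals API and not the exact code we ran: the city list, the needle and question wording, the UNANSWERABLE label, and the names make_needle and grade are all assumptions made for illustration.

```python
import random

# Hypothetical values: the real city list and prompt wording are not shown here.
CITIES = ["Paris", "Tokyo", "Nairobi", "Lima", "Oslo"]
UNANSWERABLE = "UNANSWERABLE"


def make_needle(num_digits=7, rng=None):
    """Build a fresh needle and question for one generation."""
    rng = rng or random.Random()
    city = rng.choice(CITIES)
    # Draw from [10**(n-1), 10**n - 1] so the number always has exactly n digits.
    number = str(rng.randint(10 ** (num_digits - 1), 10 ** num_digits - 1))
    needle = f"The special magic {city} number is: {number}"
    question = f"What is the special magic {city} number?"
    return {"city": city, "number": number, "needle": needle, "question": question}


def grade(output, number, needle_present=True):
    """Label a model output by searching it for the expected string."""
    if needle_present:
        # Positive case: the random number must appear verbatim in the output.
        return "correct" if number in output else "incorrect"
    # Negative case: no needle in the context, so the model should say so.
    return "correct" if UNANSWERABLE in output.upper() else "incorrect"
```

Because both the number and the city are redrawn on every call, a cached or memorized answer cannot pass; only an output containing the current number string is labeled correct. The negative case reuses the same check with needle_present=False.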
