Measuring Sensitivity and Variability in Model Prompts
One thought I had from the above, which I don't see the evaluation community discussing much, is how to measure and think about the sensitivity of models to small variations in prompts. Some models show high variation in evaluation score under small prompt changes, while others are more robust and perform similarly when the prompt changes a small amount. One option would be to report min and max ranges, or the variance of scores, across a set of prompt variants. This variability can be problematic for certain use cases, and for deployment in general. We need metrics, or at least standard ways of reporting, for this prompt sensitivity and variance on specific tasks.
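As a rough illustration, here is a minimal sketch of what such a report could look like: given several paraphrases of the same prompt and a scoring function, it computes the mean, variance, and min/max range of the evaluation score across variants. The paraphrases and the fake_evaluate stub are hypothetical placeholders; a real harness would substitute its own model call and test set.

```python
import statistics

def prompt_sensitivity(prompt_variants, evaluate_prompt):
    """Score each paraphrase of a prompt and summarize the spread.

    prompt_variants: list of semantically equivalent prompt strings.
    evaluate_prompt: callable mapping a prompt string to an evaluation
        score (e.g. accuracy on a fixed test set).
    """
    scores = [evaluate_prompt(p) for p in prompt_variants]
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.variance(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),  # the min/max spread suggested above
    }

# Hypothetical usage: three paraphrases of the same task prompt.
variants = [
    "Summarize the following article in one sentence:",
    "Provide a one-sentence summary of the article below:",
    "In a single sentence, summarize this article:",
]

def fake_evaluate(prompt):
    # Stand-in for a real evaluation run: returns a deterministic
    # pseudo-score derived from the prompt text so the example runs.
    return 0.70 + (len(prompt) % 7) / 100.0

print(prompt_sensitivity(variants, fake_evaluate))
```

One design note: the sample variance alone can understate the risk, since a single badly performing paraphrase may matter more than the average spread, so for deployment decisions the min score across variants is often the more telling number to report.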
