If you have human ground truth answers, the human vs AI eval we've found is very solid at helping you do that next level of tuning of your retrieval system, but it is not needed to get your first-level metrics.
Thanks a lot for your super valuable answers!
Jason, just one more question! I just went through this Google Colab you have available in Phoenix for the human vs AI eval use case: https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/evals/evaluate_human_vs_ai_classifications.ipynb#scrollTo=4FEeWN8miIcn and I am not understanding the meaning of the "true" label in the initial data: https://storage.googleapis.com/arize-assets/phoenix/evals/human_vs_ai/human_vs_ai_classifications.csv What do "False" and "True" stand for here? The columns question, correct_answer, ai_generated_answer and ai_answer are the important ones, so why is there also a true_label? According to the notebook I need it for the confusion matrix, but I am not understanding what it means... Thanks a lot in advance!
The true label won't be used in your production deployment, but it is used to validate the performance of the Eval itself. The test dataset is hand created with some "incorrect AI responses" and some "correct AI responses", and it is used to test how well the Eval performs on a hand-curated dataset. That way, when you put the Eval in production you know how well it will work, OR if you want to make template changes you have a dataset to confirm you didn't break anything.
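To make that concrete, here is a minimal sketch of how a true_label column lets you score the Eval itself. The rows and label names below are illustrative (not taken from the actual CSV): each row pairs the Eval's verdict with the hand-labeled ground truth, and from the resulting confusion matrix you get precision/recall for the Eval.

```python
# Sketch: scoring an LLM eval against hand-labeled ground truth.
# The rows and labels below are illustrative, not from the real dataset.

# Each row: (eval_label, true_label) for one AI-generated answer.
rows = [
    ("correct", "correct"),
    ("incorrect", "incorrect"),
    ("correct", "incorrect"),   # eval mistakenly passed a bad answer
    ("incorrect", "incorrect"),
    ("correct", "correct"),
]

# Build a 2x2 confusion matrix, treating "incorrect" as the positive
# class (the eval's job is to catch bad AI answers).
tp = sum(1 for e, t in rows if e == "incorrect" and t == "incorrect")
fp = sum(1 for e, t in rows if e == "incorrect" and t == "correct")
fn = sum(1 for e, t in rows if e == "correct" and t == "incorrect")
tn = sum(1 for e, t in rows if e == "correct" and t == "correct")

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=1.00 recall=0.67
```

If you then change the eval prompt template, you can rerun this scoring against the same hand-labeled set to confirm precision/recall didn't regress.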
