The ndcg and precision metrics are calculated based on the score column in the evaluation data. If the score column is missing or not correctly populated, these metrics will not be computed. According to the trace_eval_ingestion_testing notebook, you need to ensure that the score column is correctly set. For example, you can convert labels to scores like this:

```python
retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)
```

Would you like more detail on how to ensure the score column is correctly populated or how to compute these metrics manually?

Sources:
- trace_eval_ingestion_testing notebook
- Slack conversation
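
As an illustration of computing these metrics manually, here is a minimal sketch over hypothetical data (the `query_id`, `rank`, and `label` columns and their values are assumptions for the example, not the notebook's actual schema; the ndcg calculation uses scikit-learn's `ndcg_score`):

```python
import pandas as pd
from sklearn.metrics import ndcg_score

# Hypothetical evaluation frame: one row per retrieved document.
retrieved_documents_eval = pd.DataFrame({
    "query_id": [0, 0, 0, 1, 1, 1],
    "rank":     [1, 2, 3, 1, 2, 3],   # retrieval order, 1 = top hit
    "label":    ["relevant", None, "irrelevant",
                 "irrelevant", "relevant", "relevant"],
})

# Convert labels to binary scores; rows with a missing label stay NaN.
retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()]
    == "relevant"
).astype(int)

results = {}
for qid, group in retrieved_documents_eval.groupby("query_id"):
    ordered = group.sort_values("rank")
    relevance = ordered["score"].fillna(0).tolist()
    # Precision over the retrieved set: fraction of relevant documents.
    precision = sum(relevance) / len(relevance)
    # ndcg compares true relevance against the retrieval order
    # (negated rank so that a higher score means ranked earlier).
    ndcg = ndcg_score([relevance], [[-r for r in ordered["rank"]]])
    results[qid] = {"precision": precision, "ndcg": ndcg}
```

Note that in scikit-learn's implementation, ndcg comes out as 0 whenever every true relevance in a query group is 0 (or NaN treated as 0), so an all-zero or unpopulated score column yields 0 for every context regardless of how good the retrieval order is.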
💡 Hint: Mention RunLLM in the thread for followups.
Oh, no issues then. But tell me one thing: how are they getting calculated? Because it's 0 for all the contexts, and I'm sure that the context it's fetching is correct.
P.S. Can I use gpt-3 instead of gpt-4 for relevance evaluations?