hi! I'm currently exploring RAG observability metrics, specifically MAP (Mean Average Precision) and MRR (Mean Reciprocal Rank). Is there any way I can leverage the repo below? https://github.com/Arize-ai/phoenix/blob/main/tutorials/evals/evaluate_rag.ipynb
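For reference, MAP and MRR can be computed directly from ranked relevance judgments, independent of any framework. A minimal pure-Python sketch (the function and variable names here are illustrative, not from the Phoenix notebook):

```python
# MAP and MRR from per-query relevance lists.
# Each query's retrieval is a list of booleans: whether the document
# at each rank (rank 1 first) is relevant.

def average_precision(relevances):
    """Average of precision@k over the ranks k where a relevant doc appears."""
    hits = 0
    precisions = []
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(relevances):
    """1 / rank of the first relevant document, or 0 if none is relevant."""
    for k, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / k
    return 0.0

def mean_average_precision(runs):
    """MAP: mean of average precision across queries."""
    return sum(average_precision(r) for r in runs) / len(runs)

def mean_reciprocal_rank(runs):
    """MRR: mean of reciprocal rank across queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)

# Example: two queries with relevance judgments per retrieved position.
runs = [
    [True, False, True],   # relevant at ranks 1 and 3
    [False, True, False],  # relevant at rank 2
]
print(mean_reciprocal_rank(runs))  # (1/1 + 1/2) / 2 = 0.75
```

The linked notebook computes these same metrics, but over relevance labels produced by an LLM evaluator rather than hand-annotated ones.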
hi Mikyo! this is me again. I'm finally able to retrieve contexts and document positions. I have exported my input, output, and reference to CSV and can show document positions as well. See the pipelines attached (context logging in steps 1, 2, 3). I want to use Phoenix for nDCG, hit rate, and precision@K (retrieval evals), response evals, and the px.Client() dashboard with the datasets from my CSVs. Is there any repo I can refer to? I checked the previous repo that I sent; it's getting the data from get_spans_dataframe(). Is there any way I can show my own dataframe? More context: this is Vertex AI with LangGraph. Would be great to hear feedback or questions if there are any!
Hey Namira S.! Thanks so much for the context. Super interesting use-case and would love to unblock you. I guess I'd start with the CSV as a dataset: yes, this will be supported, but it's coming via a new feature called datasets (https://github.com/Arize-ai/phoenix/issues/2017) - you will then be able to run experiments with these datasets to test changes to your application or retrieval. I think this might ultimately be what you are looking for? If so, would love to speak with you more and get some early feedback. Would you be open to this? In the meantime, however, the best way to evaluate RAG with Phoenix is to first trace your application. We currently don't have an example with LangGraph specifically, but we do have quite a few LangChain ones (https://docs.arize.com/phoenix/notebooks). You can then transform traces -> dataframes via queries (https://docs.arize.com/phoenix/tracing/how-to-tracing/extract-data-from-spans), and you can ultimately label the spans themselves by providing the metrics you calculate (https://docs.arize.com/phoenix/tracing/how-to-tracing/llm-evaluations). I recently gave a talk on this exact flow that might help (https://www.youtube.com/watch?v=vIKW-8YWCSg). Hope that makes sense. Thanks again for sharing your project!
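The "calculate your own metrics, then label the spans" flow above can be sketched roughly as follows. The metric computation is plain Python; the Phoenix calls at the bottom are left as comments and assume the SpanEvaluations API described in the linked how-to guide, so check that page for the exact signature. The span ids here are made up for illustration.

```python
# Sketch: compute retrieval metrics per retriever span, then shape them
# into one eval row per span id, ready to attach back to the traces.

def precision_at_k(relevances, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = relevances[:k]
    return sum(top_k) / k if top_k else 0.0

def hit_at_k(relevances, k):
    """1.0 if any of the top-k documents is relevant, else 0.0."""
    return 1.0 if any(relevances[:k]) else 0.0

# Per-span relevance judgments, keyed by span id (illustrative ids;
# in practice these come from your traces / get_spans_dataframe()).
relevance_by_span = {
    "span-abc": [True, False, True, False],
    "span-def": [False, False, True, True],
}

# One eval row per span: a mapping from span id to a score, which is
# the shape Phoenix expects (a dataframe indexed by context.span_id).
eval_rows = {
    span_id: {"score": precision_at_k(rels, k=2)}
    for span_id, rels in relevance_by_span.items()
}
print(eval_rows["span-abc"]["score"])  # 2 retrieved, 1 relevant -> 0.5

# With Phoenix (not run here; see the llm-evaluations how-to above):
# import pandas as pd
# import phoenix as px
# from phoenix.trace import SpanEvaluations
# df = pd.DataFrame.from_dict(eval_rows, orient="index")
# df.index.name = "context.span_id"
# px.Client().log_evaluations(SpanEvaluations(eval_name="precision@2", dataframe=df))
```

Once logged, the scores show up alongside the corresponding spans in the Phoenix UI, which covers the "visualize by trace" part of the question.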
Amazing! LMK
Hi Mikyo, I reviewed the description of the feature on GitHub (issue #2017) and it aligns perfectly with my needs. Currently, I can calculate retrieval metrics (precision, hit rate, etc.) on the fly using traces with Vertex AI. I'm now exploring response evaluation metrics (hallucination and QA relevance). Both retrieval and response results are expected in CSV format. My question is: can Phoenix be used to visualize these results in aggregate and by trace, with my own dataset? I don't mind trying the dataset feature and giving feedback. Attached are the CSV's columns, mapped tentatively to Phoenix definitions:
- QID = context span id
- context filtered = reference
- execution time = latency
- metrics = tuples consisting of hit, precision, and other retrieval metrics
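A small sketch of that column mapping in code: reading the exported CSV and renaming its columns to Phoenix-style names. The source column names below are guesses based on the list above (the real header may differ), and the target names follow the span-id/reference/latency convention from the Phoenix docs.

```python
# Rename exported CSV columns to Phoenix-style names.
# COLUMN_MAP keys are assumed source column names; adjust to the real file.
import csv
import io

COLUMN_MAP = {
    "QID": "context.span_id",
    "context_filtered": "reference",
    "execution_time": "latency",
}

# Stand-in for the real exported file; values are illustrative.
sample_csv = io.StringIO(
    "QID,context_filtered,execution_time\n"
    "span-abc,retrieved doc snippet,120\n"
)

rows = [
    {COLUMN_MAP.get(col, col): value for col, value in row.items()}
    for row in csv.DictReader(sample_csv)
]
print(rows[0]["context.span_id"])  # span-abc
```

From here the renamed rows can be loaded into a dataframe and joined against the spans by span id.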
