Questions on Using Phoenix for RAG System Benchmarking
Hi all! After starting to use Phoenix (specifically run_experiment()) to benchmark our agentic RAG system on a golden dataset, a few questions came up:
- 1.
We are storing the input parameters associated with an experiment (e.g. model ID, chunking strategy, system prompt) in the Phoenix experiment metadata. Is this what you would recommend, or is there a better place to put them? I'm also curious whether it is possible to diff the metadata to track changes between experiments?
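To make the diff question concrete, this is the kind of thing I have in mind — just plain Python over two metadata dicts, not any Phoenix API, and the metadata keys/values here are made up for illustration:

```python
def diff_metadata(old: dict, new: dict) -> dict:
    """Return keys whose values changed between two experiment metadata dicts,
    mapped to (old_value, new_value) pairs."""
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in keys
        if old.get(k) != new.get(k)
    }

# Hypothetical metadata for two experiment runs
exp_a = {"model_id": "gpt-4o", "chunk_size": 512, "system_prompt": "v1"}
exp_b = {"model_id": "gpt-4o", "chunk_size": 1024, "system_prompt": "v2"}

changes = diff_metadata(exp_a, exp_b)
# changes → {"chunk_size": (512, 1024), "system_prompt": ("v1", "v2")}
```

If Phoenix already surfaces something like this between two experiments in the UI or client, that would save us rolling our own.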
- 2.
Is there any functionality around labelling, i.e. being able to run an experiment over a subset of the examples in a dataset, or a subset of evaluators for certain examples? Perhaps the best way to handle this would be via the metadata, or via the dataset we upload?
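The workaround I can see is tagging examples ourselves and filtering before running — a sketch with hypothetical records (the "labels" field is our own convention, not a Phoenix feature):

```python
# Hypothetical golden-dataset records; "labels" is a field we would add
# ourselves, not something Phoenix defines.
examples = [
    {"question": "Q1", "labels": {"smoke", "multi-hop"}},
    {"question": "Q2", "labels": {"smoke"}},
    {"question": "Q3", "labels": {"long-context"}},
]

def subset(examples: list[dict], label: str) -> list[dict]:
    """Select only the examples carrying a given label."""
    return [ex for ex in examples if label in ex["labels"]]

# Run the experiment over just the "smoke" subset
smoke_set = subset(examples, "smoke")  # Q1 and Q2
```

But if labelling is supported natively (or via example metadata filters) I'd rather not maintain this.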
- 3.
What is the best way to get run_experiment results into Arize? I have been trying to piece together a dataframe with label, score, explanation, and span_id columns from the output of run_experiment (the RanExperiment class), but I currently can't find the span_id, and it is getting a bit complex, so I wondered if there is a more straightforward way. It would be nice to see what Arize offers: sorting experiments by decreasing evaluation metric score, applying other filters to results, graphing results over time, and so on, none of which I believe is possible in Phoenix. I also have the Phoenix experiment data persisted in a SQLite database, in case that simplifies getting the data into Arize.
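For reference, what I've been piecing together looks roughly like the sketch below. The run records are invented for illustration (this is not the actual RanExperiment schema), and span_id is a None placeholder because I haven't found where to source it:

```python
# Hypothetical evaluation results extracted from a RanExperiment; the field
# names here are illustrative, not the real Phoenix schema.
runs = [
    {"example_id": "ex-1", "label": "correct", "score": 1.0,
     "explanation": "Answer matches the gold answer."},
    {"example_id": "ex-2", "label": "incorrect", "score": 0.0,
     "explanation": "Missed the relevant source document."},
]

rows = [
    {
        "label": r["label"],
        "score": r["score"],
        "explanation": r["explanation"],
        "span_id": r.get("span_id"),  # unknown where to find this -> None
    }
    for r in runs
]
# rows can then be wrapped with pandas.DataFrame(rows) for upload to Arize
```

If there's a supported export path from Phoenix experiments to Arize that fills in span_id for me, that's really what I'm after.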
- 4.
For RAG evaluations, I'm displaying a JSON object with the question, answer, and source documents in the Phoenix UI as the output for each example in an experiment, since that seems necessary to feed all the information into the evaluators (given that the evaluator input and the task output displayed in the UI are the same thing). Is that best practice? It is slightly harder to digest by eye when looking over the results, but not a huge deal.
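Concretely, my task output looks like this sketch (the field names are just what I chose, nothing Phoenix mandates):

```python
import json

def task_output(question: str, answer: str, docs: list[str]) -> dict:
    """Bundle everything the evaluators need into one JSON-serializable output."""
    return {
        "question": question,
        "answer": answer,
        "source_documents": docs,
    }

out = task_output("What is RAG?", "Retrieval-augmented generation.", ["doc-1"])
rendered = json.dumps(out, indent=2)  # this blob is what shows in the Phoenix UI
```

So the evaluators get everything they need, at the cost of the UI showing the whole blob instead of just the answer.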
Thanks!
