https://docs.arize.com/phoenix/llm-evals/quickstart-retrieval-evals
What we recommend for RAG (with a code sketch after this list) is:
Retrieval Eval (Chunk Level) - Is each retrieved chunk relevant to the query? These chunk-level relevance labels drive NDCG/MRR for retrieval evaluation
Q&A Eval - Did you answer the question correctly?
Hallucination - Did you make up information in the answer?
Human vs AI - If you have ground truth answers, this comparison can be very helpful for tuning your system
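Below is a minimal sketch of wiring up the first three of these checks with `llm_classify` from `phoenix.evals` and deriving MRR from the chunk-level labels. The example rows, the column names (`input`, `reference`, `output`), the `gpt-4o` model choice, and the exact rail strings are assumptions; template variable names and rails maps can differ between Phoenix versions, so check the templates shipped with your install.

```python
# A minimal sketch, assuming the phoenix.evals package (arize-phoenix-evals)
# and an OPENAI_API_KEY in the environment. Example data, column names, and
# the gpt-4o model choice are illustrative assumptions.
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model="gpt-4o", temperature=0.0)

# Retrieval Eval (chunk level): one row per (query, retrieved chunk) pair,
# with chunks listed in the order the retriever returned them.
retrieval_df = pd.DataFrame(
    {
        "input": ["What is Phoenix?", "What is Phoenix?"],
        "reference": [
            "Phoenix is an open-source tool for LLM observability and evals.",
            "Bananas are a good source of potassium.",
        ],
    }
)
relevance_evals = llm_classify(
    dataframe=retrieval_df,
    model=model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# Turn the chunk-level labels into a ranking metric such as MRR per query.
# The "relevant" rail string is an assumption; use whatever your rails map emits.
retrieval_df["relevant"] = (relevance_evals["label"] == "relevant").astype(int)

def reciprocal_rank(labels):
    # labels: 0/1 relevance flags in retrieval order
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

mrr_per_query = retrieval_df.groupby("input")["relevant"].apply(
    lambda s: reciprocal_rank(s.tolist())
)

# Q&A Eval and Hallucination Eval: one row per (query, retrieved context, answer).
qa_df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source tool for LLM observability and evals."],
        "output": ["Phoenix is an open-source LLM observability and eval tool."],
    }
)
qa_evals = llm_classify(
    dataframe=qa_df,
    model=model,
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
)
hallucination_evals = llm_classify(
    dataframe=qa_df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
)
```

Each `llm_classify` call returns a dataframe of labels (plus explanations when requested) aligned to the input rows, so the results can be joined back onto your queries, chunks, or traces.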
We have focused on testable metrics that we know work and give signal. There are other content Evals out there, and you can add a custom template to tackle something specific to your use case (see the sketch below); custom templates just won't be vetted or tested the way the built-in ones are.
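As a hedged sketch of what a custom template can look like, assuming your Phoenix version's `llm_classify` accepts a plain-string template: the template text, the rails, the column names, and the reuse of `qa_df` and `model` from the sketch above are all illustrative assumptions, not a vetted Eval.

```python
# An unvetted custom eval sketch: a tone check over question/response pairs.
# The {input}/{output} variables must match columns in the dataframe you pass.
CUSTOM_TONE_TEMPLATE = """
You are evaluating the tone of a response to a user.
[Question]: {input}
[Response]: {output}
Is the response polite and professional? Answer with a single word,
either "polite" or "impolite".
"""

tone_evals = llm_classify(
    dataframe=qa_df,
    model=model,
    template=CUSTOM_TONE_TEMPLATE,
    rails=["polite", "impolite"],
)
```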