RAG Relevance Evals: which models to use
For task-specific Evals, we are finding that in some cases only GPT-4 is usable as a judge. The Eval judge should be your largest, most capable model.
In our testing of Claude V2 on RAG Relevance Evals, it struggles quite a lot.
https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/retrieval-rag-relevance
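As a rough sketch of the LLM-as-judge pattern these Evals rely on (a hypothetical illustration, not the Phoenix API itself): the judge model receives a question and a retrieved chunk, and must answer on a fixed set of rails such as "relevant" / "irrelevant". Passing the judge in as a callable makes it easy to swap in your largest model. All names here are illustrative.

```python
# Minimal sketch of an LLM-as-judge relevance eval. The judge model is
# passed in as a callable so a large model (e.g. GPT-4) can be plugged in.
# Function and template names are illustrative, not from any library.
from typing import Callable, List

RAILS = ("relevant", "irrelevant")

PROMPT_TEMPLATE = (
    "You are comparing a reference text to a question.\n"
    "Question: {query}\n"
    "Reference: {chunk}\n"
    "Respond with a single word, either 'relevant' or 'irrelevant'."
)

def judge_relevance(
    query: str,
    chunks: List[str],
    judge: Callable[[str], str],
) -> List[str]:
    """Label each retrieved chunk, snapping the raw output onto the rails."""
    labels = []
    for chunk in chunks:
        raw = judge(PROMPT_TEMPLATE.format(query=query, chunk=chunk))
        label = raw.strip().lower()
        # Anything off-rails is flagged rather than silently accepted;
        # weaker judge models often fail exactly here.
        labels.append(label if label in RAILS else "unparseable")
    return labels

if __name__ == "__main__":
    # Stub judge standing in for a real GPT-4 call.
    def fake_judge(prompt: str) -> str:
        return "relevant" if "Paris" in prompt else "irrelevant"

    print(judge_relevance(
        "What is the capital of France?",
        ["Paris is the capital of France.", "Bananas are yellow."],
        fake_judge,
    ))
```

The rails check is also why judge-model choice matters: smaller models tend to drift off the allowed labels or hedge, which shows up as unparseable outputs in the eval.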