Insights on Claude V2 Evals: Challenges and Variance Observed
Evals - Retrieval Relevance: https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/retrieval-rag-relevance

We've been testing Evals on Claude V2 and are curious how many people here have substantial experience with Claude. What we're seeing: there is still a sizable gap versus GPT-4 on specific Eval tasks, and on more complex tasks like retrieval relevance evaluation in particular, Claude V2 is falling short.

We're also seeing some very strange variance in results: if we repeat the same prompt twice, we get better Eval results. Happy to share the failing prompts with anyone interested on the research side.
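For context, here's a minimal sketch of the kind of run we're describing, using Phoenix's pre-tested retrieval relevance eval against Claude V2 and repeating it to look at run-to-run variance. It assumes the `phoenix.evals` API (`llm_classify`, `AnthropicModel`, `RAG_RELEVANCY_PROMPT_TEMPLATE` / `RAG_RELEVANCY_PROMPT_RAILS_MAP`) as documented at the link above; exact import paths, the model name string, and the example dataframe are illustrative and may differ by Phoenix version, and this is not our exact harness.

```python
# Sketch: Phoenix retrieval relevance eval on Claude V2, run twice to
# compare labels across runs. Imports/columns follow the Phoenix docs
# linked above and are assumptions, not our production setup.
import pandas as pd
from phoenix.evals import (
    AnthropicModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Query / retrieved-document pairs to judge for relevance.
# The relevance template expects "input" (query) and "reference" (document) columns.
df = pd.DataFrame(
    {
        "input": ["How do I reset my password?"],
        "reference": ["To reset your password, open Settings > Security and choose Reset."],
    }
)

model = AnthropicModel(model="claude-2", temperature=0.0)  # model name is illustrative
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())  # allowed output labels for the template

# Run the identical eval twice and compare the labels between runs.
for run in range(2):
    results = llm_classify(
        dataframe=df,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        model=model,
        rails=rails,
    )
    print(f"run {run}: {results['label'].tolist()}")
```

Even at temperature 0 we see label flips between the two runs on some of the failing prompts, which is the variance we'd like other folks to sanity-check.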
