Perfect content, Aparna D.! Thank you very much! Would you have any guidance on how to conduct LLM evals for a customer service bot? I couldn't see how standard Q&A templates could be applied. Could you give me some idea about it? Should I include previous chat messages and relevant information in the "context" used to answer the questions? Are you thinking about including a template for this?
Hey Victor P., thanks for the note! For your customer service bot, are you using retrieval? If so, I'd recommend these steps:
Create questions on your docs
We recommend hand-creating them, but you can also try automating question generation. Aim for roughly 50-200.
If you automate question generation, hand-check that the questions make sense
(Optional) Hand-create ground-truth answers to the above questions
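If you do automate question generation, a minimal sketch of that loop might look like the below. Note this is illustrative: the prompt template and helper names are made up, and the actual LLM call is left for you to plug in with your own client.

```python
import random

# Hypothetical prompt template for generating one eval question per doc chunk.
# The LLM call itself is stubbed out -- wire in whatever client you use.
QUESTION_GEN_TEMPLATE = (
    "You are building an eval set for a customer service bot.\n"
    "Write one question a customer might ask that is answered by the text below.\n"
    "Text:\n{chunk}\n"
    "Question:"
)

def build_question_prompts(chunks):
    """Return one question-generation prompt per doc chunk."""
    return [QUESTION_GEN_TEMPLATE.format(chunk=c) for c in chunks]

def sample_for_hand_check(questions, n=20, seed=0):
    """Randomly sample generated questions so a human can sanity-check them."""
    rng = random.Random(seed)
    return rng.sample(questions, min(n, len(questions)))
```

The sampling helper is there because of the hand-check step above: you rarely need to read all 200 generated questions, a fixed-seed sample is usually enough to catch a bad generation prompt.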
Choose a framework - LlamaIndex is a solid choice, but LangChain and rolling your own are also common
Choose default settings for your retriever, chunk size, and k - we used benchmark scripts to pick these (here)
A solid, simple choice to start -> a basic retriever, chunk size 500, and k=6
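To make the chunk-size-500 / k=6 defaults concrete, here's a toy sketch in pure Python. In practice you'd use your framework's retriever with real embeddings; the character-based chunking and bag-of-words cosine similarity here are just for illustration.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=500):
    """Split a document into fixed-size character chunks (500 chars here)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=6):
    """Return the k chunks most similar to the query (toy stand-in for a retriever)."""
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

The chunks that `retrieve` returns are what you'd feed into the evals below as the "reference" alongside the query.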
Choose evals to run, we recommend:
Q&A, to evaluate whether the question was answered correctly based on the relevant data retrieved. You'll need your reference and your query.
Retrieval (chunk level), to evaluate whether what you're retrieving is relevant: https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/retrieval-rag-relevance.
Human vs. AI (optional), if you have the ground-truth answers from above
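As a rough sketch of what a Q&A eval looks like under the hood: you fill a judge prompt with the query, the retrieved reference, and the bot's answer, then parse the judge LLM's label. The template below is made up for illustration (it is not Phoenix's actual pre-tested prompt), and the judge-LLM call itself is left out.

```python
# Hypothetical Q&A-correctness judge prompt; swap in your eval library's
# pre-tested template for real runs.
QA_EVAL_TEMPLATE = (
    "You are evaluating a customer service bot.\n"
    "Question: {query}\n"
    "Reference data (retrieved chunks, prior chat messages): {reference}\n"
    "Bot answer: {answer}\n"
    "Is the answer correct given the reference? Respond with exactly one word:\n"
    "correct or incorrect."
)

def build_qa_eval_prompt(query, reference, answer):
    """Fill the judge prompt with one example's query, reference, and answer."""
    return QA_EVAL_TEMPLATE.format(query=query, reference=reference, answer=answer)

def parse_label(llm_output):
    """Map the judge LLM's raw output to a binary label; None if unparseable."""
    text = llm_output.strip().lower()
    if text.startswith("incorrect"):
        return "incorrect"
    if text.startswith("correct"):
        return "correct"
    return None
```

This is also where your earlier question fits: previous chat messages go into `reference` right alongside the retrieved chunks.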
You can use your previous chat messages and relevant information as part of the reference. If you let us know which framework you're using (LlamaIndex or LangChain), we can point you to an example notebook. Happy to jump on a call too to talk through your application!
