Newcomer Seeking Advice on Evaluating LLM Chatbots in Phoenix

·Dec 08, 2023 09:18 PM

Hello everyone, I'm new to this community! I'm building an LLM-based chatbot for customer service support, backed by a private knowledge base. I have two questions: 1. What are the best approaches to evaluating a chatbot end to end? 2. How can I do this with Phoenix? (I'm having trouble building my golden dataset.) Could someone help me, please?

3 comments


Aparna D.
·
Hey Victor P., for your customer service bot, are you using retrieval? If so, I would recommend these steps:
1. Create questions on your docs.
   a. We recommend creating them by hand, but you can also try to automate this. Ideally about 50-200 questions.
   b. If you automate question generation, hand-check that the questions make sense.
2. (Optional) Hand-create ground-truth answers to the questions above.
3. Choose a framework. LlamaIndex is a solid choice, but LangChain and rolling your own are also common.
4. Choose default settings for the retriever, chunk size, and k. We used benchmark scripts to pick these (here).
   a. A solid, simple starting point: a basic retriever, chunk size 500, and k=6.
5. Choose evals to run. We recommend:
   a. Q&A, to evaluate whether the question was answered correctly based on the relevant data retrieved. You'll need your reference and your query.
   b. Retrieval (chunk level), to evaluate whether what you're retrieving is relevant: https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/retrieval-rag-relevance
   c. Human vs AI (optional), if you have the ground truth above.
You can use your previous chat messages and relevant information as part of the reference. If you tell us which framework you are using (LlamaIndex or LangChain), we can point you to an example notebook. Happy to jump on a call, too, to talk about your application!
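To make the steps above concrete, here is a minimal, framework-free sketch of a golden dataset plus a chunk-level retrieval check. It is not Phoenix's or LlamaIndex's API: `GoldenExample`, `chunk_text`, `retrieve`, and `hit_rate` are hypothetical names, the retriever is a toy word-overlap ranker, and chunk size is counted in characters because the thread does not say which unit it means. In practice you would swap in your framework's retriever and run the Phoenix evals on its output.

```python
from dataclasses import dataclass

CHUNK_SIZE = 500  # assumption: characters per chunk (the thread only says "500")
TOP_K = 6         # number of chunks retrieved per question, as suggested above


@dataclass
class GoldenExample:
    """One hand-written (or hand-checked) question from the golden dataset."""
    question: str
    expected_phrase: str  # hypothetical field: text a relevant chunk should contain


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def retrieve(question: str, chunks: list[str], k: int = TOP_K) -> list[str]:
    """Toy retriever: rank chunks by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(golden: list[GoldenExample], chunks: list[str], k: int = TOP_K) -> float:
    """Fraction of golden questions whose top-k chunks contain the expected phrase."""
    hits = sum(
        any(ex.expected_phrase.lower() in c.lower() for c in retrieve(ex.question, chunks, k))
        for ex in golden
    )
    return hits / len(golden)


if __name__ == "__main__":
    # Made-up knowledge base and questions, for illustration only.
    knowledge_base = [
        "Refunds are processed within 5 business days.",
        "Our support line is open from 9am to 5pm.",
        "Shipping is free for orders over $50.",
    ]
    golden_set = [
        GoldenExample("How long are refunds processed?", "5 business days"),
        GoldenExample("When is the support line open?", "9am to 5pm"),
    ]
    print(f"hit rate @ k=1: {hit_rate(golden_set, knowledge_base, k=1):.2f}")
```

A substring match is a crude stand-in for relevance. The retrieval eval linked above uses an LLM to judge each retrieved chunk instead, which handles paraphrases that a phrase check would miss.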
Victor P.
·
Thank you very much for such good recommendations, Aparna D.! Yes, I'm using retrieval. I built my bot using LlamaIndex's OpenAIAgent with a query engine. The bot works well in general, but there are some clear points for improvement, which I believe prompt engineering will resolve. Before moving on to improvements, I need to focus on an evaluation system. Do you have any specific guidance or references for my case? Thank you in advance for your generosity in sharing knowledge.
Shadrack D.
·
Hello, nice to be here in this awesome community. I am trying to build a service that uses Llama 2 7B to extract specific insights and summaries from a conversation transcript between two people. How can I successfully deploy this model and serve it through FastAPI? I would truly appreciate guidance from anyone with the experience or knowledge. I specifically want to know how to do this using 1) SageMaker and 2) my own instance that handles everything.
