Hello! This is probably a very stupid question as I am still trying to understand the concepts. Imagine I have a basic LLM chat app whose only customization is a system prompt. If I want to send traces to Phoenix, what is the best practice here: one span per message (but each message contains all previous messages), or one span per chat session where each child span is a question/answer from the LLM? I also assume the span kind should be LLM in either of these options, right?
For a basic LLM chat app, the best practice is to create one span per chat session, with each child span representing a question/answer interaction from the LLM. This approach lets you capture the entire conversation context while keeping detailed insight into each interaction. The span kind for these interactions should indeed be set to "LLM", since each one represents a call to an LLM, per the Phoenix documentation. Would you like more detail on how to implement this? Sources: - https://arize-ai.slack.com/archives/C04R3GXC8HK/p1724796536.922089 - https://github.com/Arize-ai/phoenix/pull/1410
💡 Hint: Mention RunLLM in the thread for followups.
I would probably use one of our auto-instrumentors for the LLM calls so you can always replay them later to do some "what if" testing. I would probably wrap a single interaction with your app in a single top-level span that has the user input and the LLM output. If you then have a conversational structure, I'd tie the interactions together via sessions. You can see the benefit of having a top-level span that represents the IO of the system here: https://phoenix-demo.arize.com/projects/UHJvamVjdDoz
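To make the suggested layout concrete, here's a toy sketch of the span structure: one top-level span per interaction carrying the system IO, an "LLM" child span for the model call, and a session id tying turns together. The `Span` class, `handle_message` function, and attribute names are illustrative stand-ins, not real OpenTelemetry/OpenInference code (the attribute keys mirror the OpenInference conventions, but check the repo for the exact names).

```python
# Illustrative sketch only: plain-Python stand-ins for spans, showing one
# top-level span per user interaction, an LLM child span, and a shared
# session id. Real code would use OpenTelemetry / OpenInference instead.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                        # e.g. "CHAIN" for the wrapper, "LLM" for the model call
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def handle_message(session_id: str, user_input: str, llm_output: str) -> Span:
    # One top-level span per interaction, carrying the app's input/output.
    root = Span(
        name="chat_turn",
        kind="CHAIN",
        attributes={
            "session.id": session_id,   # ties the turns of one conversation together
            "input.value": user_input,
            "output.value": llm_output,
        },
    )
    # Child span for the actual LLM call (span kind "LLM").
    root.children.append(
        Span(name="llm_call", kind="LLM",
             attributes={"input.value": user_input, "output.value": llm_output})
    )
    return root

turn = handle_message("session-123", "Hi!", "Hello, how can I help?")
```

With this shape, each conversation shows up in Phoenix as a series of top-level turns grouped by `session.id`, and you can drill into the LLM child span of any turn.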
This might be the example you are looking for https://github.com/Arize-ai/phoenix/tree/main/examples/manually-instrumented-chatbot
Thank you! 馃檱
I will take a careful look
Actually, don't manually instrument if possible 😅 The quickest time to value would be to just hook up one of these and add spans as you need them (e.g. when you want to view more details): https://github.com/Arize-ai/openinference
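For reference, hooking up an auto-instrumentor usually looks something like the snippet below (shown here for the OpenAI instrumentor with a local Phoenix; the exact package names and defaults are assumptions, so double-check the openinference README for your provider, e.g. Bedrock).

```python
# Sketch: wire an auto-instrumentor to a running Phoenix instance.
# Assumes `arize-phoenix-otel` and `openinference-instrumentation-openai`
# are installed; see the openinference repo for per-provider packages.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Sets up an OTel tracer provider pointed at Phoenix (local by default).
tracer_provider = register()

# From here on, calls made through the openai client are traced automatically.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

You can still create your own parent spans around these auto-generated LLM spans when you want to capture app-level input/output.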
We use this openinference and are doing it with Bedrock, but it doesn't look very good. Everything is flat and we lose custom attributes like the session_id.
Just to confirm my intuition regarding the evals: I'm assuming these "no LLM call" evals can probably run at the same time you send a trace to Phoenix, but for the more expensive LLM evals (like hallucination and so on) you do it manually by running some code every once in a while? Or do you do it for every LLM trace you send?
Oh interesting. I think this should be fixed in some of the later work with our tracers. Maybe Xander S. could take a look when he has a chance. We'd love to understand what's not working.
Yeah, we tend to recommend doing them in batches to decouple your runtime from the evaluation process. It's been a minute since I looked at this guide, but you can take a look at one strategy here: https://docs.arize.com/phoenix/evaluation/how-to-evals/online-evals
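The decoupling pattern boils down to something like this toy sketch: cheap "no LLM call" checks run inline when the trace is recorded, while expensive LLM-based evals run later over a batch. `expensive_llm_eval` here is a hypothetical stand-in for something like Phoenix's hallucination eval, which in real code you'd run via `phoenix.evals` against spans pulled from Phoenix.

```python
# Toy sketch of decoupling evals from the request path: cheap checks run
# inline at trace time, expensive LLM-based evals run later over a batch
# of collected traces (cron job, worker, etc.).
from collections import deque

pending = deque()  # traces waiting for the expensive evals

def cheap_eval(trace: dict) -> dict:
    # "No LLM call" checks are fast enough to run inline.
    return {"nonempty_output": bool(trace["output"])}

def record_trace(trace: dict) -> dict:
    scores = cheap_eval(trace)   # runs immediately, alongside sending the trace
    pending.append(trace)        # expensive evals are deferred
    return scores

def run_batch_evals(expensive_llm_eval) -> list:
    # Run periodically, not per-request, so eval latency/cost never
    # touches the user-facing path.
    results = []
    while pending:
        results.append(expensive_llm_eval(pending.popleft()))
    return results
```

Whether you batch-eval every trace or only a sample is then just a question of how you feed the batch job; the key point is that the expensive evals are off the request path.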
