We've been using non-streaming LLM calls up until now and haven't seen the following issue, which appeared with the new streaming calls we've added. We have a nested tracing structure (start_as_current_span calls inside other start_as_current_span calls) with one root parent span that wraps all code. The streaming LLM calls' spans get orphaned and don't connect to the nested trace, while non-streaming LLM spans continue to be correctly associated with the nested trace spans. Are there any best practices to follow when trying to log streaming LLM spans correctly?
It seems likely that the orphaned streaming LLM spans are caused by context propagation failing in asynchronous calls. My understanding is that when streaming, the context might not be automatically propagated, causing spans to become orphaned. According to this Slack discussion, you may need to manually set the span context in asynchronous calls to ensure they are correctly associated with the parent span. Would you like more detail on how to implement this?
Sources:
- Slack Discussion
💡 Hint: Mention RunLLM in the thread for followups.
To address the issue of orphaned streaming LLM spans within the same process, you need to ensure that the context is correctly propagated during asynchronous operations. This is crucial because asynchronous calls can lose the context if not explicitly managed. Here's a step-by-step approach based on the Slack discussion and manual instrumentation guide:
1. Start a Root Span: Begin by creating a root span using start_as_current_span. This will serve as the parent span for all subsequent operations.
2. Propagate Context in Async Calls: When dealing with asynchronous operations, especially streaming, manually propagate the context by setting the span in the context before making the async call and ensuring that context is used within the async function.
3. Use Context Managers: Utilize context managers to maintain the span context across asynchronous boundaries. This ensures that the span remains active and associated with the correct parent span.
4. End Spans Appropriately: Ensure that spans are ended correctly after the operation completes, including when an exception occurs (note that start_as_current_span used as a context manager ends the span for you, even on error).
Here's a code snippet to illustrate this:
```python
import asyncio

from opentelemetry import trace
from opentelemetry.context import attach, detach, get_current

tracer = trace.get_tracer(__name__)

async def streaming_operation():
    # Start a new span for the streaming operation
    with tracer.start_as_current_span("streaming_operation"):
        # Propagate the context manually
        token = attach(get_current())
        try:
            # Perform the streaming operation
            async for data in stream_data():  # stream_data: your streaming source
                pass  # Process data
        finally:
            # Detach the context to ensure proper cleanup;
            # the context manager ends the span, even if an error occurs
            detach(token)

async def main_workflow():
    # Start the root span
    with tracer.start_as_current_span("root_workflow"):
        # Call the streaming operation
        await streaming_operation()

# Run the main workflow
asyncio.run(main_workflow())
```
This approach ensures that the context is correctly propagated and maintained across asynchronous calls, preventing orphaned spans.
Sources:
- Slack Discussion
- manual instrumentation guide
RunLLM what if i'm using Litellm.completion and OpenAiInstrumentor (auto instrumentation)? this is the crux of my code:
```python
res: Any = completion(
    api_key=self.config.api_key,
    api_base=self.config.api_base,
    api_version=self.config.api_version,
    model=engine,
    messages=messages,  # type: ignore
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=top_p,
    frequency_penalty=self.config.frequency_penalty,
    presence_penalty=self.config.presence_penalty,
    stop=stop,
    stream=stream,
    seed=seed,
    response_format=response_format,
    extra_body=extra_body,
    **tools_kwargs,
)
if stream:
    role: Any = None
    for stream_res in res:
        if not stream_res.choices:
            continue
        delta = stream_res.choices[0].delta
        if delta is None:
            continue
        role = delta.role if delta.role is not None else role
        content = delta.content if delta.content is not None else ""
        if content is None:
            continue
        yield format_chat_message(role, content)
```
the actual llm span creation i'm assuming is done by the above packages (specifically at the end of .completion), so i don't have direct control. Should i just manage the context before i make the completion call?
When using Litellm.completion with OpenAiInstrumentor for auto-instrumentation, the span creation is indeed handled by the instrumentation. However, to ensure that the context is correctly propagated, especially for streaming responses, you should manage the context before making the completion call. This involves setting the context so that the spans are correctly nested under the parent span. Here's how you can manage the context:
1. Start a Parent Span: Before making the completion call, start a parent span using start_as_current_span.
2. Ensure Context Propagation: Use the context management utilities to ensure that the context is correctly propagated during the streaming operation.
3. End the Parent Span: After the streaming operation is complete, ensure that the parent span is ended (the start_as_current_span context manager does this for you).
Here's a code snippet to illustrate this:
```python
from opentelemetry import trace
from opentelemetry.context import attach, detach, get_current

tracer = trace.get_tracer(__name__)

async def perform_completion():
    # Start a parent span
    with tracer.start_as_current_span("parent_span"):
        # Attach the current context
        token = attach(get_current())
        try:
            # Perform the completion call
            res = completion(
                api_key=self.config.api_key,
                api_base=self.config.api_base,
                api_version=self.config.api_version,
                model=engine,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                top_p=top_p,
                frequency_penalty=self.config.frequency_penalty,
                presence_penalty=self.config.presence_penalty,
                stop=stop,
                stream=stream,
                seed=seed,
                response_format=response_format,
                extra_body=extra_body,
                **tools_kwargs,
            )
            if stream:
                role = None
                for stream_res in res:
                    if not stream_res.choices:
                        continue
                    delta = stream_res.choices[0].delta
                    if delta is None:
                        continue
                    role = delta.role if delta.role is not None else role
                    content = delta.content if delta.content is not None else ""
                    if content is None:
                        continue
                    yield format_chat_message(role, content)
        finally:
            # Detach the context to ensure proper cleanup
            detach(token)

# perform_completion is an async generator, so consume it with async for
async for message in perform_completion():
    ...
```
This approach ensures that the context is correctly managed and propagated, preventing orphaned spans in the trace.
Sources:
- Slack Discussion
When dealing with streaming LLM calls and auto-instrumentation, ensuring proper context propagation can be tricky. If the parent span and the LLM streaming span are still orphaned, it might be due to the asynchronous nature of the streaming operation. Here are some additional steps you can take:
1. Ensure Context is Active: Before making the completion call, ensure that the context is active and correctly set. This can be done by using trace.use_span to explicitly set the span as the current span.
2. Check Instrumentation Configuration: Verify that the OpenAiInstrumentor and Litellm are correctly configured to propagate context. Sometimes, configuration issues can lead to context not being propagated as expected.
3. Debug Context Propagation: Use logging or debugging tools to check the context at various points in your code to ensure it is being propagated correctly.
Here's an updated approach:
```python
from opentelemetry import trace
from opentelemetry.context import attach, detach
from opentelemetry.trace import set_span_in_context

tracer = trace.get_tracer(__name__)

async def perform_completion():
    # Start a parent span
    with tracer.start_as_current_span("parent_span") as parent_span:
        # Set the span in the current context explicitly
        context = set_span_in_context(parent_span)
        token = attach(context)
        try:
            # Perform the completion call
            res = completion(
                api_key=self.config.api_key,
                api_base=self.config.api_base,
                api_version=self.config.api_version,
                model=engine,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                top_p=top_p,
                frequency_penalty=self.config.frequency_penalty,
                presence_penalty=self.config.presence_penalty,
                stop=stop,
                stream=stream,
                seed=seed,
                response_format=response_format,
                extra_body=extra_body,
                **tools_kwargs,
            )
            if stream:
                role = None
                for stream_res in res:
                    if not stream_res.choices:
                        continue
                    delta = stream_res.choices[0].delta
                    if delta is None:
                        continue
                    role = delta.role if delta.role is not None else role
                    content = delta.content if delta.content is not None else ""
                    if content is None:
                        continue
                    yield format_chat_message(role, content)
        finally:
            # Detach the context to ensure proper cleanup
            detach(token)

# perform_completion is an async generator, so consume it with async for
async for message in perform_completion():
    ...
```
This approach explicitly sets the span in the current context, which should help maintain the correct parent-child relationship in the trace.
Sources:
- Slack Discussion
Hi Swaraj R.! When streaming, there are a number of situations that might cause your span hierarchy to change relative to the non-streaming case. When streaming an LLM call, the final span isn't submitted until the stream is fully consumed, which might cause the nesting to show up differently depending on when you're looking at the traces. Can you check to see if waiting until all the streams have finished and refreshing shows your spans the way you expect?
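This is why the point of consumption matters. Here's a minimal stdlib-only sketch (using contextvars to stand in for OpenTelemetry's context machinery, not the real API) showing that a generator's body, and therefore whatever "current span" it sees, runs only when the stream is actually consumed:

```python
import contextvars

# A stand-in for OpenTelemetry's "current span" context variable.
current_span = contextvars.ContextVar("current_span", default=None)

def stream_chunks():
    # The body of a generator runs only when it is consumed, so the
    # parent recorded here depends on WHEN the caller iterates.
    parent = current_span.get()
    for chunk in ["hello", " world"]:
        yield (parent, chunk)

# Case 1: stream consumed while the parent "span" is still active.
token = current_span.set("root_workflow")
consumed_inside = list(stream_chunks())
current_span.reset(token)

# Case 2: stream created inside the parent, but consumed after it ended.
token = current_span.set("root_workflow")
stream = stream_chunks()
current_span.reset(token)  # parent "span" ends before consumption
consumed_outside = list(stream)

print(consumed_inside)   # chunks see "root_workflow" as parent
print(consumed_outside)  # chunks see None -> orphaned
```

The same effect applies to auto-instrumented streaming responses: if the response iterator is drained after the enclosing start_as_current_span block has exited, the instrumentation may record the span against a different (or empty) context.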
by waiting you mean check phoenix after my workflow is done running? if so, i did check afterwards and there are 3 top level traces:
1. the main one, with all non-streaming llm spans and all other spans, properly arranged
2. and 3. are the orphaned streaming llm spans
are all the streaming llm spans orphaned or only some of them?
all streaming are orphaned all non-streaming are not
how is your perform_completion method called?
can you try getting the context outside the scope of the function making the streaming completion request?
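That suggestion can be sketched like this, again using stdlib contextvars in place of the real OpenTelemetry context API (where attach/get_current would play the same role): capture the context while the parent span is active, outside the streaming function, then run each step of the generator inside that captured context so chunks consumed after the parent block exits still see the right parent.

```python
import contextvars

# Stand-in for OpenTelemetry's "current span" context variable.
current_span = contextvars.ContextVar("current_span", default=None)

def stream_with_captured_context(captured):
    # Run each step inside the captured context so the parent "span"
    # is visible even if the caller consumes the stream later.
    def step():
        return (current_span.get(), "chunk")
    for _ in range(2):
        yield captured.run(step)

# Capture the context while the parent "span" is active...
token = current_span.set("root_workflow")
captured = contextvars.copy_context()
stream = stream_with_captured_context(captured)
current_span.reset(token)  # parent ends before the stream is consumed

# ...and the chunks still see the captured parent.
chunks = list(stream)
print(chunks)  # both chunks carry "root_workflow"
```

With real OpenTelemetry, the analogue would be calling get_current() while the parent span is active and passing that context into the function that makes the streaming completion call, attaching it there before iterating the response.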
