Another question, if I may 🙂 I'm running Phoenix on a GCP VM with a separate Postgres instance backing it. I'm running a benchmark of ~300 queries against my system; each query runs prompts totaling hundreds of thousands of tokens (I had a "session" with more than 7,000 spans). I registered Phoenix using register(project_name="mabool", batch=True, protocol="http/protobuf", auto_instrument=True). I'm using HTTP because gRPC seems to choke at some point (something with "resource exhausted" from otel; I couldn't find anything about it online).

My question is: using the above config, is it normal to see traces pop up in the UI hours after a query ran? I'm asking because I thought some queries had completely failed, so I killed my app, and hours later these sessions appeared in the UI. The "Stream" option is turned on, by the way.

Thanks again for the help and for Phoenix, the self-hosting is amazing DX-wise 🤩 🙏

EDIT: I forgot to mention: my system runs on the same GCP VM that Phoenix runs on. I don't know how much it matters, i.e., how resource-intensive Phoenix is (my system is mainly IO-bound with all the network calls, not much CPU/RAM usage).
Yes, with large trace volumes (thousands of spans, hundreds of thousands of tokens), it is normal for traces to appear in the Phoenix UI with significant delay, sometimes hours, especially when using HTTP and batch mode, due to the time required for ingestion, database writes, and UI refreshes. This is exacerbated if the database or the Phoenix server is under heavy load, or if the connection pool is exhausted, which can cause backlogs and slow processing of incoming traces. The "Stream" toggle in the UI only affects how often the frontend polls for new data, not the backend ingestion speed or the order in which traces appear.

If you see traces appearing long after the queries ran, it likely means the Phoenix server and/or Postgres were still processing the backlog after your app exited. This is expected behavior under heavy load, and is not specific to your HTTP/protobuf config. For more on performance and connection pool issues, see this discussion and here. If you want to reduce lag, consider increasing your Postgres connection pool size, scaling up your VM, or batching fewer traces at a time.

Would you like more detail on tuning performance or troubleshooting ingestion delays?
Hey Ben E., the bot is hallucinating here. It could be a few things, but mainly I'd want to understand your OTEL configuration (https://opentelemetry.io/docs/collector/configuration/). If you're going to export spans fast and in rapid succession, it's recommended to tune your batching strategy. With batching, the exports happen in the background. Because of this, the "Trace" might not appear to have arrived until the final span is emitted and ingested. Spans arrive at Phoenix as they are exported, but they may be placed in in-memory queues to protect against massive concurrent writes, so memory usage can go up with volume. It's probably not advisable long-term to share resources with your application, just from a decoupling perspective. In summary:
- set up proper batching
- provision Phoenix with some extra memory for the queue if you plan on high write volume
I've seen the "resource exhausted" message before on occasion when things are misconfigured. In general it should be solvable via some sort of configuration, so happy to dig in there.
I honestly don't think I have an OTEL config. The VM uses a minimal Ubuntu 24.04 image, I installed gh and uv to clone and install the project's dependencies, and then ran uv run phoenix serve and my app. When starting my app, this is what I get:
🔭 OpenTelemetry Tracing Details 🔭
| Phoenix Project: mabool
| Span Processor: BatchSpanProcessor
| Collector Endpoint: http://localhost:6006/v1/traces
| Transport: HTTP + protobuf
| Transport Headers: {}
|
| Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|
| `register` has set this TracerProvider as the global OpenTelemetry default.
| To disable this behavior, call `register` with `set_global_tracer_provider=False`.
and the VM doesn't seem to have any /etc/otel* stuff in it. The app is not (currently) open-source, so I can't share it publicly just yet, but if there's any more information I can give, I'll be glad to share it! I remember playing around with environment variables that I thought would control the batch queue size and send size, but I can't find where I read about them. It didn't seem to help, but maybe my config was just way off.
Got it. Yeah, I'd definitely think about using one of the Phoenix images at some point; it will make the upgrade process a lot easier and more deterministic. We have a Helm chart too if you're using Kubernetes. The other side benefit is that you won't need to install arize-phoenix as a dependency: you can just install the OTEL instrumentation + client if you want to send evals / feedback. With regards to batching, these are the config options you have: https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-span-processor It's worth tuning these based on the throughput you want.
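For reference, the batch-span-processor knobs from that spec page are plain environment variables, so they can be set before launching the app without touching code. The variable names come from the OTEL SDK spec; the values below are illustrative assumptions to tune, not recommendations:

```shell
# Batch span processor tuning via OTEL SDK env vars (values illustrative).
export OTEL_BSP_MAX_QUEUE_SIZE=8192        # spans buffered in memory (spec default: 2048)
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512  # spans per export request (spec default: 512)
export OTEL_BSP_SCHEDULE_DELAY=1000        # ms between background flushes (spec default: 5000)
export OTEL_BSP_EXPORT_TIMEOUT=30000       # ms before an export is cancelled (spec default: 30000)
```

Raising the queue size helps absorb bursts like a 7,000-span session at the cost of memory, which ties back to the earlier point about provisioning Phoenix (and the app) with some headroom.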
