Another question, if I may 🙂 I'm running Phoenix on a GCP VM with a separate Postgres instance backing it. I'm running a benchmark of ~300 queries against my system; each query runs prompts totaling hundreds of thousands of tokens (I had a "session" with more than 7,000 spans). I registered Phoenix using register(project_name="mabool", batch=True, protocol="http/protobuf", auto_instrument=True). I'm using HTTP because gRPC seems to choke at some point (something with "resource exhausted" from otel; I couldn't find anything about it online).

My question is: using the above config, is it normal to see traces pop up in the UI hours after a query ran? I'm asking because I thought some queries had completely failed, so I killed my app, and hours later these sessions appeared in the UI. The "Stream" option is turned on, by the way.

Thanks again for the help and for Phoenix, the self-hosting is amazing DX-wise 🤩 🙏

EDIT: I forgot to mention: my system runs on the same GCP VM that Phoenix runs on. I don't know how much it matters, i.e., how resource-intensive Phoenix is (my system is mainly IO-bound with all the network calls, not much CPU/RAM usage).
Yes, with large trace volumes (thousands of spans, hundreds of thousands of tokens), it is normal for traces to appear in the Phoenix UI with significant delay, sometimes hours, especially when using HTTP and batch mode, due to the time required for ingestion, database writes, and UI refreshes. This is exacerbated if the database or the Phoenix server is under heavy load, or if the connection pool is exhausted, which can cause backlogs and slow processing of incoming traces. The "Stream" toggle in the UI only affects how often the frontend polls for new data, not the backend ingestion speed or the order in which traces appear.

If you see traces appearing long after the queries ran, it likely means the Phoenix server and/or Postgres were still processing the backlog after your app exited. This is expected behavior under heavy load, and is not specific to your HTTP/protobuf config. For more on performance and connection pool issues, see this discussion and here. If you want to reduce lag, consider increasing your Postgres connection pool size, scaling up your VM, or batching fewer traces at a time.

Would you like more detail on tuning performance or troubleshooting ingestion delays?
Hey Ben E., the bot is hallucinating here. It could be a few things, but mainly I'd want to understand your OTEL configuration (https://opentelemetry.io/docs/collector/configuration/). If you're going to export spans fast and in rapid succession, it's recommended to tune your batching strategy. With batching, the exports happen in the background. Because of this, the "Trace" might not appear to have arrived until the final span is emitted and ingested. Spans arrive at Phoenix as they are exported, but they may be placed in in-memory queues to protect against massive concurrent writes, so memory usage can go up with volume. It's probably not advisable long-term to share resources with your application, just from a decoupling perspective. In summary:
- set up proper batching
- provision Phoenix with some extra memory for the queue if you plan on high write volume
I've seen the "resource exhausted" message before on occasion when things are misconfigured. In general it should be solvable via some sort of configuration, so happy to dig in there.
I honestly don't think I have an OTEL config. The VM uses a minimal Ubuntu 24.04 image, I installed gh and uv to clone and install the project's dependencies, and then ran uv run phoenix serve and my app. When starting my app, this is what I get:
🔭 OpenTelemetry Tracing Details 🔭
| Phoenix Project: mabool
| Span Processor: BatchSpanProcessor
| Collector Endpoint: http://localhost:6006/v1/traces
| Transport: HTTP + protobuf
| Transport Headers: {}
|
| Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|
| `register` has set this TracerProvider as the global OpenTelemetry default.
| To disable this behavior, call `register` with `set_global_tracer_provider=False`.
and the VM doesn't seem to have any /etc/otel* stuff in it. The app is not (currently) open-source, so I can't share it publicly just yet, but if there's any more information I can give, I'll be glad to share it! I remember playing around with environment variables that I thought would control the batch queue size and send size, but I can't find where I read about them. It didn't seem to help, but maybe my config was just way off.
Got it. Yeah, I'd definitely think about using one of the Phoenix images at some point; it will make the upgrade process a lot easier and more deterministic. We have a Helm chart too if you're using Kubernetes. The other side benefit is that you won't need to install arize-phoenix as a dependency: you can just install the OTEL instrumentation + client if you want to send evals / feedback. With regards to batching, these are the config options you have: https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#batch-span-processor It's worth tuning these based on the throughput you want.
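For reference, the batch-span-processor knobs from that spec page are plain environment variables, so they can be set before launching the app without touching code. The variable names come from the OTEL SDK spec; the values below are illustrative assumptions to tune, not recommendations:

```shell
# Batch span processor tuning via OTEL SDK env vars (values illustrative).
export OTEL_BSP_MAX_QUEUE_SIZE=8192        # spans buffered in memory (spec default: 2048)
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512  # spans per export request (spec default: 512)
export OTEL_BSP_SCHEDULE_DELAY=1000        # ms between background flushes (spec default: 5000)
export OTEL_BSP_EXPORT_TIMEOUT=30000       # ms before an export is cancelled (spec default: 30000)
```

Raising the queue size helps absorb bursts like a 7,000-span session at the cost of memory, which ties back to the earlier point about provisioning Phoenix (and the app) with some headroom.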
