When trying to use a self-hosted Phoenix instance with gRPC I'm getting a lot of errors like these:
Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to phoenix-arize:4317, retrying in 1s.
The retry intervals get increasingly larger; it looks like some exponential backoff keeps retrying and failing repeatedly. I've tried tweaking these env vars:
OTEL_EXPORTER_OTLP_TIMEOUT
OTEL_EXPORTER_OTLP_TRACES_TIMEOUT
OTEL_BSP_MAX_QUEUE_SIZE
OTEL_BSP_SCHEDULE_DELAY
OTEL_BSP_MAX_EXPORT_BATCH_SIZE
Nothing seems to help. Any ideas?
This error usually means your client cannot reach the Phoenix gRPC collector at the specified host/port, or the Phoenix server is not listening on 4317 for gRPC. By default, Phoenix exposes gRPC on port 4317, but you must ensure the port is open, mapped, and not blocked by firewalls or misconfigured networking (especially in Docker or Kubernetes setups). Also, make sure your client and server are using the same protocol (gRPC, not HTTP) and that the endpoint is correct (e.g., http://localhost:4317 for local, or the correct service name in k8s) [source1] [source2].

If you continue to see DEADLINE_EXCEEDED, try switching to HTTP export on port 6006 as a workaround, since some environments (like certain cloud providers or reverse proxies) may not support gRPC well. You can do this by setting the endpoint to http://<phoenix-host>:6006/v1/traces and the protocol to "http/protobuf" in your tracing config [source3] [source4].

Would you like more detail on debugging network connectivity or configuration?

Sources:
- source1
- source2
- source4
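Not from the thread itself, but a minimal stdlib-only sketch of the network-connectivity check the bot mentions: it just tests whether a TCP connection succeeds to the Phoenix gRPC (4317) and HTTP (6006) ports. The host name is an assumption; substitute your own.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check both the gRPC (4317) and HTTP (6006) collector ports.
for port in (4317, 6006):
    status = "reachable" if can_connect("localhost", port) else "unreachable"
    print(f"port {port}: {status}")
```

If 4317 is unreachable but 6006 is, the gRPC port is likely not mapped or is blocked; if both fail, the problem is host-level networking rather than protocol choice.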
💡 Hint: Mention RunLLM in the thread for followups.
If only some spans are dropped and you see "remote closed connection" errors even with HTTP, this often points to resource constraints or server-side issues (e.g., Phoenix running out of memory, hitting connection limits, or database bottlenecks). This is especially common with high throughput or large payloads [source].

Try reducing the batch size and increasing the schedule delay (e.g., OTEL_BSP_MAX_EXPORT_BATCH_SIZE=10, OTEL_BSP_SCHEDULE_DELAY=10000), and monitor the Phoenix server logs for errors. Upgrading to the latest Phoenix version may help, as recent releases include performance improvements for large trace volumes [source].

Would you like more detail on server tuning or log analysis?

Sources:
- https://github.com/Arize-ai/phoenix/issues/6469
- https://github.com/Arize-ai/phoenix/issues/4921
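The values suggested above can be applied from Python before the SDK starts; a sketch, assuming these standard OTel BatchSpanProcessor env vars are read at SDK construction time (set them before calling register()):

```python
import os

# Must be set before the OpenTelemetry SDK / phoenix.otel.register() is
# initialized; the BatchSpanProcessor reads them at construction time.
os.environ["OTEL_BSP_MAX_EXPORT_BATCH_SIZE"] = "10"    # much smaller batches
os.environ["OTEL_BSP_SCHEDULE_DELAY"] = "10000"        # export every 10 s
```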
Do you have batching configured on the exporter?
yes, i have tried it both ways (batch and simple) with grpc and http
Any chance you can give us a rough estimate as to the volume of data you are sending?
i'm trying to instrument llm eval runners - the current setup uses 8 parallel runners, each with its own exporter, so in the worst case it can submit 8 batches at a time. these are llm traces and traces instrumenting a complex modular agent workflow, there are llm traces at the bottom but the overall tree contains other, higher order logical nodes. these higher nodes can contain llm prompts as well, although i have already pruned almost all the fat from them and it didn't help. the collector is on a c4.8xlarge instance. i've tried running two collector containers in a docker compose - one only for writes and one for reads. both are backed by the same postgres
i'm seeing what look like timeouts on the exporter side and the traces don't look like they arrive in the collector
i don't see anything unusual in the collector logs although they are very basic and the phoenix docs don't seem to have a verbose log level config
So the traces never show up - is that correct? If so I wonder if it's networking or auth related?
most traces show up, about 80%
Oh I see. Hmm, let me do a bit more digging then. Thanks for the context. Appreciate it.
tvm
for some additional context, this is how i create things:
tracer_provider = register(
project_name=PHOENIX_APP_NAME,
# protocol="http/protobuf",
# protocol="grpc",
auto_instrument=True,
batch=True,
set_global_tracer_provider=False,
)
PHOENIX_TRACER = tracer_provider.get_tracer(__name__)
i turned off set_global_tracer_provider because we have other tracing in other parts of our code that's not LLM-related and goes to DD, those tracers have their own provider and i wanted to keep them separate, not sure if this helps at all
to clarify - i've turned off the other tracing and the problem is still happening
That is helpful. For things like DD export, are you running an OTEL agent? This might give you more knobs to process the spans before the final export and offload the direct connection from your application to the Phoenix instance. Another thing you could play with is the batch size. It might be that you are saturating the connections to Phoenix and need to batch even higher.
# export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
from opentelemetry import trace as trace_api
from phoenix.otel import TracerProvider, BatchSpanProcessor
tracer_provider = TracerProvider()
batch_processor = BatchSpanProcessor()
tracer_provider.add_span_processor(batch_processor)
https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor
maxQueueSize - the maximum queue size. After the size is reached spans are dropped. The default value is 2048.
scheduledDelayMillis - the maximum delay interval in milliseconds between two consecutive exports. The default value is 5000.
exportTimeoutMillis - how long the export can run before it is cancelled. The default value is 30000.
maxExportBatchSize - the maximum batch size of every export. It must be smaller or equal to maxQueueSize. If the queue reaches maxExportBatchSize a batch will be exported even if scheduledDelayMillis milliseconds have not elapsed. The default value is 512.
can i change these with env vars?
OTEL_BSP_MAX_QUEUE_SIZE
OTEL_BSP_SCHEDULE_DELAY
OTEL_BSP_MAX_EXPORT_BATCH_SIZE