When trying to use a self-hosted Phoenix instance with gRPC I'm getting a lot of errors like these:
Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to phoenix-arize:4317, retrying in 1s.
The retry intervals get increasingly larger; it looks like some exponential backoff keeps retrying and failing repeatedly. I've tried tweaking these env vars:
OTEL_EXPORTER_OTLP_TIMEOUT
OTEL_EXPORTER_OTLP_TRACES_TIMEOUT
OTEL_BSP_MAX_QUEUE_SIZE
OTEL_BSP_SCHEDULE_DELAY
OTEL_BSP_MAX_EXPORT_BATCH_SIZE
Nothing seems to help. Any ideas?
This error usually means your client cannot reach the Phoenix gRPC collector at the specified host/port, or the Phoenix server is not listening on 4317 for gRPC. By default, Phoenix exposes gRPC on port 4317, but you must ensure the port is open, mapped, and not blocked by firewalls or misconfigured networking (especially in Docker or Kubernetes setups). Also, make sure your client and server are using the same protocol (gRPC, not HTTP) and that the endpoint is correct (e.g., http://localhost:4317 for local, or the correct service name in k8s) [source1] [source2].

If you continue to see DEADLINE_EXCEEDED, try switching to HTTP export on port 6006 as a workaround, since some environments (like certain cloud providers or reverse proxies) may not support gRPC well. You can do this by setting the endpoint to http://<phoenix-host>:6006/v1/traces and the protocol to "http/protobuf" in your tracing config [source3] [source4].

Would you like more detail on debugging network connectivity or configuration?

Sources:
- source1
- source2
- source4
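Not from the thread itself, but a minimal stdlib-only sketch of the network-connectivity check the bot mentions: it just tests whether a TCP connection succeeds to the Phoenix gRPC (4317) and HTTP (6006) ports. The host name is an assumption; substitute your own.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check both the gRPC (4317) and HTTP (6006) collector ports.
for port in (4317, 6006):
    status = "reachable" if can_connect("localhost", port) else "unreachable"
    print(f"port {port}: {status}")
```

If 4317 is unreachable but 6006 is, the gRPC port is likely not mapped or is blocked; if both fail, the problem is host-level networking rather than protocol choice.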
💡 Hint: Mention RunLLM in the thread for followups.
If only some spans are dropped and you see "remote closed connection" errors even with HTTP, this often points to resource constraints or server-side issues (e.g., Phoenix running out of memory, hitting connection limits, or database bottlenecks). This is especially common with high throughput or large payloads [source].

Try reducing the batch size and increasing the schedule delay (e.g., OTEL_BSP_MAX_EXPORT_BATCH_SIZE=10, OTEL_BSP_SCHEDULE_DELAY=10000), and monitor the Phoenix server logs for errors. Upgrading to the latest Phoenix version may help, as recent releases include performance improvements for large trace volumes [source].

Would you like more detail on server tuning or log analysis?

Sources:
- https://github.com/Arize-ai/phoenix/issues/6469
- https://github.com/Arize-ai/phoenix/issues/4921
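The values suggested above can be applied from Python before the SDK starts; a sketch, assuming these standard OTel BatchSpanProcessor env vars are read at SDK construction time (set them before calling register()):

```python
import os

# Must be set before the OpenTelemetry SDK / phoenix.otel.register() is
# initialized; the BatchSpanProcessor reads them at construction time.
os.environ["OTEL_BSP_MAX_EXPORT_BATCH_SIZE"] = "10"    # much smaller batches
os.environ["OTEL_BSP_SCHEDULE_DELAY"] = "10000"        # export every 10 s
```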
Do you have batching configured on the exporter?
yes, i have tried it both ways (batch and simple) with grpc and http
Any chance you can give us a rough estimate as to the volume of data you are sending?
i'm trying to instrument llm eval runners - the current setup uses 8 parallel runners, each with its own exporter, so in the worst case it can submit 8 batches at a time. these are llm traces and traces instrumenting a complex modular agent workflow, there are llm traces at the bottom but the overall tree contains other, higher order logical nodes. these higher nodes can contain llm prompts as well, although i have already pruned almost all the fat from them and it didn't help. the collector is on a c4.8xlarge instance. i've tried running two collector containers in a docker compose - one only for writes and one for reads. both are backed by the same postgres
i'm seeing what look like timeouts on the exporter side and the traces don't look like they arrive in the collector
i don't see anything unusual in the collector logs although they are very basic and the phoenix docs don't seem to have a verbose log level config
So the traces never show up - is that correct? If so I wonder if it's networking or auth related?
most traces show up, about 80%
Oh I see. Hmm, let me do a bit more digging then. Thanks for the context. Appreciate it.
tvm
for some additional context, this is how i create things:
tracer_provider = register(
project_name=PHOENIX_APP_NAME,
# protocol="http/protobuf",
# protocol="grpc",
auto_instrument=True,
batch=True,
set_global_tracer_provider=False,
)
PHOENIX_TRACER = tracer_provider.get_tracer(__name__)
i turned off set_global_tracer_provider because we have other tracing in other parts of our code that's not LLM-related and goes to DD, those tracers have their own provider and i wanted to keep them separate, not sure if this helps at all
to clarify - i've turned off the other tracing and the problem is still happening
That is helpful. For things like DD export, are you running an OTEL agent? This might give you more knobs to process the spans before the final export and offload the direct connection from your application to the Phoenix instance. Another thing you could play with is the batch size. It might be that you are saturating the connections to Phoenix and need to batch even higher.
# export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
from opentelemetry import trace as trace_api
from phoenix.otel import TracerProvider, BatchSpanProcessor
tracer_provider = TracerProvider()
batch_processor = BatchSpanProcessor()
tracer_provider.add_span_processor(batch_processor)
https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor
maxQueueSize - the maximum queue size. After the size is reached spans are dropped. The default value is 2048.
scheduledDelayMillis - the maximum delay interval in milliseconds between two consecutive exports. The default value is 5000.
exportTimeoutMillis - how long the export can run before it is cancelled. The default value is 30000.
maxExportBatchSize - the maximum batch size of every export. It must be smaller or equal to maxQueueSize. If the queue reaches maxExportBatchSize a batch will be exported even if scheduledDelayMillis milliseconds have not elapsed. The default value is 512.
can i change these with env vars?
OTEL_BSP_MAX_QUEUE_SIZE
OTEL_BSP_SCHEDULE_DELAY
OTEL_BSP_MAX_EXPORT_BATCH_SIZE