Here’s a minimal Airflow DAG with one Phoenix-instrumented task. This assumes every Airflow worker and the scheduler have the required packages installed (e.g., arize-phoenix-otel and openinference-instrumentation-openai):
```
from datetime import datetime
import os

from airflow import DAG
from airflow.operators.python import PythonOperator

# Point Phoenix at your collector; this can also be set in the worker
# environment/config. Use the base URL of your Phoenix instance.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://your-phoenix-host:6006"

# Import and register Phoenix tracing before any LLM/framework imports.
from phoenix.otel import register

tracer_provider = register(
    project_name="airflow-llm-pipeline",
    auto_instrument=True,  # auto-instrument supported libraries
    batch=True,            # batch span export; recommended for production
)


def traced_task(**kwargs):
    # Your LLM or other instrumented code here.
    import openai

    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello from Airflow!"}],
    )
    print(response.choices[0].message.content)


with DAG(
    dag_id="phoenix_airflow_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_traced = PythonOperator(
        task_id="run_traced",
        python_callable=traced_task,
        # provide_context is deprecated in Airflow 2; context is passed automatically
    )
```
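One caveat worth knowing: with batch=True, spans are exported asynchronously, and an Airflow task process can exit before the queue drains, so runs may appear to be missing traces. A minimal sketch of a safeguard, using the standard OpenTelemetry tracer API on the provider returned by register() (the span name and attribute key below are arbitrary choices, not Phoenix conventions):
```
from phoenix.otel import register

# Same registration as in the DAG file above.
tracer_provider = register(project_name="airflow-llm-pipeline", batch=True)
tracer = tracer_provider.get_tracer(__name__)

def traced_task(**kwargs):
    # Optional parent span so each task run groups its LLM calls in Phoenix.
    with tracer.start_as_current_span("airflow.traced_task") as span:
        span.set_attribute("airflow.dag_id", kwargs["dag"].dag_id)
        ...  # your instrumented LLM calls, as in the example above
    # Block until queued spans are exported (or the export timeout elapses),
    # so the task process doesn't exit with spans still buffered.
    tracer_provider.force_flush()
```
force_flush() is part of the OpenTelemetry SDK's TracerProvider API, so it's safe to call at the end of any task.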
*Key points:*
- Install all dependencies in every worker and scheduler environment (see the requirements.txt sketch after this list).
- Set PHOENIX_COLLECTOR_ENDPOINT in the environment (not just in code).
- Import and call phoenix.otel.register before any LLM/framework imports.
- Use auto_instrument=True for automatic tracing of supported libraries.
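For reference, a minimal requirements.txt covering the example above; versions are left unpinned here, so pin them against your Airflow constraints file as appropriate:
```
apache-airflow
arize-phoenix-otel
openinference-instrumentation-openai
openai
```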
Let me know if you need a more advanced setup!
Sources:
- https://github.com/Arize-ai/phoenix/issues/8380
- https://arize.com/docs/phoenix/tracing/llm-traces-1/quickstart-tracing-python#overview