Hi all, does anyone know how to make evaluation traces (the LLM evaluator of an experiment) fully visible (not only the output)? I have this setup, which should log all traces:
```
tracer_provider = register(protocol="http/protobuf", project_name=project_name, auto_instrument=True)
```

To make evaluation traces (the LLM evaluator of an experiment) fully visible in Phoenix, including more than just the output column, log your evaluation results as annotations on spans. Evaluations can include `label`, `score`, and `explanation` fields, and when logged using `SpanEvaluations` these appear as annotations in the Phoenix UI, making them fully visible alongside the trace output. You can use built-in or custom evaluators, and log results with:

```
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Your Eval Name", dataframe=your_eval_df)
)
```

Make sure your evaluation DataFrame includes the span_id and the desired columns (`label`, `score`, `explanation`) for full visibility. For more, see Log Evaluation Results and Evaluators Overview.

Would you like more detail on how to structure your DataFrame or customize the UI display?

Sources:
- Log Evaluation Results
- Evaluators Overview
💡 Hint: Mention RunLLM in the thread for followups.
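As a minimal sketch of the DataFrame shape the bot describes, assuming Phoenix's convention of indexing the evaluations DataFrame by `context.span_id` (the span IDs and eval values below are made up for illustration):

```python
import pandas as pd

# Hypothetical evaluation results; in practice the span IDs come from your
# traced spans and the values from your LLM evaluator.
eval_df = pd.DataFrame(
    {
        "label": ["correct", "incorrect"],
        "score": [1.0, 0.0],
        "explanation": ["answer matches reference", "answer contradicts reference"],
    },
    index=pd.Index(["span-id-1", "span-id-2"], name="context.span_id"),
)
print(eval_df)

# Logging it (requires a running Phoenix instance, so shown here commented out):
# import phoenix as px
# from phoenix.trace import SpanEvaluations
# px.Client().log_evaluations(
#     SpanEvaluations(eval_name="Correctness", dataframe=eval_df)
# )
```

With all three columns present, each annotated span shows the label, score, and explanation in the UI, not just the output.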
The evaluation run is itself an LLM invocation, which should be traced, but it shows only the output, not the full LLM call details.
yes, I see that project
but inside it, the trace consists only of the result, without the LLM invocation details
thanks, let me debug that
Hi John G., what versions of openinference-instrumentation-openai, openai, and arize-phoenix do you have? This sample successfully traces Groq, but not OpenAI:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from phoenix.config import get_env_host, get_env_port
from phoenix.otel import register
tracer_provider = register(protocol="http/protobuf", project_name="test_project", auto_instrument=True)
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)
collector_endpoint = f"http://{get_env_host()}:{get_env_port()}/v1/traces"
span_exporter = OTLPSpanExporter(endpoint=collector_endpoint)
simple_span_processor = SimpleSpanProcessor(span_exporter=span_exporter)
trace.get_tracer_provider().add_span_processor(simple_span_processor)
import openai

OPENAI_API_KEY = ""  # key redacted in the original thread
client = openai.OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a haiku."}],
)
print(response.choices[0].message.content)
# import os
# from groq import Groq
# GROQ_API_KEY = ""
# client = Groq(
# api_key=GROQ_API_KEY
# )
# chat_completion = client.chat.completions.create(
# messages=[
# {
# "role": "user",
# "content": "Explain the importance of low latency LLMs",
# }
# ],
# model="llama3-70b-8192",
# )
# print(chat_completion.choices[0].message.content)
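To answer the version question above, a quick stdlib-only check of the three packages named in the thread can be run in the same environment (package names taken from the question; a missing openinference-instrumentation-openai would explain why OpenAI calls produce no span details):

```python
from importlib.metadata import PackageNotFoundError, version

# Packages the thread asks about; report the installed version of each,
# or flag it as missing.
packages = ["openai", "arize-phoenix", "openinference-instrumentation-openai"]
versions = {}
for pkg in packages:
    try:
        versions[pkg] = version(pkg)
    except PackageNotFoundError:
        versions[pkg] = "not installed"

for pkg, ver in versions.items():
    print(f"{pkg}: {ver}")
```

Note that with `auto_instrument=True`, `register()` only instruments libraries whose OpenInference instrumentor package is actually installed, so "not installed" for the OpenAI instrumentor is the first thing to rule out.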