Hi, I'm currently running some custom evaluations using llm_classify from phoenix.evals as part of an LLM-as-a-judge workflow. Our goal is to evaluate model performance on a local DataFrame and visualize the results in Phoenix. How can we log these custom evaluations to Phoenix so that they appear in the dashboard for performance tracking? Thanks in advance!
To log your custom llm_classify evaluations to Phoenix and visualize them in the dashboard, use the px.Client().log_evaluations() method with a SpanEvaluations object containing your results DataFrame. Ensure your DataFrame includes a span_id column to associate each evaluation with the correct span. For example:

```
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Your Eval Name", dataframe=your_eval_results_df)
)
```

This will make your custom evals visible and sortable in the Phoenix UI, provided your DataFrame is based on spans exported from Phoenix or includes the correct identifiers. For more details and requirements, see the official log evaluation results guide and the bring your own evaluator documentation.

Would you like a step-by-step example or help troubleshooting a specific issue?

Sources:
- Bring Your Own Evaluator
- log evaluation results guide
💡 Hint: Mention RunLLM in the thread for followups.
You can find the span_id column by exporting your spans from Phoenix using the get_spans_dataframe() method, which returns a DataFrame with context.span_id as a column or index. Example:

```
import phoenix as px

spans_df = px.Client().get_spans_dataframe()
print(spans_df.columns)
```

The span_id is typically under the column context.span_id. Use this value to link your evaluations to the correct spans. For more, see the extract data from spans guide.

Would you like more detail on extracting or using this column?

Sources:
- extract data from spans guide
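As a concrete illustration of linking evals to that column (the spans frame below is a mock stand-in, since a real get_spans_dataframe() call needs a live Phoenix session, and the column names beyond context.span_id are illustrative):

```python
import pandas as pd

# Stand-in for px.Client().get_spans_dataframe(), which needs a live
# Phoenix session; the real export has many more columns.
spans_df = pd.DataFrame(
    {
        "context.span_id": ["span-aaa", "span-bbb"],
        "attributes.input.value": ["What is 2+2?", "Capital of France?"],
    }
)

# Keep just the identifier and attach evaluation results; here they are
# assigned by position, but in practice you would join on a preserved key.
eval_df = spans_df[["context.span_id"]].copy()
eval_df["label"] = ["correct", "incorrect"]

print(eval_df.columns.tolist())  # -> ['context.span_id', 'label']
```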
Datasets uploaded directly to Phoenix (e.g., via upload_dataset) do not automatically include a span_id unless you explicitly add it as a column. Span IDs are typically present when datasets are created from traced spans, not from arbitrary uploads. See creating datasets from spans.

Would you like guidance on how to add span IDs to your dataset?

Sources:
- creating datasets from spans
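If you do need span ids on an uploaded dataset, one option is to join it against your exported spans on a shared key. The column names and join key below are illustrative assumptions, not a fixed Phoenix schema:

```python
import pandas as pd

# Exported spans (mocked; in practice from get_spans_dataframe()).
spans_df = pd.DataFrame(
    {
        "context.span_id": ["span-aaa", "span-bbb"],
        "attributes.input.value": ["What is 2+2?", "Capital of France?"],
    }
)

# Uploaded dataset rows: no span ids, but the same input text.
dataset_df = pd.DataFrame({"input": ["Capital of France?", "What is 2+2?"]})

# Attach span ids by joining on the shared input text.
merged = dataset_df.merge(
    spans_df.rename(columns={"attributes.input.value": "input"}),
    on="input",
    how="left",
)
print(merged["context.span_id"].tolist())
```

This only works when the shared key uniquely identifies each span; duplicate inputs would need a more specific key.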
Certainly! Here's a step-by-step guide to log evaluations to Phoenix from a dataset containing traced spans:

1. **Export Spans DataFrame**: Retrieve your spans with span IDs.

```
import phoenix as px

spans_df = px.Client().get_spans_dataframe()
```

2. **Prepare Evaluation DataFrame**: Create a DataFrame with at least a `span_id` column (matching your spans) and your evaluation results (e.g., `label`, `score`).

```
eval_df = spans_df[["context.span_id"]].copy()
eval_df["label"] = ...  # your evaluation logic here
eval_df["score"] = ...  # optional, your evaluation score
eval_df.rename(columns={"context.span_id": "span_id"}, inplace=True)
```

3. **Log Evaluations to Phoenix**: Use `SpanEvaluations` and `log_evaluations`.

```
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Your Eval Name", dataframe=eval_df)
)
```

After this, your evaluations will appear in the Phoenix UI attached to the corresponding spans. For more, see the [official guide](https://docs.arize.com/phoenix/tracing/how-to-tracing/feedback-and-annotations/llm-evaluations).

Would you like a full code example with mock evaluation logic?

Sources:
- [https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/evaluating-phoenix-traces.md](https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/evaluating-phoenix-traces.md)
- [https://docs.arize.com/phoenix/tracing/how-to-tracing/feedback-and-annotations/llm-evaluations#%3AR14d9uucqfkvfa%3A](https://docs.arize.com/phoenix/tracing/how-to-tracing/feedback-and-annotations/llm-evaluations#%3AR14d9uucqfkvfa%3A)
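For reference, the three steps above can be combined into one runnable sketch. Mock span ids and a trivial rule-based check stand in for real traces and an LLM judge, and the final logging call is commented out because it needs a running Phoenix server:

```python
import pandas as pd

# Step 1 (mocked): in a real run, spans_df = px.Client().get_spans_dataframe()
spans_df = pd.DataFrame(
    {
        "context.span_id": ["span-aaa", "span-bbb", "span-ccc"],
        "attributes.output.value": ["4", "Paris", "I don't know"],
    }
)

# Step 2: build the eval DataFrame; a trivial rule stands in for llm_classify.
eval_df = spans_df[["context.span_id"]].copy()
eval_df["label"] = [
    "correct" if out != "I don't know" else "incorrect"
    for out in spans_df["attributes.output.value"]
]
eval_df["score"] = (eval_df["label"] == "correct").astype(int)
eval_df = eval_df.rename(columns={"context.span_id": "span_id"})

# Step 3: log to Phoenix (needs a live server, so commented out here).
# import phoenix as px
# from phoenix.trace import SpanEvaluations
# px.Client().log_evaluations(
#     SpanEvaluations(eval_name="Mock Correctness", dataframe=eval_df)
# )

print(eval_df["score"].sum())  # -> 2
```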
We restructured our workflow and ended up with this solution:

1. Make a brand new project
2. Trace LLM calls
3. Fetch the LLM calls with px_client.query_spans
4. Use our custom eval function on the output
5. Finally, use px.Client().log_evaluations with the eval dataframe to log it into Phoenix 😄

Are there any other solutions that would work better?
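The fetch-and-evaluate part of that workflow could be sketched as follows. The SpanQuery usage follows Phoenix's query DSL, but the fetch is wrapped in a function and not executed here since it needs a live session; the mock frame just exercises the eval step:

```python
import pandas as pd

def fetch_llm_spans():
    """Fetch LLM spans from a running Phoenix instance (not called here)."""
    import phoenix as px
    from phoenix.trace.dsl import SpanQuery

    # Pull only LLM spans, keeping input/output for the evaluator.
    query = SpanQuery().where("span_kind == 'LLM'").select(
        input="input.value", output="output.value"
    )
    return px.Client().query_spans(query)

def run_custom_eval(spans_df: pd.DataFrame) -> pd.DataFrame:
    """Mock custom eval: flags empty outputs. Replace with your own judge."""
    eval_df = pd.DataFrame(index=spans_df.index)
    eval_df["label"] = [
        "ok" if isinstance(o, str) and o.strip() else "empty"
        for o in spans_df["output"]
    ]
    return eval_df

# Offline check with a mock frame indexed by span id.
mock = pd.DataFrame(
    {"input": ["hi"], "output": ["hello"]},
    index=pd.Index(["span-aaa"], name="context.span_id"),
)
print(run_custom_eval(mock)["label"].tolist())  # -> ['ok']
```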
Yes 🔒[private user], I can join a huddle in 5 minutes.
