Hello Phoenix Team, I have a dataframe that contains AI predictions as well as ground-truth label columns, and I am able to upload it as a dataset. Now I calculate accuracy, precision, recall, and F1 using the scikit-learn library. How can I log those results as an Experiment in Phoenix for that dataset?
You can log your own calculated global metrics (accuracy, precision, recall, F1) to Phoenix by creating a DataFrame with these metrics and a dummy span_id, then using px.Client().log_evaluations with SpanEvaluations. This will display your metrics in the Phoenix UI as part of the experiment results. Example:

import pandas as pd
from phoenix.trace import SpanEvaluations
import phoenix as px

metrics_df = pd.DataFrame({
    'span_id': [1],  # Dummy span_id for global metrics
    'accuracy': [accuracy],
    'precision': [precision],
    'recall': [recall],
    'f1': [f1]
})

client = px.Client()
client.log_evaluations(SpanEvaluations(eval_name="Global Metrics", dataframe=metrics_df))

Phoenix will display these metrics in the UI if they are part of the logged evaluations. For more, see this Slack thread.

Would you like more detail or a full example?

Sources:
- Phoenix Slack: Logging global metrics
💡 Hint: Mention RunLLM in the thread for followups.
I have logged my dataset to Phoenix, but I struggle to implement the whole code because I don't understand how I can retrieve the entire dataframe inside the task function. Based on the docs, it seems like the task iterates over each row of the dataframe, but I calculate my metrics over the entire dataframe at once.
import pandas as pd
from phoenix.trace import SpanEvaluations
import phoenix as px

metrics_df = pd.DataFrame({
    'span_id': [1],  # Dummy span_id for global metrics
    'accuracy': [accuracy],
    'precision': [precision],
    'recall': [recall],
    'f1': [f1]
})

client = px.Client()
client.log_evaluations(SpanEvaluations(eval_name="Global Metrics", dataframe=metrics_df))

@RunLLM I am trying to use this code snippet; however, I struggle to use the new self-contained client in place of px.Client() in this code. Can you please tell me how I can log evaluations with the new self-contained client?
The new self-contained client in Phoenix (as of recent versions) does not support logging evaluations in the same way as px.Client().log_evaluations. The arize-phoenix-client package (e.g., v1.13.2) does not have a log_evaluations method or an equivalent API for logging evaluations directly; this functionality is only available in the main phoenix package, not in the standalone client. You must use the main phoenix package and px.Client().log_evaluations for this workflow. See this GitHub issue for details.

If you are restricted to the self-contained client, you can only log span annotations using log_span_annotations_dataframe, not full evaluation DataFrames. For more, see this comment.

Would you like more detail or an example using log_span_annotations_dataframe?

Sources:
- GitHub Issue: How to log evaluations?
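If you do end up restricted to the self-contained client, the annotation route might look like the sketch below. The method name log_span_annotations_dataframe comes from the thread above, but the exact namespace, call signature, and accepted columns are assumptions; verify them against your installed arize-phoenix-client version.

```python
import pandas as pd

# from phoenix.client import Client  # arize-phoenix-client package (assumed import path)

# One annotation row per metric; span_id must reference a real span in your project.
annotations_df = pd.DataFrame({
    "span_id": ["<your-span-id>"] * 4,   # placeholder, not a real span id
    "name": ["accuracy", "precision", "recall", "f1"],
    "annotator_kind": ["CODE"] * 4,      # programmatic (non-human) annotation
    "score": [0.90, 0.88, 0.91, 0.89],   # replace with your scikit-learn values
})

# Hypothetical call shape -- check your client version before relying on it:
# Client().annotations.log_span_annotations_dataframe(dataframe=annotations_df)
```

The DataFrame shape (one row per metric, keyed by span_id) is the part that carries over regardless of the exact client method.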
import pandas as pd
from phoenix.trace import SpanEvaluations
import phoenix as px

metrics_df = pd.DataFrame({
    'span_id': [1],  # Dummy span_id for global metrics
    'accuracy': [accuracy],
    'precision': [precision],
    'recall': [recall],
    'f1': [f1]
})

client = px.Client()
client.log_evaluations(SpanEvaluations(eval_name="Global Metrics", dataframe=metrics_df))

@RunLLM If I use the code above to log_evaluations, where in the Phoenix UI can I find them?
If you log global metrics like accuracy, precision, recall, and F1 using log_evaluations as shown, they will appear in the Phoenix UI as evaluation annotations attached to the relevant project or experiment. These metrics are displayed in the evaluation results or annotation panels, typically at the top of the experiment or trace view, depending on your workflow and UI version. For experiments, summary metrics are shown in the experiment table (see screenshot).

Note: The metrics will only be visible if your DataFrame is linked to a valid project/session and the UI supports displaying global (non-row-level) metrics. For more, see this PR and docs.

Would you like more detail or a UI walkthrough?

Sources:
- Phoenix PR: experiment evaluation summaries in the table
Hello Priyan, Thank you for the response. Can you send me some pointers or code snippets on how to log metrics to all datapoints in my dataset?
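One common pattern for attaching a global score to every datapoint is to broadcast the single value across all span ids. This is a sketch that assumes the px.Client() evaluation API shown earlier in this thread; the hard-coded span_ids stand in for the index you would get from px.Client().get_spans_dataframe() in a live session.

```python
import pandas as pd

# import phoenix as px
# from phoenix.trace import SpanEvaluations

accuracy = 0.90  # your scikit-learn result

# In a live session these would come from px.Client().get_spans_dataframe().index:
span_ids = ["span-a", "span-b", "span-c"]

# Broadcast the single global score to every span.
eval_df = pd.DataFrame(
    {"score": [accuracy] * len(span_ids)},
    index=pd.Index(span_ids, name="span_id"),
)

# px.Client().log_evaluations(SpanEvaluations(eval_name="accuracy", dataframe=eval_df))
```

SpanEvaluations expects the evaluation DataFrame to be keyed by span_id, which is why the ids form the index here.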
Priyan, quick question. I am using this as an example: https://arize.com/docs/phoenix/cookbook/tracing-and-annotations/generating-synthetic-datasets-for-llm-evaluators-and-agents#upload-agent-dataset What I don't understand is how the task function in this experiment receives the input and reference arguments while running the experiment. I don't see anything being passed in the run_experiment call in that example. It's very confusing to me how I actually pass arguments to the task and evaluator functions.
Ok, got it. And how do I access metadata columns in the task and evaluator? Do I just use the word metadata?
def task(input, metadata):
    question = input["question"]
    answer = metadata["experiment_id"]
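For intuition: run_experiment-style harnesses typically inspect the task's signature and pass in only the dataset fields the function asks for by name, which is why nothing is passed explicitly at the call site. Here is a minimal sketch of that binding pattern; it is not Phoenix's actual implementation, just an illustration of the mechanism.

```python
import inspect

def bind_and_call(fn, example):
    # Fields a run_experiment-style harness commonly exposes by parameter name.
    available = {
        "input": example["input"],
        "expected": example.get("expected"),
        "metadata": example.get("metadata", {}),
        "example": example,
    }
    # Pass only the arguments the function's signature declares.
    params = inspect.signature(fn).parameters
    kwargs = {name: available[name] for name in params if name in available}
    return fn(**kwargs)

def task(input, metadata):
    # Parameter names select which fields the harness injects.
    return f"{input['question']} / {metadata['experiment_id']}"

example = {
    "input": {"question": "What is Phoenix?"},
    "metadata": {"experiment_id": "exp-1"},
}
result = bind_and_call(task, example)
```

Under this scheme, renaming a parameter (e.g. metadata) changes which field the harness binds, which matches the behavior the docs describe for task and evaluator functions.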