I want to do some human-driven evals, and it seems like annotations are the right way to go. Is there a way to see aggregated results of annotations, similar to how we see it for evals? I added some annotations, but I don't see any high-level summary:
Yes, Phoenix supports viewing aggregated annotation results. Annotation summaries (such as mean scores and label distributions) are available in the UI and can be shown in tables or popovers as you navigate spans or traces. This aggregation is similar to how evals are summarized, and recent updates have improved the display of aggregate annotation metrics for both spans and experiments, including mean values and label distributions for each annotation type. If you don't see summaries, ensure you're using a recent version and check the Annotations tab or summary columns in the UI.

Would you like more detail on where to find these summaries or how to configure them?

Sources:
- https://github.com/Arize-ai/phoenix/issues/7489
- https://github.com/Arize-ai/phoenix/issues/3496
- https://github.com/Arize-ai/phoenix/issues/7188
Ahh, this is for spans and traces; I am more interested in specific experiment runs.
We can add the labels, but I can't see the aggregated results for that experiment.
ah okay, well the (relatively) new phoenix client has the ability to pull annotations in different formats, so you could do the analysis locally if that helps
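As a rough sketch of that local analysis: once you've pulled annotations into a pandas DataFrame (the exact fetch call depends on your phoenix client version, so it's stubbed out here with hypothetical data — the column names `span_id`, `name`, `label`, and `score` are assumptions, not a guaranteed schema), the experiment-level average is just a groupby:

```python
# Sketch: aggregate annotation results locally with pandas.
# Assumes you've already exported annotations from Phoenix into a
# DataFrame; the rows below are stand-in data with a hypothetical shape
# of one row per (span, annotation) pair.
import pandas as pd

annotations = pd.DataFrame(
    [
        {"span_id": "a1", "name": "rule_accuracy_human", "label": "correct", "score": 1.0},
        {"span_id": "a2", "name": "rule_accuracy_human", "label": "incorrect", "score": 0.0},
        {"span_id": "a3", "name": "rule_accuracy_human", "label": "correct", "score": 1.0},
    ]
)

# Mean score per annotation name -- the experiment-level summary
# you'd want to compare across subsequent experiments.
mean_scores = annotations.groupby("name")["score"].mean()
print(mean_scores)

# Label distribution, useful when annotations are categorical.
label_counts = annotations.groupby(["name", "label"]).size()
print(label_counts)
```

Re-running this per experiment and comparing the means would give a makeshift version of the human-feedback tracking discussed below, until it's surfaced in the UI.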
I can't see the aggregated results for that experiment.
in your first screenshot above, is evaluate_rule_accuracy an average for the 24 run count?
or are you saying you want to see the average per run (over the experiments)?
yes, but that is a regular eval that I loaded in with an LLM as a judge (clearly the rules aren't very accurate 😅). But I want to go in and add my own eval of how accurate the rule is based on my human opinion, because the LLM as a judge isn't so good
so I want to keep track of whether I am improving based on human feedback across subsequent experiments
So I am able to add the annotations to individual experiment example traces, but there's no way to aggregate across the entire experiment
i.e. take the average
i created an enhancement issue for us to track this
i think what we may be able to do is to pull the annotations from the spans and display them on the experiment summary page
