I want to do some human-driven evals, and it seems like annotations are the right way to go. Is there a way to see aggregated results of annotations, similar to how we see it for evals? I added some annotations, but I don't see any high-level summary:
Yes, Phoenix supports viewing aggregated annotation results. Annotation summaries (such as mean scores and label distributions) are available in the UI and can be shown in tables or popovers as you navigate spans or traces. This aggregation is similar to how evals are summarized, and recent updates have improved the display of aggregate annotation metrics for both spans and experiments, including mean values and label distributions for each annotation type. If you don't see summaries, ensure you're using a recent version and check the Annotations tab or summary columns in the UI.

Would you like more detail on where to find these summaries or how to configure them?

Sources:
- https://github.com/Arize-ai/phoenix/issues/7489
- https://github.com/Arize-ai/phoenix/issues/3496
- https://github.com/Arize-ai/phoenix/issues/7188
Ahh, this is for spans and traces; I am more interested in specific experiment runs.
We can add the labels, but I can't see the aggregated results for that experiment.
ah okay, well the (relatively) new phoenix client has the ability to pull annotations in different formats, so you could do the analysis locally if that helps
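As a rough sketch of that local analysis: once you've pulled annotations into a pandas DataFrame (the exact fetch call depends on your phoenix client version, so it's stubbed out here with hypothetical data — the column names `span_id`, `name`, `label`, and `score` are assumptions, not a guaranteed schema), the experiment-level average is just a groupby:

```python
# Sketch: aggregate annotation results locally with pandas.
# Assumes you've already exported annotations from Phoenix into a
# DataFrame; the rows below are stand-in data with a hypothetical shape
# of one row per (span, annotation) pair.
import pandas as pd

annotations = pd.DataFrame(
    [
        {"span_id": "a1", "name": "rule_accuracy_human", "label": "correct", "score": 1.0},
        {"span_id": "a2", "name": "rule_accuracy_human", "label": "incorrect", "score": 0.0},
        {"span_id": "a3", "name": "rule_accuracy_human", "label": "correct", "score": 1.0},
    ]
)

# Mean score per annotation name -- the experiment-level summary
# you'd want to compare across subsequent experiments.
mean_scores = annotations.groupby("name")["score"].mean()
print(mean_scores)

# Label distribution, useful when annotations are categorical.
label_counts = annotations.groupby(["name", "label"]).size()
print(label_counts)
```

Re-running this per experiment and comparing the means would give a makeshift version of the human-feedback tracking discussed below, until it's surfaced in the UI.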
I can't see the aggregated results for that experiment.
in your first screenshot above, is evaluate_rule_accuracy an average for the 24 run count?
or are you saying you want to see the average per run (over the experiments)?
yes, but that is a regular eval that I loaded in with an LLM as a judge (clearly the rules aren't very accurate 😅). But I want to go in and add my own eval of how accurate the rule is based on my human opinion, because the LLM as a judge isn't so good
so I want to keep track of whether I am improving based on human feedback across subsequent experiments
So I am able to add the annotations to individual experiment example traces, but there's no way to aggregate across the entire experiment
i.e. take the average
i created an enhancement issue for us to track this
i think what we may be able to do is to pull the annotations from the spans and display them on the experiment summary page
