I want to do some human-driven evals, and annotations seem like the right way to go. Is there a way to see aggregated results of annotations, similar to how we see it for evals? I added some annotations but don't see any high-level summary:
Ahh, this is for spans and traces; I am more interested in specific experiment runs.
We can add the labels, but I can't see the aggregated results for that experiment.
ah okay, well the relatively new phoenix client has the ability to pull annotations in different formats, so you can do analysis locally if that might help
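for example, once the annotations are pulled into a pandas DataFrame you could aggregate them locally. This is just a sketch: the column names (`annotation_name`, `score`) are assumptions, not the actual Phoenix export schema, so adjust to whatever the client actually returns:

```python
import pandas as pd

# Hypothetical annotation export -- in practice this DataFrame would come
# from the phoenix client; the columns below are illustrative assumptions.
annotations = pd.DataFrame(
    {
        "annotation_name": ["rule_accuracy"] * 4,
        "score": [1.0, 0.0, 1.0, 1.0],  # human labels: 1 = accurate, 0 = not
    }
)

# Average score per annotation name -- the experiment-level summary
# that the UI doesn't surface yet.
summary = annotations.groupby("annotation_name")["score"].mean()
print(summary)  # rule_accuracy -> 0.75
```

same idea works for counts or label distributions with `.value_counts()` instead of `.mean()`.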
I can't see the aggregated results for that experiment.
in your first screenshot above, is `evaluate_rule_accuracy` an average over the 24 runs?
or are you saying you want to see the average per run (over the experiments)?
yes, but that is a regular eval that I loaded in with LLM-as-a-judge (clearly the rules aren't very accurate 😅). But I want to go in and add my own eval of how accurate the rule is based on my human opinion, because the LLM-as-a-judge isn't so good
so I want to keep track of whether I am improving based on human feedback across subsequent experiments
So I am able to add the annotations to individual experiment example traces, but there's no way to aggregate across the entire experiment,
i.e. take the average
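until that exists in the UI, tracking improvement across experiments can be done locally the same way. Again a hedged sketch: the `experiment_id` and `human_rule_accuracy` columns are made-up names standing in for whatever the exported annotations actually contain:

```python
import pandas as pd

# Hypothetical per-run human annotations from two experiments;
# column names are illustrative, not the Phoenix schema.
runs = pd.DataFrame(
    {
        "experiment_id": ["exp-1", "exp-1", "exp-2", "exp-2"],
        "human_rule_accuracy": [0.0, 1.0, 1.0, 1.0],
    }
)

# Per-experiment average of the human score, to see whether
# later experiments improve on earlier ones.
per_experiment = runs.groupby("experiment_id")["human_rule_accuracy"].mean()
print(per_experiment)
# exp-1    0.5
# exp-2    1.0
```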
i created an enhancement issue for us to track this
i think what we may be able to do is to pull the annotations from the spans and display them on the experiment summary page
