OK! We'll keep an eye out for the upcoming examples. In the meantime, we'll implement a local workaround for the session-level evaluations. Really appreciate the support!
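For anyone else reading along, our local workaround boils down to averaging span-level eval scores per session outside Phoenix. A minimal sketch of the idea (plain Python; the record shape and session IDs here are made-up stand-ins, not Phoenix data structures or APIs):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span-level eval records: (session_id, eval_name, score).
# In practice these would come from your exported span evaluations.
span_evals = [
    ("sess-1", "relevance", 0.9),
    ("sess-1", "relevance", 0.7),
    ("sess-1", "faithfulness", 1.0),
    ("sess-2", "relevance", 0.4),
]

def session_scores(records):
    """Aggregate span-level scores into one score per (session, eval) pair."""
    buckets = defaultdict(list)
    for session_id, eval_name, score in records:
        buckets[(session_id, eval_name)].append(score)
    # Averaging is just one choice; min/max or a weighted scheme may fit better.
    return {key: mean(scores) for key, scores in buckets.items()}

scores = session_scores(span_evals)
```

With the toy data above, `sess-1` ends up with averaged relevance (~0.8) and faithfulness scores, which we then attach to the session in our own bookkeeping until native support lands.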
🔒[private user] thanks so much for the quick reply, and that's great news about the upcoming updates! At the moment, we're using the Phoenix evals library to run span-level evaluations such as toxicity, faithfulness, and relevance, and it's working well for our current needs. That said, we're now looking to evaluate entire sessions (i.e., full conversations composed of multiple traces/spans) to assess aspects like:
Resolution efficiency (was the user issue resolved efficiently?)
Journey clarity (was the interaction confusing or disjointed?)
Overall helpfulness or user satisfaction
Additionally (and while I have your attention, haha 😄), we're exploring whether Phoenix might support custom turn-based grouping, such as grouping spans by user > agent > user. We've noticed that evaluating each span individually sometimes masks the actual outcome of the conversation, especially in cases where a full exchange is needed to assess context and quality. We've already built logic outside Phoenix to group and analyze sessions this way, but we're very interested in bringing that into the native Phoenix eval flow, even if initially through helper functions or experimental features. Happy to share examples or help test things as new features roll out!
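To make the turn-based grouping idea concrete, here is a rough sketch of the logic we built outside Phoenix (plain Python; the `Span` shape is a hypothetical stand-in for real span records, not a Phoenix type):

```python
from dataclasses import dataclass

@dataclass
class Span:
    role: str      # "user" or "agent" -- simplified stand-in for a real span
    content: str

def group_into_turns(spans):
    """Group a chronological list of spans into user -> agent exchanges.

    Each turn collects the user span(s) plus the agent span(s) that respond
    to them; a new turn starts whenever a user span follows an agent span.
    """
    turns, current = [], []
    for span in spans:
        if span.role == "user" and current and current[-1].role == "agent":
            turns.append(current)
            current = []
        current.append(span)
    if current:
        turns.append(current)
    return turns

conversation = [
    Span("user", "My order hasn't arrived."),
    Span("agent", "Let me check the tracking."),
    Span("agent", "It ships tomorrow."),
    Span("user", "Thanks, that helps!"),
    Span("agent", "Happy to help."),
]

turns = group_into_turns(conversation)
# -> two turns: [user, agent, agent] and [user, agent]
```

Evaluating each multi-span turn as a unit (rather than each span in isolation) is what lets the eval see the full exchange in context.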
Hi everyone! I'm currently working on a use case where we've successfully implemented and logged evaluations at the span level using Phoenix, and it's been working great so far. Now we're exploring how to score full sessions (i.e., conversations that group multiple traces). This would be highly valuable for us, as our evaluations are more meaningful at the session level than on isolated spans. I've seen in issue #2619 that session tracking shipped with Phoenix 7.0, and I also came across references to ContextVarsRuntimeContext and context propagation via using_session(id="..."). This makes me think that some of the plumbing is in place, but I couldn't find clear examples of how to log evaluations (annotations, metrics, etc.) at the session level via the API. So, before falling back to a custom implementation outside of Phoenix, I wanted to ask:
Has anyone successfully logged session-level evaluations with Phoenix 7.0+?
Is this officially supported in the latest public release, or are there recommended workarounds?
If not yet public, is there a roadmap or documentation for this?
I've read the session setup guide and other docs but might have missed something. I'm still getting up to speed with Phoenix, so any guidance would be truly appreciated. Thanks in advance!
Session-level Evaluations Support in Phoenix
