Overcoming Challenges with LLM Judges: Seeking Solutions

·Mar 26, 2025 01:15 AM

I have been struggling with LLM judges: variance, inconsistency, latency, cost, and more. Has anyone here figured it out and gotten LLM judges working smoothly?

6 comments


Mikyo
·
Hey Jane W. - this can definitely be a hard process, depending on the type of evals you're trying to do. What type of evals are you trying to build? What have you tried?
Jason
·
Jane W. are you using a binary (two choices) or categorical judge? Number ranges can cause problems. Have you used explanations? They can show you why decisions are made, and you can then change the template to fix problems. Those are probably the two power-user approaches to try first to help with variance.
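A minimal sketch of a binary judge that returns an explanation alongside its label. Everything here is an assumption for illustration: the template wording, the `EXPLANATION:` / `LABEL:` output format, and the names `JUDGE_TEMPLATE`, `ALLOWED_LABELS`, and `parse_judgment` are hypothetical, not any library's API. The model call itself is left out.

```python
import re

# Hypothetical prompt template: forces a fixed label set plus an explanation.
JUDGE_TEMPLATE = """You are evaluating whether an answer is correct.
Question: {question}
Answer: {answer}
Respond in exactly this format:
EXPLANATION: <one or two sentences>
LABEL: <correct or incorrect>
"""

# The only labels the judge is allowed to return (illustrative choice).
ALLOWED_LABELS = ("correct", "incorrect")


def parse_judgment(raw, allowed=ALLOWED_LABELS):
    """Return (label, explanation); label is None if it is not in `allowed`."""
    label_match = re.search(r"LABEL:\s*(\w+)", raw, flags=re.IGNORECASE)
    label = label_match.group(1).lower() if label_match else None
    if label not in allowed:
        label = None  # treat anything outside the fixed set as unparseable
    expl_match = re.search(
        r"EXPLANATION:\s*(.*?)(?=\n\s*LABEL:|\Z)",
        raw,
        flags=re.IGNORECASE | re.DOTALL,
    )
    explanation = expl_match.group(1).strip() if expl_match else ""
    return label, explanation


# Example with a hand-written judge response (not real model output).
prompt = JUDGE_TEMPLATE.format(question="Capital of France?", answer="Paris")
label, explanation = parse_judgment(
    "EXPLANATION: The answer names the right city.\nLABEL: correct"
)
```

Reading the explanations for the cases where the label is wrong or unparseable is what tells you which part of the template to change.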
Jason
·
On cost, I would build tests and human annotations to judge your judge. Then move to the “mini” versions of models for cost and latency, checking whether you maintain performance.
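A rough sketch of "judging the judge": compare each judge's labels against human annotations on the same examples, then see whether a cheaper model holds up. The labels below are made-up toy data, and `agreement` and `cohen_kappa` are hypothetical helper names; Cohen's kappa is used here as one common chance-corrected agreement measure, not something prescribed in this thread.

```python
from collections import Counter


def agreement(human, judge):
    """Fraction of examples where the judge label equals the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    p_chance = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    if p_chance == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy data: human annotations vs. two judges on the same six examples.
human = ["correct", "incorrect", "correct", "correct", "incorrect", "correct"]
large = ["correct", "incorrect", "correct", "correct", "incorrect", "incorrect"]
mini = ["correct", "correct", "correct", "correct", "incorrect", "incorrect"]

for name, judge in [("large", large), ("mini", mini)]:
    print(name, round(agreement(human, judge), 3), round(cohen_kappa(human, judge), 3))
```

If the mini model's numbers stay close to the larger model's on your annotated set, the cheaper judge is probably safe to use; if they drop, the annotated set shows you where.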
Jane W.
·
Thank you Mikyo and Jason. I have worked on several use cases. Sometimes I need ranges and other times I need classifications. Explanations do help quite a lot, but less so as volume increases. Then we need humans (or me). I like your idea of “mini” versions and will explore more in that direction.
Jason
·
Jane W. Great, do be wary of continuous ranges as evals: they are not linear and have some strange end-of-line properties. We tend to like 1-star, 2-star, 3-star to give fixed categorical buckets, but not a 0-1 range. We do like turning a binary into a 0 or 1 value so we can average.
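A small sketch of that scoring approach, assuming hypothetical label names (`correct` / `incorrect`, `1-star` to `3-star`) and toy labels: binary labels become 0 or 1 so they can be averaged, while star ratings stay as fixed buckets and are reported as counts.

```python
from collections import Counter
from statistics import mean

# Illustrative label-to-score mapping; the label names are assumptions.
BINARY_SCORES = {"correct": 1, "incorrect": 0}
STAR_BUCKETS = ("1-star", "2-star", "3-star")


def binary_scores(labels, mapping=BINARY_SCORES):
    """Map binary labels to 0/1, skipping anything outside the mapping."""
    return [mapping[label] for label in labels if label in mapping]


def bucket_counts(labels, buckets=STAR_BUCKETS):
    """Count how many labels fall into each fixed categorical bucket."""
    counts = Counter(label for label in labels if label in buckets)
    return {bucket: counts[bucket] for bucket in buckets}


# Toy judge outputs, not real data.
binary_labels = ["correct", "correct", "incorrect", "correct"]
star_labels = ["3-star", "2-star", "3-star", "1-star", "3-star"]

pass_rate = mean(binary_scores(binary_labels))  # average of 1, 1, 0, 1
stars = bucket_counts(star_labels)
```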
Dhruv N.
·
Jane W. curious where you landed on this. At Google we’d lean more on encoder-style models rather than generative decoders, to get much more stable scores that we could tune with human ratings.