I have been struggling with LLM judges: variance, inconsistency, latency, cost, and so on. Has anyone here figured it out, and is anyone using LLM judges like a breeze?
Jane W., are you using a binary (two-choice) or categorical judge? Numeric score ranges can cause problems. Have you tried asking for explanations? They can help show you why decisions are made, and you can then change the template to fix the problems you see. Those two (discrete labels and explanations) are probably the power-user approaches to try first to help with variance.
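To make the binary-plus-explanation idea concrete, here is a rough sketch of what such a template and its parser could look like. The template wording, the label names ("correct"/"incorrect"), and the helper names `build_prompt` and `parse_verdict` are all made up for illustration; the actual model call is left out, so plug in whatever client you use.

```python
import re

# Hypothetical template: two allowed labels, with the explanation written first
# so the reasoning is visible when a decision looks wrong.
JUDGE_TEMPLATE = """You are evaluating an answer to a question.

Question: {question}
Answer: {answer}

First explain your reasoning in one or two sentences.
Then give a verdict of exactly one word: "correct" or "incorrect".

Respond in this format:
EXPLANATION: <your reasoning>
LABEL: <correct|incorrect>
"""

ALLOWED_LABELS = {"correct", "incorrect"}


def build_prompt(question: str, answer: str) -> str:
    """Fill the judge template for one example."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_verdict(raw: str):
    """Return (label, explanation); label is None if the output is off-format."""
    label_match = re.search(r"LABEL:\s*(\w+)", raw, re.IGNORECASE)
    expl_match = re.search(
        r"EXPLANATION:\s*(.+?)(?=\n\s*LABEL:|\Z)", raw, re.IGNORECASE | re.DOTALL
    )
    label = label_match.group(1).lower() if label_match else None
    if label not in ALLOWED_LABELS:
        label = None  # treat anything outside the two choices as a parse failure
    explanation = expl_match.group(1).strip() if expl_match else ""
    return label, explanation
```

Rejecting anything outside the two labels (instead of guessing) makes off-format outputs easy to count, and reading the explanations on the failures is usually what tells you how to change the template.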
On cost, I would build tests/human annotations to judge your judge. Then move to the “mini” versions of models for cost/latency, checking whether you maintain performance against those annotations.
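A minimal sketch of "judging the judge", assuming you already have human labels and each judge's labels for the same examples as plain lists. The function names and the toy labels below are placeholders, not real results; Cohen's kappa is included as one common way to correct agreement for chance, which matters when one label dominates.

```python
from collections import Counter


def agreement(human, judge):
    """Fraction of examples where the judge label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two label lists."""
    n = len(human)
    p_observed = agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Expected agreement if both raters labeled independently at their own rates.
    p_expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # convention chosen here: both sides used one identical label
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
full_judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
mini_judge = ["pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

for name, labels in [("full", full_judge), ("mini", mini_judge)]:
    print(name, agreement(human, labels), round(cohen_kappa(human, labels), 3))
```

If the mini model's numbers stay close to the larger model's on your annotated set, the cheaper judge is probably good enough; how close is "close" is a call you have to make for your use case.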
