Choosing LLM Evals for Complex Judgments in AI Comparison
Juswanth .. If what you are trying to judge requires complex judgment, you will want to use an LLM Eval. In general, I think you hit a brick wall pretty quickly with non-LLM-Eval methods when evaluating LLMs. If non-LLM Evals work for you, they are definitely simpler to understand and manage. But in many use cases a plain string comparison is not going to give you the signal you need. This is especially true if you don't have "ground truth" and want to check something in production. A simple example: you want to compare a new LLM generation with a previous generation. In this case you do have ground truth (the previous generation), so you might get away with something like a ROUGE or BLEU score on the text. Even then you might want an LLM Eval, since it gives an explanation of why it thinks the two are different, and you can tune its sensitivity in the prompt. For Retrieval Evals, we see folks lean heavily on LLM Evals; there are not a lot of options that work in production (no ground truth) that are not LLM Evals.
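To make the comparison case concrete, here is a rough sketch, not a reference implementation. `rouge1_f1` is a simplified unigram-overlap score in the spirit of ROUGE-1 (real ROUGE libraries tokenize and stem differently), and `build_compare_prompt` is a hypothetical helper that only builds the text you would send to whatever LLM you use as the judge. The function names, the prompt wording, and the `sensitivity` parameter are all assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (simpler than real ROUGE tokenizers)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 style score: F1 over overlapping unigrams."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def build_compare_prompt(
    previous: str,
    new: str,
    sensitivity: str = "Ignore rewording; flag only changes in meaning.",
) -> str:
    """Hypothetical LLM Eval prompt; the wording is an assumption, tune it to taste."""
    return (
        "You are comparing two generations of an answer.\n"
        f"Sensitivity: {sensitivity}\n\n"
        f"[PREVIOUS]\n{previous}\n\n"
        f"[NEW]\n{new}\n\n"
        "Reply with a label, 'same' or 'different', followed by a "
        "one-sentence explanation of why."
    )


if __name__ == "__main__":
    previous = "The refund was approved."
    new = "The refund was not approved."
    # High word overlap even though the meaning flipped.
    print(f"ROUGE-1 style F1: {rouge1_f1(previous, new):.3f}")
    # This text would go to the judge model of your choice.
    print(build_compare_prompt(previous, new))
```

In this made-up pair the overlap score is high even though the meaning is reversed, which is the kind of signal a string compare misses. The `sensitivity` line in the prompt is where you would tighten or loosen what counts as "different".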
