Understanding Model Evals vs. Task-Based LLM Evaluations
I'm not sure much of the community understands the difference between model evals (like MMLU) and task-based LLM-as-judge evals. I think most of what practicing AI engineers actually use are task-based evals, though the community talks more about model evals.

The core difference: a task-based eval is a fixed prompt template for a specific task, such as evaluating the output of an LLM summarization. A model eval has no fixed prompt template; it's a general test suite where the entire prompt is the test question.

A model eval is like the SAT: a large bank of questions meant to measure how well the model answers questions in general. A task eval is like a performance review for a specific job: a template for evaluating the outcome of that task, applied in the same format, over and over, for each piece of data.
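To make the task-based side concrete, here's a minimal sketch of a fixed-template LLM-as-judge eval for summarization. The template text, the `build_judge_prompt` helper, and `call_judge_model` are all hypothetical names for illustration; in practice `call_judge_model` would wrap whatever LLM API you use.

```python
# Minimal sketch of a task-based eval: ONE fixed prompt template,
# applied in the same format to every (source, summary) pair.
# Only the data changes between runs; the template never does.

JUDGE_TEMPLATE = """You are grading a summary.

Source text:
{source}

Summary:
{summary}

Score the summary's faithfulness from 1 to 5 and reply with only the number."""


def build_judge_prompt(source: str, summary: str) -> str:
    """Fill the fixed template with one row of data."""
    return JUDGE_TEMPLATE.format(source=source, summary=summary)


def run_task_eval(rows, call_judge_model):
    """Apply the same template over each row and average the judge's scores.

    `call_judge_model` is a stand-in for an actual LLM call: it takes the
    filled prompt and returns the judge model's text response.
    """
    scores = []
    for row in rows:
        prompt = build_judge_prompt(row["source"], row["summary"])
        scores.append(int(call_judge_model(prompt).strip()))
    return sum(scores) / len(scores)
```

Contrast this with a model eval like MMLU, where there is no shared template: each dataset item *is* the whole prompt, and the harness just compares the model's answer to a key.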
