Understanding Model Evals vs. Task-Based LLM Evaluations
I'm not sure much of the community understands the difference between model evals (like MMLU) and task-based LLM-as-judge evals. I think most of what practicing AI engineers actually use are task-based evals, though the community talks more about model evals.

The core difference: a task-based eval is a fixed prompt template for a specific task, such as evaluating the output of an LLM summarization. A model eval has no fixed prompt template; it's a general test suite where the entire prompt is the test question.

A model eval is like the SAT: a large bank of questions meant to measure how well the model answers questions in general. A task eval is like a performance review for a specific job: a template for evaluating the outcome of that task, applied in the same format, over and over, for each piece of data.
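To make the task-based side concrete, here's a minimal sketch of a fixed-template LLM-as-judge eval for summarization. The template text, the `build_judge_prompt` helper, and `call_judge_model` are all hypothetical names for illustration; in practice `call_judge_model` would wrap whatever LLM API you use.

```python
# Minimal sketch of a task-based eval: ONE fixed prompt template,
# applied in the same format to every (source, summary) pair.
# Only the data changes between runs; the template never does.

JUDGE_TEMPLATE = """You are grading a summary.

Source text:
{source}

Summary:
{summary}

Score the summary's faithfulness from 1 to 5 and reply with only the number."""


def build_judge_prompt(source: str, summary: str) -> str:
    """Fill the fixed template with one row of data."""
    return JUDGE_TEMPLATE.format(source=source, summary=summary)


def run_task_eval(rows, call_judge_model):
    """Apply the same template over each row and average the judge's scores.

    `call_judge_model` is a stand-in for an actual LLM call: it takes the
    filled prompt and returns the judge model's text response.
    """
    scores = []
    for row in rows:
        prompt = build_judge_prompt(row["source"], row["summary"])
        scores.append(int(call_judge_model(prompt).strip()))
    return sum(scores) / len(scores)
```

Contrast this with a model eval like MMLU, where there is no shared template: each dataset item *is* the whole prompt, and the harness just compares the model's answer to a key.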
