Understanding QA Correctness vs. Hallucination Evaluation | Arize AI Community

Mikyo
·
Hey Trantor D. I think they are similar in nature but subtly different: hallucination detects when an LLM strays from the provided context, whereas QA correctness determines whether the answer actually answers the user's question. They are benchmarked against different datasets: https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/hallucinations
Trantor D.
·
For example - it's possible for something to not be a hallucination (it actually comes from the docs) but NOT actually answer the user's question
Trantor D.
·
What I can't see is this: if QACorrect is TRUE, meaning the answer:
comes from the docs and
does answer the question,
I can't actually see Hallucination ever being true...
Mikyo
·
In the scenario above I think you are right
Trantor D.
·
I see - so it's only when QACorrectness is False that the Hallucination check gives us extra information
Mikyo
·
Yeah, admittedly the fact that both of these evals are rooted in reference context makes them a bit similar in nature. I think we probably will make a more generic QA correctness eval, so you could see whether a hallucinated (ungrounded) answer might still actually answer the question. I think that's kinda where BYOE (bring your own evals) comes into play: depending on your task, you might want to build out your own set of benchmarks, as the pre-tested ones might not fit your needs (https://docs.arize.com/phoenix/evaluation/how-to-evals/bring-your-own-evaluator)
Trantor D.
·
Yeah will definitely implement my own as well - just wanted to start with some baselines to make sure everything works "end-to-end" before adding the complexity of our own evals lol
Mikyo
·
Totally.
Teodor C.
·
Trantor D. / Mikyo I can add some extra input here because I ran into the same confusion a while ago, and seeing my app answer a question wrongly cleared it up for me (at least this is how I like to look at the difference between QA Correctness and Hallucinations):
Let's say my question is this: "How many apples are in the green basket?"
Three documents are retrieved that each contain an answer to this question: one states that there are 2 apples, one states that there are 5 apples, and another states that there are 9 apples.
The most recent and accurate document is the one stating that there are 9 apples in the green basket.
If the answer from the LLM based on the retrieved documents is 2, 5 or 9, the answer is not hallucinated, because it is grounded in the retrieved documentation. However, if the answer is not 9 apples, the question is not correctly answered, even though the answer is grounded in the retrieved documents. This, at least, is the way I understand the difference.
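To make the apples example concrete, here is a toy sketch in plain Python. It is not the Phoenix evaluators (those use an LLM as judge); the `Doc` class, the `is_grounded` and `is_correct` helpers, and the document dates are all hypothetical and only mirror the reasoning above: "grounded" means the answer appears in some retrieved document, "correct" means it matches the most recent one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    apples: int    # apple count stated by this document
    updated: date  # hypothetical last-updated date, made up for illustration


def is_grounded(answer: int, docs: list[Doc]) -> bool:
    # Stand-in for "not hallucinated": the answer appears in at least one retrieved doc.
    return any(d.apples == answer for d in docs)


def is_correct(answer: int, docs: list[Doc]) -> bool:
    # Stand-in for "QA correct": the answer matches the most recent doc.
    latest = max(docs, key=lambda d: d.updated)
    return answer == latest.apples


docs = [
    Doc(apples=2, updated=date(2023, 1, 1)),
    Doc(apples=5, updated=date(2023, 6, 1)),
    Doc(apples=9, updated=date(2024, 1, 1)),
]

# 7 is an answer that appears in none of the documents.
for answer in (2, 5, 9, 7):
    print(answer, "grounded:", is_grounded(answer, docs), "correct:", is_correct(answer, docs))
```

Under these assumptions, 2 and 5 come out grounded but not correct, 9 is both, and 7 is neither, which matches the point earlier in the thread that the hallucination check only adds information when QA correctness is false.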