What is the best object to return from an evaluator function when the evaluation is meaningless (for example, there was no answer, so qa_correctness doesn't apply)? Currently I return None, but that results in an error.
Hi Igor B., yeah, I'd make a separate label like "N/A" or "no answer" for the metric. We do try our best to summarize the results in the UI depending on how the evaluations are distributed, but if something isn't working, please file an issue, because we're actively working on improving the experiment UI right now.
You should be able to accomplish this by just returning "N/A" as a string.
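A minimal sketch of that suggestion (the missing-answer check and the grade_output helper are illustrative, not part of the original code; expected is dict-like as in the snippet below): when there is nothing to grade, return the string "N/A" instead of None, so the run gets an explicit label rather than erroring.

from phoenix.experiments.evaluators import create_evaluator  # import path may differ by Phoenix version

@create_evaluator(kind="llm")
def qa_correctness(input, output, expected):
    # No reference answer: the metric does not apply, so return a label instead of None.
    if not expected.get("answer"):
        return "N/A"
    # Otherwise grade as usual and return a float in [0, 1].
    return grade_output(input, output, expected)  # hypothetical scoring helper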
import re

from phoenix.experiments.evaluators import create_evaluator  # import path may differ by Phoenix version

@create_evaluator(kind="llm")  # need the decorator, or the kind will default to "code"
def qa_correctness(input, output, expected):
    # Ask the judge LLM to grade the output against the reference answer.
    message_content = CUSTOM_QA_PROMPT_TEMPLATE_WITH_EXPLANATION.format(
        input=input["scenario"], reference=expected["answer"], output=output
    )
    response_message_content = chat_completion_request_openai(message_content)

    # Extract the judge's reasoning; fall back to the raw response if the tag is missing.
    reasoning = re.findall(r"<reasoning>(.*?)</reasoning>", response_message_content, re.DOTALL)
    reasoning = reasoning[0].strip() if reasoning else response_message_content

    # The score is required; fail if the judge did not emit one.
    score = re.findall(r"<score>(.*?)</score>", response_message_content, re.DOTALL)
    if not score:
        raise ValueError("score not found in response message content")
    score = round(float(score[0]) / 100.0, 2)

    # The label is optional.
    label = re.findall(r"<label>(.*?)</label>", response_message_content, re.DOTALL)
    label = label[0] if label else None

    if label is not None:
        return score, label, reasoning
    return score
It is a simple LLM evaluator, but I want to distinguish between experiments where:
qa_correctness is 90% and all examples have been evaluated, and
qa_correctness is 90% but only 10% of the examples have been evaluated, while the rest failed.
is setting the label on all the scores insufficient to determine that?
In the UI we may see visualisations of experiments that look similar at first glance, but one of the experiments may have had 90% of its evaluations fail and provide no scores.
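Building on the "N/A" suggestion above, a hedged sketch of how the evaluator could make those two experiments look different: instead of raising when the judge returns no score, return an explicit label such as "not_evaluated" (an illustrative name, not an existing convention), so a run where 90% of evaluations fail shows up as 90% "not_evaluated" labels rather than as a similar-looking score distribution. This reuses the prompt template, OpenAI helper, and imports from the snippet above.

@create_evaluator(kind="llm")
def qa_correctness(input, output, expected):
    # Metric does not apply when there is no reference answer.
    if not expected.get("answer"):
        return "N/A"
    message_content = CUSTOM_QA_PROMPT_TEMPLATE_WITH_EXPLANATION.format(
        input=input["scenario"], reference=expected["answer"], output=output
    )
    response_message_content = chat_completion_request_openai(message_content)
    score = re.findall(r"<score>(.*?)</score>", response_message_content, re.DOTALL)
    if not score:
        # Judge produced no score: label the failure instead of raising,
        # so failed evaluations stay visible in the label distribution.
        return "not_evaluated"
    return round(float(score[0]) / 100.0, 2)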
