What object is best to return from an evaluator function when the evaluation is meaningless (for example, there was no answer, so qa_correctness is meaningless)? Currently I return None, but this results in an error.
You should not return None from an evaluator function—this will cause an error. Instead, return an EvaluationResult object with appropriate fields set (e.g., label="N/A", score=None, explanation="No answer provided") or a dictionary with these keys. This ensures Phoenix can handle and display the result gracefully without raising errors. Returning None is not supported as a valid evaluation result type according to the documented evaluator output formats and codebase logic (source, source).

Would you like a code example or more detail on how to construct the EvaluationResult for this case?

Sources:
- https://arize-phoenix.readthedocs.io/en/latest/api/experiments.html#experiments.functions.run_experiment
- https://github.com/Arize-ai/phoenix/blob/main/src/phoenix/experiments/evaluators/utils.py
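For instance, a minimal sketch of the dictionary form described above (the function name and comparison logic are hypothetical; the key names mirror the EvaluationResult fields):

```python
# Sketch: return a placeholder result instead of None when the evaluation
# is meaningless. A dict with score/label/explanation keys is one of the
# accepted evaluator return shapes; EvaluationResult(...) works the same way.
def qa_correctness_or_na(output, expected):
    if not output:
        # No answer to grade: mark the run N/A instead of returning None
        return {"score": None, "label": "N/A", "explanation": "No answer provided"}
    correct = output.strip() == expected.strip()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
        "explanation": "Compared output against the reference answer",
    }

print(qa_correctness_or_na("", "Paris"))
print(qa_correctness_or_na("Paris", "Paris"))
```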
💡 Hint: Mention RunLLM in the thread for followups.
It could be possible that Phoenix does not provide a built-in metric that directly counts the number of "N/A" labels or None scores in experiments. The documentation and code examples focus on aggregating valid evaluation results (e.g., precision, recall) and do not mention a separate metric for tracking the count of unscored or "N/A" evaluations (source, source). You may need to compute this count manually from the evaluation results DataFrame.

Would you like more detail or a code snippet for how to count "N/A" or None results in your experiment output?

Sources:
- https://arize.com/docs/phoenix/learn/evaluation#retrieval-evaluation
Hi Igor B., yeah, I'd make a separate label like "N/A" or "no answer" for the metric. We definitely try our best to summarize the results in the UI depending on how evaluations are distributed, but if something isn't working, please file an issue, because we're actively working on improving the experiment UI right now.
It seems likely that you can count "N/A" or None results in your experiment output by filtering the evaluation results DataFrame for those values. For example, if your evaluation DataFrame has a label column, you can use pandas to count occurrences:

```python
import pandas as pd

# Assume eval_df is your evaluation results DataFrame
num_na = eval_df['label'].isin(['N/A', 'no answer']).sum()
num_none = eval_df['score'].isna().sum()
print(f"Number of N/A labels: {num_na}")
print(f"Number of None scores: {num_none}")
```

This approach is based on standard pandas usage and is consistent with Phoenix's evaluation result structures (source).

Would you like more detail or help adapting this to your specific experiment output?

Sources:
- https://github.com/Arize-ai/phoenix/blob/main/tests/unit/trace/dsl/test_query.py
- https://github.com/Arize-ai/phoenix/blob/main/docs/datasets-and-experiments/how-to-experiments/using-evaluators.md
Phoenix does not currently provide a built-in way to display the count of EvaluationResult(score=None, label='N/A') as a separate metric in the experiment run UI. The UI summarizes evaluation results and shows distributions, but custom counts like "N/A" must be computed manually from the results DataFrame (source, source). There is no documented API or UI feature to add a custom "N/A count" metric to the experiment dashboard.

Would you like a workaround or more detail on how to surface this information outside the UI?

Sources:
- https://github.com/Arize-ai/phoenix/issues/1267
because you should be able to accomplish this by just returning "N/A" as a string
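A minimal sketch of that suggestion (the function, inputs, and comparison are hypothetical; a plain string returned from an evaluator is treated as the label):

```python
# Sketch: an evaluator can return a bare string, which becomes the label.
def qa_correctness(output, expected):
    if not output:
        return "N/A"  # no answer, so the metric is not applicable
    return "correct" if output.strip() == expected.strip() else "incorrect"

print(qa_correctness("", "42"))    # no answer case
print(qa_correctness("42", "42"))  # answered case
```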
```python
import re

@create_evaluator(kind="llm")  # need the decorator or the kind will default to "code"
def qa_correctness(input, output, expected):
    message_content = CUSTOM_QA_PROMPT_TEMPLATE_WITH_EXPLANATION.format(
        input=input["scenario"], reference=expected["answer"], output=output
    )
    response_message_content = chat_completion_request_openai(message_content)

    # re.findall always returns a list, so checking emptiness is sufficient
    reasoning = re.findall(r"<reasoning>(.*?)</reasoning>", response_message_content, re.DOTALL)
    reasoning = reasoning[0].strip() if reasoning else response_message_content

    score = re.findall(r"<score>(.*?)</score>", response_message_content, re.DOTALL)
    if not score:
        raise ValueError("score not found in response message content")
    score = round(float(score[0]) / 100.0, 2)

    label = re.findall(r"<label>(.*?)</label>", response_message_content, re.DOTALL)
    if label:
        return score, label[0], reasoning
    return score
```

It is a simple LLM evaluator,
but I want to distinguish experiments where:
- qa_correctness is 90% and all examples have been evaluated, versus
- qa_correctness is 90% but only 10% of examples have been evaluated while the rest failed.
is setting the label on all the scores insufficient to determine that?
In the UI we may see visualizations of experiments that look similar at first glance, even though one of them had 90% of its evaluations fail and provide no scores.
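One way to tell such experiments apart outside the UI is to report evaluation coverage (fraction of rows with a score) alongside the mean score. A rough sketch, using made-up result DataFrames with the label/score columns discussed above:

```python
import pandas as pd

# Hypothetical results: both experiments average 0.9 on their scored rows,
# but experiment B only managed to evaluate 1 of 10 examples.
exp_a = pd.DataFrame({"label": ["correct"] * 9 + ["incorrect"],
                      "score": [0.9] * 10})
exp_b = pd.DataFrame({"label": ["correct"] + ["N/A"] * 9,
                      "score": [0.9] + [None] * 9})

for name, df in [("A", exp_a), ("B", exp_b)]:
    coverage = df["score"].notna().mean()   # fraction of rows actually scored
    mean_score = df["score"].mean()         # pandas skips NaN by default
    print(f"experiment {name}: mean score {mean_score:.2f}, coverage {coverage:.0%}")
```

Both experiments print the same mean score, but the coverage column makes the 100% vs 10% difference explicit.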
