What is the best object to return from an evaluator function when the evaluation is meaningless (for example, there was no answer, so qa_correctness doesn't apply)? Currently I return None, but that results in an error.
Hi Igor B., yeah, I'd make a separate label like "N/A" or "no answer" for the metric. We do try our best to summarize the results in the UI depending on how the evaluations are distributed, but if something isn't working, please file an issue, because we're actively working on improving the experiment UI right now.
You should be able to accomplish this by just returning "N/A" as a string.
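A minimal sketch of that suggestion (the missing-answer check and the grade_output helper are illustrative, not part of the original code; expected is dict-like as in the snippet below): when there is nothing to grade, return the string "N/A" instead of None, so the run gets an explicit label rather than erroring.

from phoenix.experiments.evaluators import create_evaluator  # import path may differ by Phoenix version

@create_evaluator(kind="llm")
def qa_correctness(input, output, expected):
    # No reference answer: the metric does not apply, so return a label instead of None.
    if not expected.get("answer"):
        return "N/A"
    # Otherwise grade as usual and return a float in [0, 1].
    return grade_output(input, output, expected)  # hypothetical scoring helper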
import re

from phoenix.experiments.evaluators import create_evaluator  # import path may differ by Phoenix version

@create_evaluator(kind="llm")  # need the decorator, or the kind will default to "code"
def qa_correctness(input, output, expected):
    # Ask the judge LLM to grade the output against the reference answer.
    message_content = CUSTOM_QA_PROMPT_TEMPLATE_WITH_EXPLANATION.format(
        input=input["scenario"], reference=expected["answer"], output=output
    )
    response_message_content = chat_completion_request_openai(message_content)

    # Extract the judge's reasoning; fall back to the raw response if the tag is missing.
    reasoning = re.findall(r"<reasoning>(.*?)</reasoning>", response_message_content, re.DOTALL)
    reasoning = reasoning[0].strip() if reasoning else response_message_content

    # The score is required; fail if the judge did not emit one.
    score = re.findall(r"<score>(.*?)</score>", response_message_content, re.DOTALL)
    if not score:
        raise ValueError("score not found in response message content")
    score = round(float(score[0]) / 100.0, 2)

    # The label is optional.
    label = re.findall(r"<label>(.*?)</label>", response_message_content, re.DOTALL)
    label = label[0] if label else None

    if label is not None:
        return score, label, reasoning
    return score
It is a simple LLM evaluator, but I want to distinguish between experiments where:
qa_correctness is 90% and all examples have been evaluated, and
qa_correctness is 90% but only 10% of the examples have been evaluated, while the rest failed.
is setting the label on all the scores insufficient to determine that?
In the UI we may see visualisations of experiments that look similar at first glance, but one of the experiments may have had 90% of its evaluations fail and provide no scores.
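Building on the "N/A" suggestion above, a hedged sketch of how the evaluator could make those two experiments look different: instead of raising when the judge returns no score, return an explicit label such as "not_evaluated" (an illustrative name, not an existing convention), so a run where 90% of evaluations fail shows up as 90% "not_evaluated" labels rather than as a similar-looking score distribution. This reuses the prompt template, OpenAI helper, and imports from the snippet above.

@create_evaluator(kind="llm")
def qa_correctness(input, output, expected):
    # Metric does not apply when there is no reference answer.
    if not expected.get("answer"):
        return "N/A"
    message_content = CUSTOM_QA_PROMPT_TEMPLATE_WITH_EXPLANATION.format(
        input=input["scenario"], reference=expected["answer"], output=output
    )
    response_message_content = chat_completion_request_openai(message_content)
    score = re.findall(r"<score>(.*?)</score>", response_message_content, re.DOTALL)
    if not score:
        # Judge produced no score: label the failure instead of raising,
        # so failed evaluations stay visible in the label distribution.
        return "not_evaluated"
    return round(float(score[0]) / 100.0, 2)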
