I am currently trying to calculate recall, precision, and F1 score evals for a binary classification of our LLM outputs on a dataset, but I'm not sure what makes the most sense using Phoenix. For example, for recall I attempted to build a custom eval metric comparing the output and expected values, but the issue is that I can't output a None value for examples with a ground-truth label of negative (since we only care about positive values in recall) without causing an error. Let me know what y'all think, ty!
To calculate recall, precision, and F1 score for binary classification in Phoenix, you can use the ClassificationEval class from the Phoenix library, which wraps sklearn's metrics. This class handles binary classification by letting you specify a positive label, and it automatically processes missing values. For recall, use the recall_score method, which focuses on the positive class. If you encounter issues with None values, ensure that your data is preprocessed to handle missing values, or use the _eliminate_missing_values_from_all_series function to clean your data before evaluation. Would you like more detail on how to implement this in Phoenix?

Sources:
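Whatever wrapper Phoenix uses internally, the underlying sklearn calls can be sketched directly. A minimal sketch with made-up labels purely for illustration; `pos_label` is what restricts each metric to the positive class:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for illustration only
expected = ["positive", "negative", "positive", "positive", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative"]

# 2 true positives out of 3 actual positives
recall = recall_score(expected, predicted, pos_label="positive")
# 2 true positives out of 3 predicted positives
precision = precision_score(expected, predicted, pos_label="positive")
f1 = f1_score(expected, predicted, pos_label="positive")
```

Note that sklearn computes these metrics over the full pair of label lists at once, which is relevant to the per-row limitation discussed later in this thread.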
- Phoenix metrics wrappers
💡 Hint: Mention RunLLM in the thread for followups.
RunLLM so would I pass this into the run_experiment method using the evaluators parameter (i.e. `run_experiment(evaluators=[ClassificationEval])`), or what is the proper way to implement this?
To implement custom evaluation metrics like recall, precision, and F1 score in Phoenix using the run_experiment method, you should define your custom evaluators and pass them as a list to the evaluators parameter. Here's a concise breakdown:

1. Define custom evaluators: Create functions for recall, precision, and F1 score that take the model's output and the expected output as inputs. These functions should return the calculated metric.
2. Use run_experiment: Pass these custom evaluator functions to the run_experiment method using the evaluators parameter.

Here's a basic example:

```python
from phoenix.experiments import run_experiment

def recall_evaluator(output, expected):
    # Implement recall calculation
    pass

def precision_evaluator(output, expected):
    # Implement precision calculation
    pass

def f1_score_evaluator(output, expected):
    # Implement F1 score calculation
    pass

experiment = run_experiment(
    dataset=your_dataset,
    task=your_task,
    evaluators=[recall_evaluator, precision_evaluator, f1_score_evaluator],
)
```

Would you like more detail on any specific part?

Sources:
- https://github.com/Arize-ai/phoenix/issues/4289
- https://github.com/arize-ai/phoenix/blob/main/src/phoenix/experiments/functions.py
RunLLM but that is the issue. In order to calculate these metrics you either need A) a list of the outputs and expected values at once, to get both the numerator and the denominator, or B) to be able to return a 0, 1, or None on an individual row-by-row basis so you can nullify rows that don't count towards the denominator in those metrics.
To calculate metrics like recall, precision, and F1 score in Phoenix, you can handle the evaluation on a row-by-row basis by returning a score of 0 or 1 for each row, and then aggregate these scores to compute the final metrics. Here's how you can approach it:

1. Row-by-row evaluation: For each row, return a score of 1 if the condition for the metric is met (e.g., a true positive for recall), and 0 otherwise. You can use a custom evaluator function to achieve this.
2. Aggregate scores: After evaluating each row, aggregate the scores to compute the final metric. For recall, sum the true positives and divide by the total number of actual positives.

Here's a basic example of how you might implement this:

```python
def recall_evaluator(output, expected):
    # Return 1 for a true positive, 0 otherwise
    return 1 if output == expected == "positive" else 0

def precision_evaluator(output, expected):
    # Return 1 for a true positive, 0 otherwise
    return 1 if output == "positive" and expected == "positive" else 0

def f1_score_evaluator(output, expected):
    # Implement F1 score calculation based on precision and recall
    pass

# Run experiment with custom evaluators
experiment = run_experiment(
    dataset=your_dataset,
    task=your_task,
    evaluators=[recall_evaluator, precision_evaluator, f1_score_evaluator],
)
```

Would you like more detail on any specific part?

Sources:
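Note that the aggregation step described above requires seeing every row's flag at once; a per-row evaluator in isolation cannot know the denominator. A sketch of that aggregation over hypothetical (expected, output) pairs:

```python
# Hypothetical (expected, output) pairs for the whole dataset
rows = [
    ("positive", "positive"),  # true positive
    ("negative", "positive"),  # false positive
    ("positive", "negative"),  # false negative
    ("negative", "negative"),  # true negative
]

true_positives = sum(1 for exp, out in rows if exp == out == "positive")
actual_positives = sum(1 for exp, _ in rows if exp == "positive")
predicted_positives = sum(1 for _, out in rows if out == "positive")

recall = true_positives / actual_positives          # denominator: actual positives
precision = true_positives / predicted_positives    # denominator: predicted positives
f1 = 2 * precision * recall / (precision + recall)
```

This is exactly the limitation the thread turns on: recall and precision divide by different subsets of rows, so a 0/1 score per row is not enough on its own.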
- https://github.com/arize-ai/phoenix/blob/main/scripts/rag/llama_index_w_evals_and_qa.py
- https://arize-ai.slack.com/archives/C04R3GXC8HK/p1725442860.605379
I am going to wait until I get a response from the team since that answer is incorrect
need A) a list of the outputs and expected values at once to get both the numerator and the denominator
currently Phoenix doesn't support aggregated eval metrics, but you can use the code below to retrieve the data you need
```python
from typing import Any

import pandas as pd
from httpx import Client

# GraphQL query that pages through an experiment's runs, returning each
# run's output alongside the expected output from its example revision.
query = """
query ($experimentId: GlobalID!, $after: String) {
  node (id: $experimentId) {
    ... on Experiment {
      id
      name
      runs (first: 1000, after: $after) {
        pageInfo {
          endCursor
          hasNextPage
        }
        edges {
          node {
            output
            example {
              revision {
                output
              }
              id
            }
          }
        }
      }
    }
  }
}
"""

def get_df(runs: dict[str, Any]) -> pd.DataFrame:
    # Flatten the nested edge/node payload into columns
    return pd.json_normalize(runs["edges"])

def get_experiment_runs(experiment_id: str, base_url: str = "http://localhost:6006/") -> pd.DataFrame:
    dfs = []
    has_next_page, after = True, ""
    client = Client(base_url=base_url)
    while has_next_page:
        variables: dict[str, Any] = {"experimentId": experiment_id}
        if after:
            variables["after"] = after
        response = client.post(
            "/graphql",
            json={"query": query, "variables": variables},
        )
        runs = response.json()["data"]["node"]["runs"]
        if not (df := get_df(runs)).empty:
            dfs.append(df)
        has_next_page = runs["pageInfo"]["hasNextPage"]
        after = runs["pageInfo"]["endCursor"]
    return pd.concat(dfs) if dfs else pd.DataFrame()

experiment_id = "RXhwZXJpbWVudDoz"  # replace with your experiment id
df = get_experiment_runs(experiment_id)
df
```
As a demonstration, here's my output from running the code above:
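Once the DataFrame is retrieved, the aggregate metrics can be computed from its columns with sklearn. A sketch using sample data in place of real experiment runs; the flattened column names `node.output` and `node.example.revision.output` are an assumption that depends on how json_normalize flattens the payload, so check `df.columns` against your own output:

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Sample of what get_experiment_runs might return; real column
# names depend on the GraphQL payload and json_normalize.
df = pd.DataFrame({
    "node.output": ["positive", "negative", "positive"],
    "node.example.revision.output": ["positive", "positive", "negative"],
})

y_true = df["node.example.revision.output"]  # expected labels
y_pred = df["node.output"]                   # model outputs

recall = recall_score(y_true, y_pred, pos_label="positive")        # 0.5
precision = precision_score(y_true, y_pred, pos_label="positive")  # 0.5
f1 = f1_score(y_true, y_pred, pos_label="positive")                # 0.5
```

Because the whole experiment is in one DataFrame, both numerators and denominators are available at once, which sidesteps the per-row limitation discussed earlier in the thread.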
