I am currently trying to calculate recall, precision, and F1 score evals for a binary classification of our LLM outputs on a dataset, but I'm not sure what makes the most sense using Phoenix. For example, for recall I attempted to build a custom eval metric comparing the output and expected values, but the issue is that I can't output a None value for examples with a ground-truth label of negative (since we only care about positive values in recall) without causing an error. Let me know what y'all think, ty!
To calculate recall, precision, and F1 score for binary classification in Phoenix, you can use the ClassificationEval class from the Phoenix library, which wraps sklearn's metrics. This class handles binary classification by letting you specify a positive label, and it automatically processes missing values. For recall, use the recall_score method, which focuses on the positive class. If you encounter issues with None values, ensure that your data is preprocessed to handle missing values, or use the _eliminate_missing_values_from_all_series function to clean your data before evaluation. Would you like more detail on how to implement this in Phoenix?

Sources:
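Whatever wrapper Phoenix uses internally, the underlying sklearn calls can be sketched directly. A minimal sketch with made-up labels purely for illustration; `pos_label` is what restricts each metric to the positive class:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for illustration only
expected = ["positive", "negative", "positive", "positive", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative"]

# 2 true positives out of 3 actual positives
recall = recall_score(expected, predicted, pos_label="positive")
# 2 true positives out of 3 predicted positives
precision = precision_score(expected, predicted, pos_label="positive")
f1 = f1_score(expected, predicted, pos_label="positive")
```

Note that sklearn computes these metrics over the full pair of label lists at once, which is relevant to the per-row limitation discussed later in this thread.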
- Phoenix metrics wrappers
💡 Hint: Mention RunLLM in the thread for followups.
RunLLM so would I pass this into the run_experiment method using the evaluators parameter (i.e. `run_experiment(evaluators=[ClassificationEval])`), or what is the proper way to implement this?
To implement custom evaluation metrics like recall, precision, and F1 score in Phoenix using the run_experiment method, you should define your custom evaluators and pass them as a list to the evaluators parameter. Here's a concise breakdown:

1. Define custom evaluators: Create functions for recall, precision, and F1 score that take the model's output and the expected output as inputs. These functions should return the calculated metric.
2. Use run_experiment: Pass these custom evaluator functions to the run_experiment method using the evaluators parameter.

Here's a basic example:

```python
from phoenix.experiments import run_experiment

def recall_evaluator(output, expected):
    # Implement recall calculation
    pass

def precision_evaluator(output, expected):
    # Implement precision calculation
    pass

def f1_score_evaluator(output, expected):
    # Implement F1 score calculation
    pass

experiment = run_experiment(
    dataset=your_dataset,
    task=your_task,
    evaluators=[recall_evaluator, precision_evaluator, f1_score_evaluator],
)
```

Would you like more detail on any specific part?

Sources:
- https://github.com/Arize-ai/phoenix/issues/4289
- https://github.com/arize-ai/phoenix/blob/main/src/phoenix/experiments/functions.py
RunLLM but that is the issue. In order to calculate these metrics you either need A) a list of the outputs and expected values at once, to get both the numerator and the denominator, or B) to be able to return a 0, 1, or None on an individual row-by-row basis so you can nullify rows that don't count towards the denominator in those metrics.
To calculate metrics like recall, precision, and F1 score in Phoenix, you can handle the evaluation on a row-by-row basis by returning a score of 0 or 1 for each row, and then aggregate these scores to compute the final metrics. Here's how you can approach it:

1. Row-by-row evaluation: For each row, return a score of 1 if the condition for the metric is met (e.g., a true positive for recall), and 0 otherwise. You can use a custom evaluator function to achieve this.
2. Aggregate scores: After evaluating each row, aggregate the scores to compute the final metric. For recall, sum the true positives and divide by the total number of actual positives.

Here's a basic example of how you might implement this:

```python
def recall_evaluator(output, expected):
    # Return 1 for a true positive, 0 otherwise
    return 1 if output == expected == "positive" else 0

def precision_evaluator(output, expected):
    # Return 1 for a true positive, 0 otherwise
    return 1 if output == "positive" and expected == "positive" else 0

def f1_score_evaluator(output, expected):
    # Implement F1 score calculation based on precision and recall
    pass

# Run experiment with custom evaluators
experiment = run_experiment(
    dataset=your_dataset,
    task=your_task,
    evaluators=[recall_evaluator, precision_evaluator, f1_score_evaluator],
)
```

Would you like more detail on any specific part?

Sources:
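Note that the aggregation step described above requires seeing every row's flag at once; a per-row evaluator in isolation cannot know the denominator. A sketch of that aggregation over hypothetical (expected, output) pairs:

```python
# Hypothetical (expected, output) pairs for the whole dataset
rows = [
    ("positive", "positive"),  # true positive
    ("negative", "positive"),  # false positive
    ("positive", "negative"),  # false negative
    ("negative", "negative"),  # true negative
]

true_positives = sum(1 for exp, out in rows if exp == out == "positive")
actual_positives = sum(1 for exp, _ in rows if exp == "positive")
predicted_positives = sum(1 for _, out in rows if out == "positive")

recall = true_positives / actual_positives          # denominator: actual positives
precision = true_positives / predicted_positives    # denominator: predicted positives
f1 = 2 * precision * recall / (precision + recall)
```

This is exactly the limitation the thread turns on: recall and precision divide by different subsets of rows, so a 0/1 score per row is not enough on its own.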
- https://github.com/arize-ai/phoenix/blob/main/scripts/rag/llama_index_w_evals_and_qa.py
- https://arize-ai.slack.com/archives/C04R3GXC8HK/p1725442860.605379
I am going to wait until I get a response from the team since that answer is incorrect
need A) a list of the outputs and expected values at once to get both the numerator and the denominator
currently Phoenix doesn't support aggregated eval metrics, but you can use the code below to retrieve the data you need
```python
from typing import Any

import pandas as pd
from httpx import Client

# GraphQL query that pages through an experiment's runs, returning each
# run's output alongside the expected output from its example revision.
query = """
query ($experimentId: GlobalID!, $after: String) {
  node (id: $experimentId) {
    ... on Experiment {
      id
      name
      runs (first: 1000, after: $after) {
        pageInfo {
          endCursor
          hasNextPage
        }
        edges {
          node {
            output
            example {
              revision {
                output
              }
              id
            }
          }
        }
      }
    }
  }
}
"""

def get_df(runs: dict[str, Any]) -> pd.DataFrame:
    # Flatten the nested edge/node payload into columns
    return pd.json_normalize(runs["edges"])

def get_experiment_runs(experiment_id: str, base_url: str = "http://localhost:6006/") -> pd.DataFrame:
    dfs = []
    has_next_page, after = True, ""
    client = Client(base_url=base_url)
    while has_next_page:
        variables: dict[str, Any] = {"experimentId": experiment_id}
        if after:
            variables["after"] = after
        response = client.post(
            "/graphql",
            json={"query": query, "variables": variables},
        )
        runs = response.json()["data"]["node"]["runs"]
        if not (df := get_df(runs)).empty:
            dfs.append(df)
        has_next_page = runs["pageInfo"]["hasNextPage"]
        after = runs["pageInfo"]["endCursor"]
    return pd.concat(dfs) if dfs else pd.DataFrame()

experiment_id = "RXhwZXJpbWVudDoz"  # replace with your experiment id
df = get_experiment_runs(experiment_id)
df
```
As a demonstration, here's my output from running the code above:
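Once the DataFrame is retrieved, the aggregate metrics can be computed from its columns with sklearn. A sketch using sample data in place of real experiment runs; the flattened column names `node.output` and `node.example.revision.output` are an assumption that depends on how json_normalize flattens the payload, so check `df.columns` against your own output:

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Sample of what get_experiment_runs might return; real column
# names depend on the GraphQL payload and json_normalize.
df = pd.DataFrame({
    "node.output": ["positive", "negative", "positive"],
    "node.example.revision.output": ["positive", "positive", "negative"],
})

y_true = df["node.example.revision.output"]  # expected labels
y_pred = df["node.output"]                   # model outputs

recall = recall_score(y_true, y_pred, pos_label="positive")        # 0.5
precision = precision_score(y_true, y_pred, pos_label="positive")  # 0.5
f1 = f1_score(y_true, y_pred, pos_label="positive")                # 0.5
```

Because the whole experiment is in one DataFrame, both numerators and denominators are available at once, which sidesteps the per-row limitation discussed earlier in the thread.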
