I'm trying to compute recall, precision, and F1 score for a binary classification of our LLM outputs on a dataset, but I'm not sure what approach makes the most sense in Phoenix. For example, for recall I tried building a custom eval metric that compares the output against the expected value, but I can't return None for examples whose ground-truth label is negative (recall only counts examples whose ground truth is positive) without causing an error. Let me know what y'all think, ty!
RunLLM so would I pass this into the `run_experiment` method via the `evaluators` parameter (i.e. `run_experiment(evaluators=[ClassificationEval])`), or what is the proper way to implement this?
RunLLM but that is the issue. To calculate these metrics you need either A) the full list of outputs and expected values at once, to get both the numerator and the denominator, or B) the ability to return 0, 1, or None on a row-by-row basis, so that rows that don't count towards a metric's denominator can be nullified.
I am going to wait until I get a response from the team since that answer is incorrect
need A) a list of the outputs and expected values at once to get both the numerator and the denominator
Currently Phoenix doesn't support aggregated eval metrics, but you can use the code below to retrieve the data you need:
from typing import Any

import pandas as pd
from httpx import Client

# GraphQL query: fetch an experiment's runs (up to 1000 per page)
# together with the expected output of each dataset example.
query = """
query ($experimentId: GlobalID!, $after: String) {
  node (id: $experimentId) {
    ... on Experiment {
      id
      name
      runs (first: 1000, after: $after) {
        pageInfo {
          endCursor
          hasNextPage
        }
        edges {
          node {
            output
            example {
              revision {
                output
              }
              id
            }
          }
        }
      }
    }
  }
}
"""


def get_df(runs: dict[str, Any]) -> pd.DataFrame:
    # Flatten the nested edges into one row per run.
    return pd.json_normalize(e for e in runs["edges"])


def get_experiment_runs(experiment_id: str, base_url: str = "http://localhost:6006/") -> pd.DataFrame:
    dfs = []
    has_next_page, after = True, ""
    # Follow the pagination cursor until every page has been fetched.
    while has_next_page:
        variables = {
            "experimentId": experiment_id,
        }
        if after:
            variables["after"] = after
        response = Client(base_url=base_url).post(
            "/graphql",
            json={
                "query": query,
                "variables": variables,
            },
        )
        runs = response.json()["data"]["node"]["runs"]
        if not (df := get_df(runs)).empty:
            dfs.append(df)
        has_next_page = runs["pageInfo"]["hasNextPage"]
        after = runs["pageInfo"]["endCursor"]
    # pd.concat raises on an empty list, so return an empty frame when there are no runs.
    return pd.concat(dfs) if dfs else pd.DataFrame()


experiment_id = "RXhwZXJpbWVudDoz"  # replace with your experiment id
df = get_experiment_runs(experiment_id)
df
As a demonstration, here's my output from running the code above:
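Once the runs are in a DataFrame, the aggregate metrics (option A from the question) can be computed over all rows at once. Here's a rough sketch of that, not anything built into Phoenix: `binary_metrics` is a hypothetical helper, and the column names `node.output` and `node.example.revision.output` are only what I'd expect `pd.json_normalize` to produce if each output is a plain label string. If your outputs are nested objects the columns will be named differently, so check `df.columns` first. The toy frame just stands in for the real one.

```python
from typing import Optional

import pandas as pd


def binary_metrics(
    df: pd.DataFrame, output_col: str, expected_col: str, positive: str
) -> dict[str, Optional[float]]:
    """Compute precision, recall and F1 over all rows at once."""
    predicted = df[output_col] == positive
    actual = df[expected_col] == positive
    tp = int((predicted & actual).sum())   # predicted positive, truly positive
    fp = int((predicted & ~actual).sum())  # predicted positive, truly negative
    fn = int((~predicted & actual).sum())  # predicted negative, truly positive
    # A metric is reported as None when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy stand-in for the DataFrame returned by get_experiment_runs
# (column names and label values are assumptions).
toy = pd.DataFrame(
    {
        "node.output": ["positive", "positive", "negative", "negative", "positive"],
        "node.example.revision.output": ["positive", "negative", "positive", "negative", "positive"],
    }
)
metrics = binary_metrics(
    toy, "node.output", "node.example.revision.output", positive="positive"
)
print(metrics)
```

Negative ground-truth rows simply never enter the recall denominator (`tp + fn`), so no per-row None value is needed.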
