no worries Dustin! appreciate any insight whenever you get a chance to look :)
Hey Dustin N., coming with a little update here. I was trying to adapt the previous RAG relevance evaluation that I used from Arize to the new format with datasets & experiments. Here is how I upload my dataset (probably not the best way to do it; I couldn't get the output keys to work in any shape or form):

dataset = px.Client().upload_dataset(
    dataframe=full_documents_with_relevance_df,
    input_keys=['parent_id', 'input', 'reference', 'eval_label', 'eval_score', 'eval_explanation'],
    output_keys=[],
    dataset_name='retrieval-evaluation-dataset',
)

I tried to define the evaluators in the following way; I'd appreciate it if you could have a look and see if you spot anything weird:
import numpy as np
from sklearn.metrics import ndcg_score
from typing import Dict, Any
import pandas as pd

# Define k as a global variable
K = 10  # You can change this value as needed

def prepare_data(input_data: Dict[str, Any]) -> pd.DataFrame:
    df = pd.DataFrame([input_data])
    if 'eval_score' not in df.columns:
        raise ValueError("'eval_score' not found in input data")
    if 'document_position' not in df.columns:
        df['document_position'] = range(len(df))
    return df

def calculate_ndcg(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for _, group in grouped:
        n = max(K, len(group))
        eval_scores = np.zeros(n)
        doc_scores = np.zeros(n)
        eval_scores[:len(group)] = group['eval_score']
        doc_scores[:len(group)] = range(1, len(group) + 1)[::-1]  # Reverse rank as score
        try:
            result = ndcg_score([eval_scores], [doc_scores], k=K)
        except ValueError:
            result = np.nan
        results.append(result)
    return np.nanmean(results)

def calculate_precision_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = group['eval_score'].head(K).sum() / K
        results.append(result)
    return np.nanmean(results)

def calculate_hit_rate(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = int(group['eval_score'].head(K).sum() > 0)
        results.append(result)
    return np.nanmean(results)

def calculate_recall_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        relevant_at_k = group['eval_score'].head(K).sum()
        total_relevant = group['eval_score'].sum()
        result = relevant_at_k / total_relevant if total_relevant > 0 else 0
        results.append(result)
    return np.nanmean(results)

def ndcg_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_ndcg(df)
    return result

def precision_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_precision_at_k(df)
    return result

def hit_rate_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_hit_rate(df)
    return result

def recall_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_recall_at_k(df)
    return result

I am running my experiments like this:
run_experiment(
    dataset,
    task=lambda x: x,  # Identity function as we're not modifying the input
    evaluators=[
        ndcg_evaluator,
        precision_at_k_evaluator,
        hit_rate_evaluator,
        recall_at_k_evaluator
    ],
    experiment_name="retrieval_evaluation",
    experiment_metadata={"retrieval_system": "Evaluate the retrieval for the original user query", "k": K},
)

And the results are something like this (which is a little bit surprising, since ndcg, hit rate and recall are always identical in value):
🔗 View this experiment: http://localhost:6007/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDoy
Experiment Summary (07/18/24 01:25 PM +0000)
--------------------------------------------
evaluator n n_scores avg_score
0 hit_rate_evaluator 451 451 0.886918
1 ndcg_evaluator 451 451 0.886918
2 precision_at_k_evaluator 451 451 0.088692
3 recall_at_k_evaluator 451 451 0.886918
Tasks Summary (07/18/24 01:22 PM +0000)
---------------------------------------
n_examples n_runs n_errors
0 451 451 0
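Small addendum: I suspect the collapse comes from `prepare_data` building a one-row DataFrame from each `input_data` dict, so every evaluator call only ever sees a single document per `parent_id`. A toy repro with made-up binary scores (not my real data) showing that, in the single-row case, ndcg, hit rate and recall all coincide and precision is exactly that value divided by K:

```python
import numpy as np
from sklearn.metrics import ndcg_score

K = 10

def single_call_metrics(s):
    # One-row "group": only slot 0 holds a real document, padded to K slots,
    # mirroring what the evaluators compute when len(group) == 1.
    eval_scores = np.zeros(K)
    doc_scores = np.zeros(K)
    eval_scores[0] = s
    doc_scores[0] = 1  # the lone document is top-ranked
    ndcg = ndcg_score([eval_scores], [doc_scores], k=K)
    precision = s / K            # head(K).sum() / K with one row
    hit = int(s > 0)             # any relevant doc in the top K?
    recall = 1.0 if s > 0 else 0.0  # relevant_at_k / total_relevant
    return ndcg, precision, hit, recall

# 400 relevant + 51 irrelevant single-doc calls (made-up mix, ~0.887 positive)
scores = [1] * 400 + [0] * 51
rows = [single_call_metrics(s) for s in scores]
ndcg, prec, hit, rec = (float(np.mean(col)) for col in zip(*rows))
# ndcg == hit == rec, and prec == ndcg / K
```

If that's right, the fix would be to group all retrieved documents of a trace into one dataset example before uploading, so each evaluator call sees the full ranked list, but I haven't verified that against the datasets & experiments docs.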
So basically my requirement is more or less straightforward: every time I make changes to the retrieval part (prompt, embeddings, new docs, chunk size, etc.), I export the new trace file and run the evaluators against the new traces. Does it make sense? Plus, do you suggest a different, maybe more optimal, way of doing this? BTW, my next step is to do the same for the hallucination & correctness part. Thanks in advance and sorry for the ultra long message 😭
hello! first of all, amazing work with datasets & experiments, really awesome stuff ❤️. Question: do you have any examples of how the retrieval evaluation or QA / hallucination evaluations would be integrated with experiments? Curious to see an example of how you would do it, or any suggestions you might have 🙂
thanks, I'll update the thread once I get it running :)
so I guess the question is more around where you set the hostname for the LiteLLM machine. For example, if you use ollama you set an env variable; is it the same variable?
we use litellm with presidio for pii masking :)
Hello Team, I'm trying to use Claude Sonnet 3.5 via LiteLLM. So far the only tutorial I found is regarding ollama running locally. How would one define the LiteLLM and the RelevanceEvaluator to point to a different machine running the litellm endpoint? Basically the content from here: https://docs.arize.com/phoenix/api/evaluation-models#litellmmodel
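Just to make the question concrete, here is the kind of sketch I have in mind. The env-var names follow LiteLLM conventions and I'm not sure they're the right ones here, the host/port are placeholders, and the Phoenix part is untested:

```python
import os

# Placeholder credentials for the remote model (name assumed, not verified).
os.environ["ANTHROPIC_API_KEY"] = "sk-placeholder"

# If the remote machine runs a LiteLLM proxy, one common pattern is to go
# through its OpenAI-compatible endpoint (variable name is an assumption):
os.environ["OPENAI_API_BASE"] = "http://my-litellm-host:4000"

# Then, untested, point the Phoenix eval model at it, e.g.:
# from phoenix.evals import LiteLLMModel, RelevanceEvaluator
# model = LiteLLMModel(model="claude-3-5-sonnet-20240620")
# relevance_evaluator = RelevanceEvaluator(model)
```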
will do, thanks !
seems to be working Roger Y.! thanks a lot as always! One additional question (I can raise a separate thread if needed): I'm trying to use Claude Sonnet 3.5 via LiteLLM. So far the only tutorial I found covers ollama running locally. How would one define the LiteLLM model and the RelevanceEvaluator to point to a LiteLLM machine?
see DM
