hello! first of all, amazing work with the datasets & experiments - really awesome stuff ❤️. Question - do you guys have any examples of how retrieval evaluation or QA / hallucination evaluations would be integrated with experiments? Curious to see an example of how you would do it, or any suggestions you might have 🙂
hi Teodor! Thanks so much for trying out datasets and experiments - we're so glad you like them. In order to build them out we revisited the concept of an evaluator, and as of this moment we haven't reconciled our existing phoenix.evals evaluators with the evaluators used for experiments. This is something we really hope to get to in the near future, and if you have any thoughts about what you'd like to see, please file a ticket. For the time being, if you give an example of the kind of evaluator you'd like to implement (with an example experiment task), we can try to show you some examples of how we'd implement it for experiments.
Hey Dustin N., coming with a little update here. I was trying to adapt the RAG relevance evaluation that I previously used from Arize to the new format with datasets & experiments. Here is how I upload my dataset - probably not the best way to do it (I couldn't get the output keys to work in any shape or form):
import phoenix as px

dataset = px.Client().upload_dataset(
    dataframe=full_documents_with_relevance_df,
    input_keys=['parent_id', 'input', 'reference', 'eval_label', 'eval_score', 'eval_explanation'],
    output_keys=[],
    dataset_name='retrieval-evaluation-dataset',
)

I tried to define the evaluators in the following way; I'd appreciate it if you could have a look and see if you spot anything that is weird:
import numpy as np
from sklearn.metrics import ndcg_score
from typing import Dict, Any
import pandas as pd

# Define k as a global variable
K = 10  # You can change this value as needed

def prepare_data(input_data: Dict[str, Any]) -> pd.DataFrame:
    df = pd.DataFrame([input_data])
    if 'eval_score' not in df.columns:
        raise ValueError("'eval_score' not found in input data")
    if 'document_position' not in df.columns:
        df['document_position'] = range(len(df))
    return df
def calculate_ndcg(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for _, group in grouped:
        n = max(K, len(group))
        eval_scores = np.zeros(n)
        doc_scores = np.zeros(n)
        eval_scores[:len(group)] = group['eval_score']
        doc_scores[:len(group)] = range(1, len(group) + 1)[::-1]  # Reverse rank as score
        try:
            result = ndcg_score([eval_scores], [doc_scores], k=K)
        except ValueError:
            result = np.nan
        results.append(result)
    return np.nanmean(results)
def calculate_precision_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = group['eval_score'].head(K).sum() / K
        results.append(result)
    return np.nanmean(results)

def calculate_hit_rate(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = int(group['eval_score'].head(K).sum() > 0)
        results.append(result)
    return np.nanmean(results)

def calculate_recall_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        relevant_at_k = group['eval_score'].head(K).sum()
        total_relevant = group['eval_score'].sum()
        result = relevant_at_k / total_relevant if total_relevant > 0 else 0
        results.append(result)
    return np.nanmean(results)
def ndcg_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_ndcg(df)
    return result

def precision_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_precision_at_k(df)
    return result

def hit_rate_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_hit_rate(df)
    return result

def recall_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_recall_at_k(df)
    return result

I am running my experiments like this:
run_experiment(
    dataset,
    task=lambda x: x,  # Identity function as we're not modifying the input
    evaluators=[
        ndcg_evaluator,
        precision_at_k_evaluator,
        hit_rate_evaluator,
        recall_at_k_evaluator,
    ],
    experiment_name="retrieval_evaluation",
    experiment_metadata={"retrieval_system": "Evaluate the retrieval for the original user query", "k": K},
)

And the results are something like this (which is a little bit surprising - NDCG, hit rate, and recall are always identical in value):
🔗 View this experiment: http://localhost:6007/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDoy
Experiment Summary (07/18/24 01:25 PM +0000)
--------------------------------------------
evaluator n n_scores avg_score
0 hit_rate_evaluator 451 451 0.886918
1 ndcg_evaluator 451 451 0.886918
2 precision_at_k_evaluator 451 451 0.088692
3 recall_at_k_evaluator 451 451 0.886918
Tasks Summary (07/18/24 01:22 PM +0000)
---------------------------------------
n_examples n_runs n_errors
0 451 451 0
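One thing I noticed while digging: prepare_data wraps each example in a one-row DataFrame, so every parent_id group contains exactly one document. With binary eval_score values, hit rate, recall, and NDCG would then all collapse to that single document's relevance, and precision@10 would become relevance / 10 - which could explain the identical averages. A dependency-free sketch of the degenerate case (single_doc_metrics is just illustrative, not part of my pipeline):

```python
K = 10

def single_doc_metrics(eval_score: int):
    # With a single document per group, head(K) is just that one document.
    hit_rate = int(eval_score > 0)                              # 1 iff the document is relevant
    recall = eval_score / eval_score if eval_score > 0 else 0   # also 1 iff relevant
    precision = eval_score / K                                  # relevance / 10
    return hit_rate, recall, precision

# For a relevant document, hit rate and recall (and NDCG, analogously) all
# come out to 1 while precision comes out to 0.1, so the averages over the
# dataset differ only by the factor of K.
```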
So basically my requirement is more or less straightforward: every time I make changes to the retrieval part (prompt, embeddings, new docs, chunk size, etc.), I export the new trace file and run the evaluators against the new traces. Does that make sense? Also, would you suggest a different, maybe more optimal way of doing this? BTW, my next step is to do the same for the hallucination & correctness part. Thanks in advance, and sorry for the ultra long message 😭
Hey, I'm so sorry for the delayed response. I've given this a pass but probably need a little bit to think about it. I'll respond here with an idea!
no worries Dustin! appreciate any insight whenever you get a chance to look :)
hey Teodor C., thanks for the patience! I think coming up with patterns for evaluating retrieval is going to be a super important use case, so I hope we can find a solution that works well for you. I'd personally make the retrieval process itself the task, instead of relying on tracing the entire application at once. Each experiment can then track the performance of a different retrieval setup (different settings, techniques, etc.). The dataset would contain either an input (query) that you use for retrieval, data to seed your corpus, or both! You can then define a task that simply performs your retrieval step:
def retrieval_task(input):
    corpus = input['corpus']
    query = input['query']
    # set up retrieval based on the corpus (e.g. a vector store)
    # run retrieval using the query
    return {"retrieved_documents": [...]}

Of course, if setting up your vector store (or what have you) is an expensive process, you can imagine leaving that step out.
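To make that concrete, here's a minimal, self-contained sketch of the task pattern, using a toy keyword-overlap scorer purely as a stand-in for a real vector store (keyword_score and the corpus shape are just illustrative assumptions):

```python
def keyword_score(query: str, document: str) -> float:
    # Toy relevance score: fraction of query terms that appear in the document.
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0

def retrieval_task(input):
    corpus = input["corpus"]  # list of document strings seeded from the dataset example
    query = input["query"]
    # Rank the corpus by the toy score, highest first, and keep the top 10.
    ranked = sorted(corpus, key=lambda doc: keyword_score(query, doc), reverse=True)
    return {"retrieved_documents": ranked[:10]}
```

Swapping the toy scorer for a real embedding search keeps the task signature (and everything downstream) unchanged, which is what makes it easy to compare retrieval variants across experiments.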
also, just FYI, the evaluator functions you pass in are very flexible with respect to the call signature - you don't need to define them with an output argument if you don't plan on using the output in the eval (though in the example I outlined, you probably would be)
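For instance (a hypothetical evaluator that only looks at the task output - the function name and "retrieved_documents" key are just illustrative):

```python
def num_documents_retrieved(output) -> float:
    # The signature declares only `output`; no input/expected arguments needed.
    return float(len(output["retrieved_documents"]))
```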
With this method, you can imagine storing a relevance-ordered list of retrieved documents in the output of each dataset example, allowing you to calculate metrics against what you get from each retrieval task
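For example, if each dataset example's output stored the known relevant documents under a (hypothetical) relevant_documents key, precision@K and recall@K evaluators might look something like this sketch - here `expected` is assumed to carry the example's stored output:

```python
K = 10

def precision_at_k(output, expected) -> float:
    # Fraction of the top-K retrieved documents that are actually relevant.
    retrieved = output["retrieved_documents"][:K]
    relevant = set(expected["relevant_documents"])
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def recall_at_k(output, expected) -> float:
    # Fraction of the relevant documents that show up in the top-K results.
    retrieved = set(output["retrieved_documents"][:K])
    relevant = set(expected["relevant_documents"])
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)
```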
