hello! first of all, amazing work with the datasets & experiments - really awesome stuff ❤️. Question - do you guys have any examples of how retrieval evaluation or QA / hallucination evaluations would be integrated with experiments? Curious to see an example of how you would do it, or any suggestions you might have 🙂
hi Teodor! Thanks so much for trying out datasets and experiments - we're so glad you like them. In order to build them out we revisited the concept of an evaluator, and as of this moment we haven't reconciled our existing phoenix.evals evaluators with the evaluators used for experiments. This is something we really hope to get to in the near future, and if you have any thoughts about what you'd like to see, please file a ticket. For the time being, if you give an example of the kind of evaluator you'd like to implement (with an example experiment task), we can try to show you some examples of how we'd implement it for experiments.
Hey Dustin N., coming with a little update here. I was trying to adapt the RAG relevance evaluation that I previously used from Arize to the new format with datasets & experiments. Here is how I upload my dataset - probably not the best way to do it (I couldn't get the output keys to work in any shape or form):
import phoenix as px

dataset = px.Client().upload_dataset(
    dataframe=full_documents_with_relevance_df,
    input_keys=['parent_id', 'input', 'reference', 'eval_label', 'eval_score', 'eval_explanation'],
    output_keys=[],
    dataset_name='retrieval-evaluation-dataset',
)

I tried to define the evaluators in the following way; I'd appreciate it if you could have a look and see if you spot anything that is weird:
import numpy as np
from sklearn.metrics import ndcg_score
from typing import Dict, Any
import pandas as pd

# Define k as a global variable
K = 10  # You can change this value as needed

def prepare_data(input_data: Dict[str, Any]) -> pd.DataFrame:
    df = pd.DataFrame([input_data])
    if 'eval_score' not in df.columns:
        raise ValueError("'eval_score' not found in input data")
    if 'document_position' not in df.columns:
        df['document_position'] = range(len(df))
    return df
def calculate_ndcg(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for _, group in grouped:
        n = max(K, len(group))
        eval_scores = np.zeros(n)
        doc_scores = np.zeros(n)
        eval_scores[:len(group)] = group['eval_score']
        doc_scores[:len(group)] = range(1, len(group) + 1)[::-1]  # Reverse rank as score
        try:
            result = ndcg_score([eval_scores], [doc_scores], k=K)
        except ValueError:
            result = np.nan
        results.append(result)
    return np.nanmean(results)
def calculate_precision_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = group['eval_score'].head(K).sum() / K
        results.append(result)
    return np.nanmean(results)

def calculate_hit_rate(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = int(group['eval_score'].head(K).sum() > 0)
        results.append(result)
    return np.nanmean(results)

def calculate_recall_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        relevant_at_k = group['eval_score'].head(K).sum()
        total_relevant = group['eval_score'].sum()
        result = relevant_at_k / total_relevant if total_relevant > 0 else 0
        results.append(result)
    return np.nanmean(results)
def ndcg_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_ndcg(df)
    return result

def precision_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_precision_at_k(df)
    return result

def hit_rate_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_hit_rate(df)
    return result

def recall_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_recall_at_k(df)
    return result

I am running my experiments like this:
run_experiment(
    dataset,
    task=lambda x: x,  # Identity function as we're not modifying the input
    evaluators=[
        ndcg_evaluator,
        precision_at_k_evaluator,
        hit_rate_evaluator,
        recall_at_k_evaluator,
    ],
    experiment_name="retrieval_evaluation",
    experiment_metadata={"retrieval_system": "Evaluate the retrieval for the original user query", "k": K},
)

And the results are something like this (which is a little bit surprising - NDCG, hit rate, and recall are always identical in value):
🔗 View this experiment: http://localhost:6007/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDoy
Experiment Summary (07/18/24 01:25 PM +0000)
--------------------------------------------
evaluator n n_scores avg_score
0 hit_rate_evaluator 451 451 0.886918
1 ndcg_evaluator 451 451 0.886918
2 precision_at_k_evaluator 451 451 0.088692
3 recall_at_k_evaluator 451 451 0.886918
Tasks Summary (07/18/24 01:22 PM +0000)
---------------------------------------
n_examples n_runs n_errors
0 451 451 0
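One thing I noticed while digging: prepare_data wraps each example in a one-row DataFrame, so every parent_id group contains exactly one document. With binary eval_score values, hit rate, recall, and NDCG would then all collapse to that single document's relevance, and precision@10 would become relevance / 10 - which could explain the identical averages. A dependency-free sketch of the degenerate case (single_doc_metrics is just illustrative, not part of my pipeline):

```python
K = 10

def single_doc_metrics(eval_score: int):
    # With a single document per group, head(K) is just that one document.
    hit_rate = int(eval_score > 0)                              # 1 iff the document is relevant
    recall = eval_score / eval_score if eval_score > 0 else 0   # also 1 iff relevant
    precision = eval_score / K                                  # relevance / 10
    return hit_rate, recall, precision

# For a relevant document, hit rate and recall (and NDCG, analogously) all
# come out to 1 while precision comes out to 0.1, so the averages over the
# dataset differ only by the factor of K.
```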
So basically my requirement is more or less straightforward: every time I make changes to the retrieval part (prompt, embeddings, new docs, chunk size, etc.), I export the new trace file and run the evaluators against the new traces. Does that make sense? Also, would you suggest a different, maybe more optimal way of doing this? BTW, my next step is to do the same for the hallucination & correctness part. Thanks in advance, and sorry for the ultra long message 😭
Hey, I'm so sorry for the delayed response. I've given this a pass but probably need a little bit to think about it. I'll respond here with an idea!
no worries Dustin! appreciate any insight whenever you get a chance to look :)
hey Teodor C., thanks for the patience! I think coming up with patterns for evaluating retrieval is going to be a super important use case, so I hope we can find a solution that works well for you. I'd personally make the retrieval process itself the task, instead of relying on tracing the entire application at once. Each experiment can then track the performance of a different retrieval setup (different settings, techniques, etc.). The dataset would contain either an input (query) that you use for retrieval, data to seed your corpus, or both! You can then define a task that simply performs your retrieval step:
def retrieval_task(input):
    corpus = input['corpus']
    query = input['query']
    # set up retrieval based on the corpus (e.g. a vector store)
    # run retrieval using the query
    return {"retrieved_documents": [...]}

Of course, if setting up your vector store (or what have you) is an expensive process, you can imagine leaving that step out.
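To make that concrete, here's a minimal, self-contained sketch of the task pattern, using a toy keyword-overlap scorer purely as a stand-in for a real vector store (keyword_score and the corpus shape are just illustrative assumptions):

```python
def keyword_score(query: str, document: str) -> float:
    # Toy relevance score: fraction of query terms that appear in the document.
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0

def retrieval_task(input):
    corpus = input["corpus"]  # list of document strings seeded from the dataset example
    query = input["query"]
    # Rank the corpus by the toy score, highest first, and keep the top 10.
    ranked = sorted(corpus, key=lambda doc: keyword_score(query, doc), reverse=True)
    return {"retrieved_documents": ranked[:10]}
```

Swapping the toy scorer for a real embedding search keeps the task signature (and everything downstream) unchanged, which is what makes it easy to compare retrieval variants across experiments.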
also, just FYI, the evaluator functions you pass in are very flexible with respect to the call signature - you don't need to define them with an output argument if you don't plan on using the output in the eval (though in the example I outlined, you probably would be)
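For instance (a hypothetical evaluator that only looks at the task output - the function name and "retrieved_documents" key are just illustrative):

```python
def num_documents_retrieved(output) -> float:
    # The signature declares only `output`; no input/expected arguments needed.
    return float(len(output["retrieved_documents"]))
```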
With this method, you can imagine storing a relevance-ordered list of retrieved documents in the output of each dataset example, allowing you to calculate metrics against what you get from each retrieval task
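For example, if each dataset example's output stored the known relevant documents under a (hypothetical) relevant_documents key, precision@K and recall@K evaluators might look something like this sketch - here `expected` is assumed to carry the example's stored output:

```python
K = 10

def precision_at_k(output, expected) -> float:
    # Fraction of the top-K retrieved documents that are actually relevant.
    retrieved = output["retrieved_documents"][:K]
    relevant = set(expected["relevant_documents"])
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def recall_at_k(output, expected) -> float:
    # Fraction of the relevant documents that show up in the top-K results.
    retrieved = set(output["retrieved_documents"][:K])
    relevant = set(expected["relevant_documents"])
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)
```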
