no worries Dustin! appreciate any insight whenever you get a chance to look :)
Hey Dustin N., coming with a little update here. I was trying to adapt the previous RAG relevance evaluation that I used from Arize to the new format with datasets & experiments. Here is how I upload my dataset (probably not the best way to do it; I couldn't get the output keys to work in any shape or form):

dataset = px.Client().upload_dataset(
    dataframe=full_documents_with_relevance_df,
    input_keys=['parent_id', 'input', 'reference', 'eval_label', 'eval_score', 'eval_explanation'],
    output_keys=[],
    dataset_name='retrieval-evaluation-dataset',
)

I tried to define the evaluators in the following way; I'd appreciate it if you could have a look and see if you spot anything weird:
import numpy as np
from sklearn.metrics import ndcg_score
from typing import Dict, Any
import pandas as pd

# Define k as a global variable
K = 10  # You can change this value as needed

def prepare_data(input_data: Dict[str, Any]) -> pd.DataFrame:
    df = pd.DataFrame([input_data])
    if 'eval_score' not in df.columns:
        raise ValueError("'eval_score' not found in input data")
    if 'document_position' not in df.columns:
        df['document_position'] = range(len(df))
    return df

def calculate_ndcg(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for _, group in grouped:
        n = max(K, len(group))
        eval_scores = np.zeros(n)
        doc_scores = np.zeros(n)
        eval_scores[:len(group)] = group['eval_score']
        doc_scores[:len(group)] = range(1, len(group) + 1)[::-1]  # Reverse rank as score
        try:
            result = ndcg_score([eval_scores], [doc_scores], k=K)
        except ValueError:
            result = np.nan
        results.append(result)
    return np.nanmean(results)

def calculate_precision_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = group['eval_score'].head(K).sum() / K
        results.append(result)
    return np.nanmean(results)

def calculate_hit_rate(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        result = int(group['eval_score'].head(K).sum() > 0)
        results.append(result)
    return np.nanmean(results)

def calculate_recall_at_k(df: pd.DataFrame) -> float:
    global K
    grouped = df.groupby('parent_id')
    results = []
    for name, group in grouped:
        relevant_at_k = group['eval_score'].head(K).sum()
        total_relevant = group['eval_score'].sum()
        result = relevant_at_k / total_relevant if total_relevant > 0 else 0
        results.append(result)
    return np.nanmean(results)

def ndcg_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_ndcg(df)
    return result

def precision_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_precision_at_k(df)
    return result

def hit_rate_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_hit_rate(df)
    return result

def recall_at_k_evaluator(input: Dict[str, Any], output: Dict[str, Any]) -> float:
    df = prepare_data(input)
    result = calculate_recall_at_k(df)
    return result

I am running my experiments like this:
run_experiment(
    dataset,
    task=lambda x: x,  # Identity function as we're not modifying the input
    evaluators=[
        ndcg_evaluator,
        precision_at_k_evaluator,
        hit_rate_evaluator,
        recall_at_k_evaluator
    ],
    experiment_name="retrieval_evaluation",
    experiment_metadata={"retrieval_system": "Evaluate the retrieval for the original user query", "k": K},
)

And the results are something like this (which is a little bit surprising, since ndcg, hit rate and recall are always identical in value):
🔗 View this experiment: http://localhost:6007/datasets/RGF0YXNldDox/compare?experimentId=RXhwZXJpbWVudDoy
Experiment Summary (07/18/24 01:25 PM +0000)
--------------------------------------------
evaluator n n_scores avg_score
0 hit_rate_evaluator 451 451 0.886918
1 ndcg_evaluator 451 451 0.886918
2 precision_at_k_evaluator 451 451 0.088692
3 recall_at_k_evaluator 451 451 0.886918
Tasks Summary (07/18/24 01:22 PM +0000)
---------------------------------------
n_examples n_runs n_errors
0 451 451 0
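Small addendum: I suspect the collapse comes from `prepare_data` building a one-row DataFrame from each `input_data` dict, so every evaluator call only ever sees a single document per `parent_id`. A toy repro with made-up binary scores (not my real data) showing that, in the single-row case, ndcg, hit rate and recall all coincide and precision is exactly that value divided by K:

```python
import numpy as np
from sklearn.metrics import ndcg_score

K = 10

def single_call_metrics(s):
    # One-row "group": only slot 0 holds a real document, padded to K slots,
    # mirroring what the evaluators compute when len(group) == 1.
    eval_scores = np.zeros(K)
    doc_scores = np.zeros(K)
    eval_scores[0] = s
    doc_scores[0] = 1  # the lone document is top-ranked
    ndcg = ndcg_score([eval_scores], [doc_scores], k=K)
    precision = s / K            # head(K).sum() / K with one row
    hit = int(s > 0)             # any relevant doc in the top K?
    recall = 1.0 if s > 0 else 0.0  # relevant_at_k / total_relevant
    return ndcg, precision, hit, recall

# 400 relevant + 51 irrelevant single-doc calls (made-up mix, ~0.887 positive)
scores = [1] * 400 + [0] * 51
rows = [single_call_metrics(s) for s in scores]
ndcg, prec, hit, rec = (float(np.mean(col)) for col in zip(*rows))
# ndcg == hit == rec, and prec == ndcg / K
```

If that's right, the fix would be to group all retrieved documents of a trace into one dataset example before uploading, so each evaluator call sees the full ranked list, but I haven't verified that against the datasets & experiments docs.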
So basically my requirement is more or less straightforward: every time I make changes to the retrieval part (prompt, embeddings, new docs, chunk size, etc.), I export the new trace file and run the evaluators against the new traces. Does it make sense? Plus, do you suggest a different, maybe more optimal, way of doing this? BTW, my next step is to do the same for the hallucination & correctness part. Thanks in advance and sorry for the ultra long message 😭
hello! first of all, amazing work with datasets & experiments, really awesome stuff ❤️. Question: do you have any examples of how the retrieval evaluation or QA / hallucination evaluations would be integrated with experiments? Curious to see an example of how you would do it, or any suggestions you might have 🙂
thanks, I'll update the thread once I get it running :)
so I guess the question is more around where you set the hostname for the LiteLLM machine. For example, if you use ollama you set an env variable; is it the same variable?
we use litellm with presidio for pii masking :)
Hello Team, I'm trying to use Claude Sonnet 3.5 via LiteLLM. So far the only tutorial I found is regarding ollama running locally. How would one define the LiteLLM and the RelevanceEvaluator to point to a different machine running the litellm endpoint? Basically the content from here: https://docs.arize.com/phoenix/api/evaluation-models#litellmmodel
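Just to make the question concrete, here is the kind of sketch I have in mind. The env-var names follow LiteLLM conventions and I'm not sure they're the right ones here, the host/port are placeholders, and the Phoenix part is untested:

```python
import os

# Placeholder credentials for the remote model (name assumed, not verified).
os.environ["ANTHROPIC_API_KEY"] = "sk-placeholder"

# If the remote machine runs a LiteLLM proxy, one common pattern is to go
# through its OpenAI-compatible endpoint (variable name is an assumption):
os.environ["OPENAI_API_BASE"] = "http://my-litellm-host:4000"

# Then, untested, point the Phoenix eval model at it, e.g.:
# from phoenix.evals import LiteLLMModel, RelevanceEvaluator
# model = LiteLLMModel(model="claude-3-5-sonnet-20240620")
# relevance_evaluator = RelevanceEvaluator(model)
```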
will do, thanks !
seems to be working Roger Y.! thanks a lot as always! One additional question (I can raise a separate thread if needed): I'm trying to use Claude Sonnet 3.5 via LiteLLM. So far the only tutorial I found covers ollama running locally. How would one define the LiteLLM model and the RelevanceEvaluator to point to a LiteLLM machine?
see DM
