hello! I have a question regarding RAG evaluation! I've followed the steps presented in the RAG Evaluation webinar on Jan 31st (btw, great webinar Amber R. and Mikyo 🙂 ). The content focuses on evaluating the retrieved nodes, however in my scenario I have the top k = 20 with hybrid search retrieval and a reranker that outputs the top n = 5. I was wondering if you have any examples around how you're evaluating the RAG when a reranker is involved? Thank you!
Teodor C. We did test a re-ranker as part of the benchmarking scripts we built out: https://docs.arize.com/phoenix/llm-evals/benchmarking-retrieval-rag Interestingly it did improve NDCG/MRR metrics (re-ranking was working) BUT it did not improve actual Q&A results. It could be related to the following (note GPT-4 is rock solid in retrieval independent of placement in context) https://twitter.com/aparnadhinak/status/1735678863814938695 We are working on some research to be released at the end of the month that will give us better answers on to rank or not to re-rank.
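For intuition on why a rank metric like MRR can improve while Q&A doesn't, here's a dependency-free sketch (illustrative, not the Phoenix implementation): re-ranking lifts MRR by moving relevant documents up, but if the same documents end up in the context window either way, the answer quality may not move.

```python
from typing import List

def mean_reciprocal_rank(rankings: List[List[int]]) -> float:
    """MRR across queries. Each inner list holds 0/1 relevance labels in
    ranked order (index 0 = top-ranked document); a query scores
    1 / rank_of_first_relevant_document, or 0 if nothing is relevant."""
    total = 0.0
    for labels in rankings:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Same retrieved sets, different ordering: re-ranking lifts MRR even though
# the documents handed to the LLM (and thus the Q&A result) are unchanged.
before = [[0, 0, 1, 0], [0, 1, 0, 0]]  # first relevant doc at rank 3, rank 2
after = [[1, 0, 0, 0], [1, 0, 0, 0]]   # re-ranker moved it to rank 1
```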
There is probably a case for re-ranking a large set of documents and windowing that re-ranked list down before putting it in the context window. We just tested re-ranking the context window itself (15 chunks in the window, re-rank the same 15 chunks). I know there are options that re-rank and then window down (re-rank 100 chunks, put only the top 10 in the window) - which would probably have a larger effect on end results.
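A minimal sketch of that re-rank-then-window pattern (pure Python; `score_pair` and `toy_score` are hypothetical stand-ins for whatever cross-encoder / reranker call you actually use):

```python
from typing import Callable, List

def rerank_then_window(
    query: str,
    chunks: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 10,
) -> List[str]:
    """Score every retrieved chunk against the query, sort best-first,
    and keep only the top_n chunks for the LLM context window."""
    ranked = sorted(chunks, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:top_n]

# Toy scorer (word overlap) standing in for a real reranker model.
def toy_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

# Re-rank 100 retrieved chunks, put only the top 10 in the window.
chunks = [f"chunk {i} about retrieval" if i % 7 == 0 else f"chunk {i}" for i in range(100)]
window = rerank_then_window("evaluating retrieval quality", chunks, toy_score, top_n=10)
```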
Example of setting up some of these metrics: https://docs.arize.com/phoenix/llm-evals/quickstart-retrieval-evals
Mikyo Jason - thank you for the quick answers and for the reference documentation. I've tried to run the same relevancy evaluation on the reranked documents, and to obtain the hallucination / QA correctness assessments as well. I'll try to explain what I've done (sorry in advance for the long message) and also highlight some of the challenges I ran into. Starting from the beginning: I'm using LlamaIndex's sentence window retrieval (metadata replacement mode) with hybrid search and a fine-tuned embedding model:
query_engine = index.as_query_engine(
similarity_top_k=20,
node_postprocessors=[
MetadataReplacementPostProcessor(target_metadata_key="window"),
reranker
],
text_qa_template=qa_template,
vector_store_query_mode="hybrid",
alpha=0.5
)

The way I see it, while I do care about the top 20 retrieved nodes, what matters most is what the reranker outputs: the top 8 nodes that the LLM will use to create the answer. So my main focus (at the moment, at least) is making sure that what reaches the LLM is relevant. Here is what I've done to extract relevancy for the reranked nodes (let me know if I took a wrong approach or if there is an easier one):
from phoenix.trace.dsl import SpanQuery
query = SpanQuery().where(
"span_kind == 'RERANKER'",  # the whole condition must be one string, not a Python comparison
).select(
input="reranker.query",
).explode(
"reranker.output_documents",
reference = "document.content",
document_score = "document.score"
)
reranked_docs_df = px.active_session().query_spans(query)
from phoenix.experimental.evals import (
RelevanceEvaluator,
run_evals,
)
from phoenix.experimental.evals import OpenAIModel
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))
reranked_documents_relevance_df = run_evals(
evaluators=[relevance_evaluator],
dataframe=reranked_docs_df,
provide_explanation=True,
concurrency=20,
)[0]
reranked_documents_with_relevance_df = pd.concat(
[reranked_docs_df, reranked_documents_relevance_df, reranked_documents_relevance_df.add_prefix("rerank_eval_")], axis=1
)
reranking_df = px.active_session().get_spans_dataframe("span_kind == 'RERANKER'")
reranking_evaluation_dataframe = pd.concat(
[
reranking_df["attributes.reranker.query"],
ndcg_at_5.add_prefix("ndcg@5_"),
precision_at_5.add_prefix("precision@5_"),
hit,
],
axis=1,
)
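The `ndcg_at_5`, `precision_at_5`, and `hit` frames above come from the quickstart recipe linked earlier; for reference, here is a dependency-free sketch of those three metrics over a single query's ranked relevance labels (names here are illustrative, not the Phoenix API):

```python
import math
from typing import List

def ndcg_at_k(relevance: List[int], k: int) -> float:
    """NDCG over 0/1 relevance labels, best-ranked first."""
    def dcg(labels: List[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(relevance[:k]) / k

def hit_at_k(relevance: List[int], k: int) -> bool:
    """Whether at least one relevant document appears in the top k."""
    return any(relevance[:k])

ranked = [1, 0, 1, 0, 0]  # e.g. relevance of 5 reranked docs, best-first
```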
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.log_evaluations(
SpanEvaluations(dataframe=ndcg_at_5, eval_name="ndcg@5"),
SpanEvaluations(dataframe=precision_at_5, eval_name="precision@5"),
DocumentEvaluations(dataframe=reranked_documents_with_relevance_df, eval_name="rerank_relevance"),
)

This seems to provide the metrics (though not like for retrieval, where you can see the relevancy of each individual document), and so far that's good enough for me. The main challenge is with the hallucination / QA correctness evals: I expected these two metrics to be calculated from the reranker's output, not from the originally retrieved nodes. On top of that, LlamaIndex's sentence window retrieval uses only a single sentence for retrieval and stores the surrounding sentences in metadata, so the hallucination / QA metrics were evaluated on single sentences instead of the entire window. I've worked around it in a somewhat dirty way (it works 🤷♂️ ); let me know if you think I should take a different approach:
get the reranked dataframe:
from phoenix.trace.dsl import SpanQuery
query = SpanQuery().where(
"span_kind == 'RERANKER'",  # condition as one string, not a Python comparison
).select(
input="reranker.query",
).explode(
"reranker.output_documents",
reference = "document.content",
document_score = "document.score"
)
reranked_docs_df = px.active_session().query_spans(query)
concatenate all the window nodes:
reranked_docs_to_llm_concat_df = reranked_docs_df.groupby('input')['reference'].apply("\n".join).reset_index()
get the qa_with_reference df:
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.active_session())
qa_with_reference_df
replace the references from the qa_with_reference with the concatenated references from the reranked nodes:
def update_references(final_df, original_df):
if final_df.index.name == "context.span_id":
final_df = final_df.reset_index()
merged_df = final_df.merge(original_df[['input', 'reference']], on='input', how='left', suffixes=("","_original"))
merged_df['reference'] = merged_df['reference_original']
merged_df.drop(columns=['reference_original'], inplace=True)
if 'context.span_id' in merged_df:
merged_df.set_index('context.span_id', inplace=True)
return merged_df
reranked_references_to_llm_df = update_references(qa_with_reference_df, reranked_docs_to_llm_concat_df)
ran the QA/Hallucination evaluation on the latest df
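For completeness, that last step could look roughly like this, mirroring the `run_evals` / `log_evaluations` pattern already used above (evaluator and eval names follow the Phoenix quickstart; a sketch, adjust to your version):

```python
import phoenix as px
from phoenix.experimental.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)
from phoenix.trace import SpanEvaluations

model = OpenAIModel(model_name="gpt-4-turbo-preview")

# Run both evals against the df whose references were swapped for the
# concatenated reranked windows (from the previous step).
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
    dataframe=reranked_references_to_llm_df,
    provide_explanation=True,
    concurrency=20,
)

px.log_evaluations(
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="QA Correctness"),
)
```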
Let me know your thoughts on this, sorry again for the super long message
Awesome work, Teodor! I haven’t recreated your operating condition exactly, but looking at your code for the reranker, I think you can try the following modified version of our qa_with_reference query. Screenshot is my output when I tried it.
pd.concat(
px.Client().query_spans(
SpanQuery().select(input="input.value", output="output.value").where("parent_id is None"), # root span
SpanQuery()
.where("span_kind == 'RERANKER'")
.select(span_id="parent_id") # substitute parent_id for span_id so the join works
.concat(
"reranker.output_documents",
reference="document.content",
),
),
axis=1,
join="inner",
)

If joining on parent_id doesn't work, you can also join on trace_id and then reset the index back to span_id:
pd.concat(
px.Client().query_spans(
SpanQuery()
.with_index("trace_id")
.select("span_id", input="input.value", output="output.value")
.where("parent_id is None"),
SpanQuery()
.with_index("trace_id")
.where("span_kind == 'RERANKER'")
.concat(
"reranker.output_documents",
reference="document.content",
),
),
axis=1,
join="inner",
).set_index("context.span_id")

Re: "the surrounding sentences in metadata…I've worked around it"
Also, you mention metadata, but I don't see metadata anywhere in your code. Was your intention to concatenate the metadata (instead of document.content)?
