hello! I have a question regarding RAG evaluation! I've followed the steps presented in the RAG Evaluation webinar on Jan 31st (btw, great webinar Amber R. and Mikyo 🙂 ). The content focuses on evaluating the retrieved nodes, however in my scenario I have the top k = 20 with hybrid search retrieval and a reranker that outputs the top n = 5. I was wondering if you have any examples around how you're evaluating the RAG when a reranker is involved? Thank you!
Teodor C. We did test a re-ranker as part of the benchmarking scripts we built out: https://docs.arize.com/phoenix/llm-evals/benchmarking-retrieval-rag Interestingly it did improve NDCG/MRR metrics (re-ranking was working) BUT it did not improve actual Q&A results. It could be related to the following (note GPT-4 is rock solid in retrieval independent of placement in context) https://twitter.com/aparnadhinak/status/1735678863814938695 We are working on some research to be released at the end of the month that will give us better answers on to rank or not to re-rank.
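For intuition on why a rank metric like MRR can improve while Q&A doesn't, here's a dependency-free sketch (illustrative, not the Phoenix implementation): re-ranking lifts MRR by moving relevant documents up, but if the same documents end up in the context window either way, the answer quality may not move.

```python
from typing import List

def mean_reciprocal_rank(rankings: List[List[int]]) -> float:
    """MRR across queries. Each inner list holds 0/1 relevance labels in
    ranked order (index 0 = top-ranked document); a query scores
    1 / rank_of_first_relevant_document, or 0 if nothing is relevant."""
    total = 0.0
    for labels in rankings:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Same retrieved sets, different ordering: re-ranking lifts MRR even though
# the documents handed to the LLM (and thus the Q&A result) are unchanged.
before = [[0, 0, 1, 0], [0, 1, 0, 0]]  # first relevant doc at rank 3, rank 2
after = [[1, 0, 0, 0], [1, 0, 0, 0]]   # re-ranker moved it to rank 1
```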
There is probably a case for re-ranking a large set of documents and windowing that re-ranked list down before putting it in the context window. We just tested re-ranking the context window itself (15 chunks in the window, re-rank the same 15 chunks). I know there are options that re-rank and then window down (re-rank 100 chunks, put only the top 10 in the window) - which would probably have a larger effect on end results.
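A minimal sketch of that re-rank-then-window pattern (pure Python; `score_pair` and `toy_score` are hypothetical stand-ins for whatever cross-encoder / reranker call you actually use):

```python
from typing import Callable, List

def rerank_then_window(
    query: str,
    chunks: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 10,
) -> List[str]:
    """Score every retrieved chunk against the query, sort best-first,
    and keep only the top_n chunks for the LLM context window."""
    ranked = sorted(chunks, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:top_n]

# Toy scorer (word overlap) standing in for a real reranker model.
def toy_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

# Re-rank 100 retrieved chunks, put only the top 10 in the window.
chunks = [f"chunk {i} about retrieval" if i % 7 == 0 else f"chunk {i}" for i in range(100)]
window = rerank_then_window("evaluating retrieval quality", chunks, toy_score, top_n=10)
```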
Example of setting up some of these metrics: https://docs.arize.com/phoenix/llm-evals/quickstart-retrieval-evals
Mikyo Jason - thank you for the quick answers and for the reference documentation. I've tried to run the same relevancy evaluation on the reranked documents, and to obtain the hallucination / QA correctness assessments as well. I'll try to explain what I've done (sorry in advance for the long message) and also highlight some of the challenges I ran into. Starting from the beginning: I'm using LlamaIndex's sentence window retrieval (metadata replacement mode) with hybrid search and a fine-tuned embedding model:
query_engine = index.as_query_engine(
similarity_top_k=20,
node_postprocessors=[
MetadataReplacementPostProcessor(target_metadata_key="window"),
reranker
],
text_qa_template=qa_template,
vector_store_query_mode="hybrid",
alpha=0.5
)

The way I see it, while I do care about the top 20 retrieved nodes, what matters most is what the reranker outputs: the top 8 nodes that the LLM will use to create the answer. So my main focus (at the moment, at least) is making sure that what reaches the LLM is relevant. Here is what I've done to extract relevancy for the reranked nodes (let me know if I took a wrong approach or if there is an easier one):
from phoenix.trace.dsl import SpanQuery
query = SpanQuery().where(
"span_kind == 'RERANKER'",  # the whole condition must be one string, not a Python comparison
).select(
input="reranker.query",
).explode(
"reranker.output_documents",
reference = "document.content",
document_score = "document.score"
)
reranked_docs_df = px.active_session().query_spans(query)
from phoenix.experimental.evals import (
RelevanceEvaluator,
run_evals,
)
from phoenix.experimental.evals import OpenAIModel
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))
reranked_documents_relevance_df = run_evals(
evaluators=[relevance_evaluator],
dataframe=reranked_docs_df,
provide_explanation=True,
concurrency=20,
)[0]
reranked_documents_with_relevance_df = pd.concat(
[reranked_docs_df, reranked_documents_relevance_df, reranked_documents_relevance_df.add_prefix("rerank_eval_")], axis=1
)
reranking_df = px.active_session().get_spans_dataframe("span_kind == 'RERANKER'")
reranking_evaluation_dataframe = pd.concat(
[
reranking_df["attributes.reranker.query"],
ndcg_at_5.add_prefix("ndcg@5_"),
precision_at_5.add_prefix("precision@5_"),
hit,
],
axis=1,
)
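The `ndcg_at_5`, `precision_at_5`, and `hit` frames above come from the quickstart recipe linked earlier; for reference, here is a dependency-free sketch of those three metrics over a single query's ranked relevance labels (names here are illustrative, not the Phoenix API):

```python
import math
from typing import List

def ndcg_at_k(relevance: List[int], k: int) -> float:
    """NDCG over 0/1 relevance labels, best-ranked first."""
    def dcg(labels: List[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(relevance[:k]) / k

def hit_at_k(relevance: List[int], k: int) -> bool:
    """Whether at least one relevant document appears in the top k."""
    return any(relevance[:k])

ranked = [1, 0, 1, 0, 0]  # e.g. relevance of 5 reranked docs, best-first
```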
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.log_evaluations(
SpanEvaluations(dataframe=ndcg_at_5, eval_name="ndcg@5"),
SpanEvaluations(dataframe=precision_at_5, eval_name="precision@5"),
DocumentEvaluations(dataframe=reranked_documents_with_relevance_df, eval_name="rerank_relevance"),
)

This seems to provide the metrics (though not like for retrieval, where you can see the relevancy of each individual document), and so far that's good enough for me. The main challenge is with the hallucination / QA correctness evals: I expected these two metrics to be calculated from the reranker's output, not from the originally retrieved nodes. On top of that, LlamaIndex's sentence window retrieval uses only a single sentence for retrieval and stores the surrounding sentences in metadata, so the hallucination / QA metrics were evaluated on single sentences instead of the entire window. I've worked around it in a somewhat dirty way (it works 🤷♂️ ); let me know if you think I should take a different approach:
get the reranked dataframe:
from phoenix.trace.dsl import SpanQuery
query = SpanQuery().where(
"span_kind == 'RERANKER'",  # condition as one string, not a Python comparison
).select(
input="reranker.query",
).explode(
"reranker.output_documents",
reference = "document.content",
document_score = "document.score"
)
reranked_docs_df = px.active_session().query_spans(query)
concatenate all the window nodes:
reranked_docs_to_llm_concat_df = reranked_docs_df.groupby('input')['reference'].apply("\n".join).reset_index()
get the qa_with_reference df:
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.active_session())
qa_with_reference_df
replace the references from the qa_with_reference with the concatenated references from the reranked nodes:
def update_references(final_df, original_df):
if final_df.index.name == "context.span_id":
final_df = final_df.reset_index()
merged_df = final_df.merge(original_df[['input', 'reference']], on='input', how='left', suffixes=("","_original"))
merged_df['reference'] = merged_df['reference_original']
merged_df.drop(columns=['reference_original'], inplace=True)
if 'context.span_id' in merged_df:
merged_df.set_index('context.span_id', inplace=True)
return merged_df
reranked_references_to_llm_df = update_references(qa_with_reference_df, reranked_docs_to_llm_concat_df)
ran the QA/Hallucination evaluation on the latest df
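For completeness, that last step could look roughly like this, mirroring the `run_evals` / `log_evaluations` pattern already used above (evaluator and eval names follow the Phoenix quickstart; a sketch, adjust to your version):

```python
import phoenix as px
from phoenix.experimental.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)
from phoenix.trace import SpanEvaluations

model = OpenAIModel(model_name="gpt-4-turbo-preview")

# Run both evals against the df whose references were swapped for the
# concatenated reranked windows (from the previous step).
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
    dataframe=reranked_references_to_llm_df,
    provide_explanation=True,
    concurrency=20,
)

px.log_evaluations(
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="QA Correctness"),
)
```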
Let me know your thoughts on this, sorry again for the super long message
Awesome work, Teodor! I haven’t recreated your operating condition exactly, but looking at your code for the reranker, I think you can try the following modified version of our qa_with_reference query. Screenshot is my output when I tried it.
pd.concat(
px.Client().query_spans(
SpanQuery().select(input="input.value", output="output.value").where("parent_id is None"), # root span
SpanQuery()
.where("span_kind == 'RERANKER'")
.select(span_id="parent_id") # substitute parent_id for span_id so the join works
.concat(
"reranker.output_documents",
reference="document.content",
),
),
axis=1,
join="inner",
)

If joining on parent_id doesn't work, you can also join on trace_id and then reset the index back to span_id:
pd.concat(
px.Client().query_spans(
SpanQuery()
.with_index("trace_id")
.select("span_id", input="input.value", output="output.value")
.where("parent_id is None"),
SpanQuery()
.with_index("trace_id")
.where("span_kind == 'RERANKER'")
.concat(
"reranker.output_documents",
reference="document.content",
),
),
axis=1,
join="inner",
).set_index("context.span_id")

Re: "the surrounding sentences in metadata…I've worked around it"
Also, you mention metadata, but I don't see metadata anywhere in your code. Was your intention to concatenate the metadata (instead of document.content)?
