Hm - not sure why adding this column: retrieved_documents_eval["score"] allowed all the other values to be calculated?
ndcg - takes into account the document ordering and their scores - my documents didn't come from OpenSearch with any scores (have to fix this)
precision - "what % of retrieved docs were relevant"? A simple label count
hit rate - "what % of rows have at least 1 relevant doc?"
Is this correct? Feel like I'm missing something
yeah so if you provide an eval on a document, we calculate all the IR metrics for you. Typically we just compute relevance so these metrics are derived on top of these
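For concreteness, the three metrics described above can be sketched from per-row binary relevance scores like this (function names here are illustrative, not Phoenix's actual API):

```python
import math

def dcg(scores):
    # Discounted cumulative gain: a hit at rank 1 counts fully,
    # hits at later ranks are discounted by log2(rank + 1)
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

def ndcg(scores):
    # Normalize by the best possible ordering of the same scores
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0

def precision(scores):
    # "what % of retrieved docs were relevant" - a simple label count
    return sum(scores) / len(scores) if scores else 0.0

def hit_rate(rows):
    # "what % of rows have at least 1 relevant doc"
    return sum(any(r) for r in rows) / len(rows) if rows else 0.0

# Toy data: each row holds the binary relevance of that query's
# retrieved docs, in rank order
rows = [[1, 0, 1], [0, 0, 0]]
print(precision(rows[0]))  # 2 of 3 docs relevant
print(hit_rate(rows))      # 1 of 2 rows has a hit -> 0.5
print(ndcg(rows[0]))       # < 1.0 because the relevant docs aren't ranked first
```

Note that precision and hit rate ignore ordering entirely, while NDCG is the only one of the three that needs rank-aware scores, which is why missing scores matter most there.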
But why did I have to do that extra binary calculation for all of this to be "turned on"?
which binary calculation, for relevance?
None of these values were being calculated until I did:
retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)
Yeah, so the IR metrics are calculated off of the score column of the eval, since we don't know whether "relevant" or "irrelevant" is the positive label. So we only calculate using the score. Not sure if that answers your question. Our RelevanceEvaluator does the score mapping for you. The llm_ops_overview might be a tad outdated.
Hm - I don't think langchain OpenSearch retrievers can actually return document scores. In other words, opensearch_vector_search.asimilarity_search_with_score works, but if I use the "standard" opensearch_vector_search.as_retriever, there is no with_score as part of a VectorStoreRetriever :L Not Phoenix's issue of course, but evals made me realize this
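One workaround pattern for that gap is to call the with-score API directly and stash the score in each document's metadata, so a plain list of documents (what a retriever returns) still carries it downstream. Sketched here with stand-in classes rather than LangChain's actual types, since it needs a live OpenSearch cluster otherwise:

```python
from dataclasses import dataclass, field

# Stand-in for LangChain's Document type (illustrative only)
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Stand-in for a vector store exposing a with-score search method,
# analogous to opensearch_vector_search.similarity_search_with_score
class FakeVectorSearch:
    def similarity_search_with_score(self, query, k=4):
        return [
            (Document("doc about " + query), 0.91),
            (Document("loosely related doc"), 0.42),
        ][:k]

def retrieve_with_scores(store, query, k=4):
    # Bypass as_retriever: call the with-score API and copy each score
    # into the document's metadata so it survives as part of the doc
    docs = []
    for doc, score in store.similarity_search_with_score(query, k=k):
        doc.metadata["score"] = score
        docs.append(doc)
    return docs

docs = retrieve_with_scores(FakeVectorSearch(), "opensearch")
print([d.metadata["score"] for d in docs])  # -> [0.91, 0.42]
```

The same idea works as a thin custom retriever class if you need to stay inside a chain; the key point is just that the score has to be captured at the with-score call, since the standard retriever interface drops it.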
Interesting, yeah we try to show the score that the vector store returns as part of the tracing. But we are pretty blind to what this score means because it can be cosine similarity or any other heuristic.
