To evaluate three different query engines and compare their results, metadata tags are generally the preferred approach. All traces stay within a single project, and a metadata value on each span identifies which query engine produced it, which makes side-by-side comparison easier.
Here's how you can achieve this:
Using Metadata
You can add metadata to the spans to differentiate between the query engines:
```python
import os

from llama_index.core import set_global_handler
from openinference.instrumentation import using_metadata

# Send traces to the hosted Phoenix endpoint, authenticating with the API key.
PHOENIX_API_KEY = os.getenv("PHOENIX_API_KEY")
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
set_global_handler("arize_phoenix", endpoint="https://llamatrace.com/v1/traces")

# MultimodalQueryEngine, index1..index3, and azure_openai_mm_llm are assumed
# to be defined elsewhere in your code.
query_engine_1 = MultimodalQueryEngine(
    retriever=index1.as_retriever(similarity_top_k=3),
    multi_modal_llm=azure_openai_mm_llm,
)
query_engine_2 = MultimodalQueryEngine(
    retriever=index2.as_retriever(similarity_top_k=3),
    multi_modal_llm=azure_openai_mm_llm,
)
query_engine_3 = MultimodalQueryEngine(
    retriever=index3.as_retriever(similarity_top_k=3),
    multi_modal_llm=azure_openai_mm_llm,
)

# Tag the spans created by each query with the engine that produced them.
with using_metadata({"engine": "query_engine_1"}):
    result_1 = query_engine_1.query("Your query here")
with using_metadata({"engine": "query_engine_2"}):
    result_2 = query_engine_2.query("Your query here")
with using_metadata({"engine": "query_engine_3"}):
    result_3 = query_engine_3.query("Your query here")
```

The spans from each query now carry an `engine` metadata value identifying which query engine was used. You can then filter and compare the results based on these tags.
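As a rough illustration of the compare-by-tag step, here is a minimal sketch in plain Python that groups already-exported span records by their `engine` tag and averages a score per engine. The record layout (a dict with a `metadata` mapping and a `score` field), the helper names, and the sample values are all hypothetical placeholders, not Phoenix's export schema, so adapt the field access to whatever your export actually contains.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported span records. The field names and values are
# placeholders for illustration only, not Phoenix's actual export schema.
spans = [
    {"metadata": {"engine": "query_engine_1"}, "score": 0.75},
    {"metadata": {"engine": "query_engine_2"}, "score": 0.5},
    {"metadata": {"engine": "query_engine_1"}, "score": 1.0},
    {"metadata": {"engine": "query_engine_3"}, "score": 0.25},
    {"metadata": {}, "score": 1.0},  # untagged span, skipped when grouping
]


def group_by_engine(records):
    """Group span records by their "engine" metadata tag, skipping untagged ones."""
    groups = defaultdict(list)
    for record in records:
        engine = record.get("metadata", {}).get("engine")
        if engine is not None:
            groups[engine].append(record)
    return dict(groups)


def mean_score_by_engine(records):
    """Return {engine: mean score} for a side-by-side comparison."""
    return {
        engine: mean(r["score"] for r in group)
        for engine, group in group_by_engine(records).items()
    }


summary = mean_score_by_engine(spans)
for engine in sorted(summary):
    print(f"{engine}: {summary[engine]:.3f}")
```

Because the grouping key comes from the same `engine` value passed to `using_metadata`, adding a fourth engine should only require tagging its queries; the comparison code stays unchanged.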
Would you like more details on setting up the evaluations or any other aspect?
Sources:
- [GitHub Issue #3375](https://github.com/Arize-ai/phoenix/issues/3375)