Hey all - I'm trying to get started here: https://docs.arize.com/phoenix/evaluation/evals#id-3.-evaluate-and-log-results but I have some questions about the provided Hallucinations and QA Evaluations example. I have Phoenix running with auto-instrumentation traces. I need to:
download traces from phoenix
filter them somehow:
without support for sessions right now, do I make a new project for a specific experiment,
or filter by metadata if I add an additional metadata field like type:experiment?
somehow structure them into the required queries_df and spans_df
I can't find info about what these should look like.
Also, spans_df doesn't appear in the sample code.
Hi Trantor D.! Let me see if I can help:
filter them somehow:
without support for sessions right now, do I make a new project for a specific experiment,
or filter by metadata if I add an additional metadata field like type:experiment?
If you are running Phoenix as a container, the easiest way is to switch projects - think of a project as an environment. https://docs.arize.com/phoenix/tracing/how-to-tracing/customize-traces#log-to-a-specific-project
So that could look like
from openinference.semconv.resource import ResourceAttributes
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
resource = Resource(attributes={
ResourceAttributes.PROJECT_NAME: '<your-project-name>'
})
tracer_provider = trace_sdk.TracerProvider(resource=resource)
span_exporter = OTLPSpanExporter(endpoint="http://phoenix:6006/v1/traces")
span_processor = SimpleSpanProcessor(span_exporter=span_exporter)
tracer_provider.add_span_processor(span_processor=span_processor)
trace_api.set_tracer_provider(tracer_provider=tracer_provider)
# Add any auto-instrumentation you want
LlamaIndexInstrumentor().instrument()
in a notebook you can do something like:
import os
os.environ['PHOENIX_PROJECT_NAME'] = "<your-project-name>"
Yeah - I'd have to restart the pods to switch the project name, which is OK I guess
somehow structure them into the required queries_df and spans_df
I can't find info about what these should look like.
Also, spans_df doesn't appear in the sample code.
Good question - the required columns for QA and Hallucinations are the same: https://docs.arize.com/phoenix/tracing/how-to-tracing/llm-evaluations#span-evaluations You need a dataframe that has:
context.span_id - (the index)
input - (the user query)
output - (the output from the LLM)
reference - (any retrieved RAG context)
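As a rough sketch, the target shape is just a pandas DataFrame with those columns and the span id as the index (the column names come from the docs linked above; the row contents here are invented for illustration):

```python
import pandas as pd

# Hypothetical example row; in practice the values come from Phoenix spans.
queries_df = pd.DataFrame(
    {
        "input": ["What is our refund policy?"],           # the user query
        "output": ["Refunds are issued within 30 days."],  # the LLM's answer
        "reference": ["Refund policy: refunds within 30 days of purchase."],  # retrieved RAG context
    },
    # the span id serves as the index so evals can be logged back to the span
    index=pd.Index(["0123456789abcdef"], name="context.span_id"),
)
print(list(queries_df.columns))
```

A dataframe shaped like this can then be passed to the evaluators shown in the Hallucination/QA example.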
If using projects you do have to specify it, e.g.:
import phoenix as px
from phoenix.trace.dsl import SpanQuery
# Get spans from a project
px.Client().get_spans_dataframe(project_name="<my-project>")
# Using the query DSL
query = SpanQuery().where("span_kind == 'CHAIN'").select(input="input.value")
px.Client().query_spans(query, project_name="<my-project>")
I'm still confused how the grading happens on nested traces/spans. I.e., if I pull all top-level spans, each one has several sub-chain spans, a retriever span, and a final generation span - which do I send for grading?
Do I have to make sure all values get bubbled up to the top-level chain span for grading?
Right, Trantor, good point. Yes, the RAG reference can come from a nested retriever span and is not available on the root span, so if you want to grade by joining together spans from a single trace, you do need to construct a somewhat more complex query.
here's an example: https://github.com/Arize-ai/phoenix/blob/672cbedcea9746ee5ea1d6b61032931110a9b121/src/phoenix/trace/dsl/helpers.py#L61
return pd.concat(
    cast(
        List[pd.DataFrame],
        obj.query_spans(
            SpanQuery().select(**IO).where(IS_ROOT),
            SpanQuery()
            .where(IS_RETRIEVER)
            .select(span_id="parent_id")
            .concat(
                RETRIEVAL_DOCUMENTS,
                reference=DOCUMENT_CONTENT,
            ),
            start_time=start_time,
            stop_time=stop_time,
            project_name=project_name,
        ),
    ),
    axis=1,
    join="inner",
)
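The final pd.concat(..., axis=1, join="inner") step in that helper is plain pandas: it aligns the two per-span dataframes on their span-id index and keeps only the spans present in both. A minimal standalone sketch of just that join (dataframe contents are made up; only the concat semantics mirror the helper):

```python
import pandas as pd

# Root-span I/O, indexed by span id (invented data).
root_io = pd.DataFrame(
    {"input": ["query A", "query B"], "output": ["answer A", "answer B"]},
    index=pd.Index(["span-1", "span-2"], name="context.span_id"),
)

# Retriever documents, re-keyed to the retriever's parent span id so they
# line up with the root spans; here only span-1 has retrieved docs.
retrieved = pd.DataFrame(
    {"reference": ["doc text for A"]},
    index=pd.Index(["span-1"], name="context.span_id"),
)

# Inner join on the index: only spans with both I/O and a reference survive.
joined = pd.concat([root_io, retrieved], axis=1, join="inner")
print(joined)
```

Spans without a matching retriever row (span-2 above) simply drop out of the result, which is why the helper uses join="inner".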
We have some examples here: https://docs.arize.com/phoenix/tracing/how-to-tracing/extract-data-from-spans
My nested traces look like this:
top-level trace
  chain1 -> property1
  chain2 -> property2
  retriever
  final generation
I'd like to grade, I guess, just the final generation span for hallucination and QA correctness, plus the retriever docs, etc.
Ah I see - there is a bit of "complicated" pandas work required
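As a sketch of that pandas work (the column names such as span_kind, context.trace_id, and the flattened attributes.* columns follow Phoenix's spans-dataframe conventions, but the data here is invented): pull the spans dataframe, pick the LLM generation span and the retriever span per trace, and join them on trace id:

```python
import pandas as pd

# Invented stand-in for px.Client().get_spans_dataframe(...).
spans_df = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t1", "t1"],
        "span_kind": ["CHAIN", "CHAIN", "RETRIEVER", "LLM"],
        "attributes.input.value": ["q", None, None, "q"],
        "attributes.output.value": [None, None, None, "final answer"],
        "attributes.retrieval.documents": [None, None, "doc text", None],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="context.span_id"),
)

# The final generation span per trace: the LLM span carries input/output.
gen = spans_df[spans_df["span_kind"] == "LLM"][
    ["context.trace_id", "attributes.input.value", "attributes.output.value"]
].rename(columns={"attributes.input.value": "input", "attributes.output.value": "output"})

# The retriever span per trace carries the reference documents.
ref = spans_df[spans_df["span_kind"] == "RETRIEVER"][
    ["context.trace_id", "attributes.retrieval.documents"]
].rename(columns={"attributes.retrieval.documents": "reference"})

# Join on trace id so each generation row gains its trace's reference,
# keeping the generation span's id as the index for logging evals back.
queries_df = (
    gen.reset_index().merge(ref, on="context.trace_id").set_index("context.span_id")
)
print(queries_df[["input", "output", "reference"]])
```

The resulting queries_df has the generation span's id as the index and the input/output/reference columns the evaluators expect.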
