From Arturo H.
I see that in your documentation you suggest running evals online via a cron job. I am starting to use DSPy, and it was complicated to get optimization evaluations to show up. Would you consider supporting DSPy LLM-as-a-judge, or even DSPy evals, as regular evaluations for tracking purposes?
Arturo H. There are a few things to unpack here. Phoenix can accept any evaluations you like. If you would like to ingest evals computed with DSPy, you can definitely do so using px.Client.log_evaluations! We'd love it if you would share what works best for your use case. It sounds like you also have in mind some kind of service that runs evals online or as batch jobs?
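For context, a minimal sketch of the shape log_evaluations expects: a dataframe indexed by context.span_id. The helper names and span IDs below are made up for illustration; only the SpanEvaluations/log_evaluations calls come from the Phoenix API.

```python
import pandas as pd

def build_eval_df(rows):
    # Phoenix matches evaluations to spans via the dataframe index,
    # which must be named 'context.span_id'.
    return pd.DataFrame(rows).set_index("context.span_id")

def log_to_phoenix(df, eval_name="DSPy Correctness"):
    # Requires a running Phoenix server; shown for shape only.
    import phoenix as px
    from phoenix.trace import SpanEvaluations
    px.Client().log_evaluations(SpanEvaluations(eval_name=eval_name, dataframe=df))

rows = [
    {"context.span_id": "90e15c9218eb527f", "label": "correct", "score": 1, "explanation": "matches"},
    {"context.span_id": "48ad22736fd5a8c1", "label": "incorrect", "score": 0, "explanation": "no match"},
]
df = build_eval_df(rows)
print(df.columns.tolist())  # ['label', 'score', 'explanation']
```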
I am still exploring. I get the span and send it in the forward pass like this: self.prog(question=question, span_id=current_span.get_span_context().span_id). With DSPy, because of the optimization process, it is nice to have a reference to the actual span rather than having to query for it later by some id. I could just add an id attribute to the span and query it later, but because of the optimization there are many variables changing. Once DSPy is in production it's not an issue; I guess it's just a question of what developer experience you want to provide.
self.prog(question=question, span_id=current_span.get_span_context().span_id)
I'm probably not understanding what this line of code is trying to accomplish. Are you trying to query spans created during the optimization step?
class cs_mvs_links(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(CS_MVS_Links)

    def forward(self, question):
        context = context_api.get_current()
        current_span = trace_api.get_current_span(context)
        return self.prog(question=question, span_id=current_span.get_span_context().span_id)

I don't recall if get_current_span is needed or if get_current is enough, but because the code is executed in the context of the span, it is able to get the span_id. I pass it into the forward step so I can recover it during the metric/teacher call made by DSPy. Then, as soon as the metric evaluation is made, I can pair it up and place it in a memory array, and I send everything when it's done.
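The pair-up-and-send-later pattern described here can be sketched in plain Python (the buffer and helper names below are my own, not part of DSPy or Phoenix):

```python
# Accumulate (span_id, result) pairs during the optimization run,
# then flush them in one batch once the run finishes.
evaluation_buffer = []

def record_evaluation(span_id, passed, explanation=""):
    # Called from inside the DSPy metric; just remembers the pairing.
    evaluation_buffer.append({
        "context.span_id": span_id,
        "label": "correct" if passed else "incorrect",
        "score": int(passed),
        "explanation": explanation,
    })

def flush_evaluations(sink):
    # `sink` is whatever ships the batch (e.g. a Phoenix client wrapper).
    batch, evaluation_buffer[:] = list(evaluation_buffer), []
    sink(batch)
    return batch
```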
Interesting. I'm not sure I'm deep enough in the weeds with DSPy to fully understand, but it sounds like you are trying to correlate different spans between training and inference?
The idea of DSPy is that it runs many inferences and evolves your prompts to find the best one based on test data, evaluating the agent/prompt as it evolves. I am using Arize to see whether the prompt evolution is working as expected. There could be thousands of inference/evaluation pairs, or even deeper flows. I suspect your number of DSPy users is on the increase... time will tell, as you get other feedback.
I cannot figure that out. I do this:

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        current_span = trace_api.get_current_span()
        print(current_span.get_span_context().span_id)
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer, span_id=current_span.get_span_context().span_id)

and this:

import pandas as pd

# Initialize a list to store evaluation data
evaluation_data = []

def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    # Retrieve the span_id from the prediction
    span_id = getattr(pred, 'span_id', None)
    print(f"Span ID during evaluation: {span_id}")
    if span_id is not None:
        metrics_data = {
            'context.span_id': span_id,
            'label': 'correct' if answer_EM else 'incorrect',
            'value': int(answer_EM),
            'explanation': "Explanation for each prediction"
        }
        evaluation_data.append(metrics_data)
    return answer_EM and answer_PM
and when I try to push it with:

qa_correctness_eval_df = pd.DataFrame(evaluation_data).set_index('context.span_id')
print(f"QA Correctness Evaluation DataFrame: {qa_correctness_eval_df}")

# Log the evaluations to Phoenix Arize
from phoenix.trace import SpanEvaluations
px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness"
    )
)

I get back:

QA Correctness Evaluation DataFrame:
                          label  value                      explanation
context.span_id
10441118185029422972  incorrect      0  Explanation for each prediction
5262738036224594025     correct      1  Explanation for each prediction

but then when I do:

hm = px.Client().get_spans_dataframe(project_name="Span-test")
print(hm.join(qa_correctness_eval_df, how='inner'))

there is nothing.
The idea is that the dspy.Module returns the span_id, which I can then use to link the evaluation to it. But the span_id returned from the dspy.Module does not exist in the dataframe from px.Client().get_spans_dataframe(project_name="Span-test").
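One likely culprit, for what it's worth: get_span_context().span_id is an integer, whereas the span IDs in the Phoenix spans dataframe are 16-character lowercase hex strings (the W3C trace-context encoding). If that is the mismatch here, converting before indexing should make the join line up. A minimal sketch (the helper name is mine):

```python
def span_id_to_hex(span_id: int) -> str:
    # OpenTelemetry span IDs are 64-bit ints; exporters render them
    # as zero-padded, 16-char, lowercase hex strings.
    return format(span_id, "016x")

# The integer printed during the forward pass...
raw_id = 10441118185029422972
# ...becomes the string form you would expect in the spans dataframe.
print(span_id_to_hex(raw_id))
```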
Try this:
def validate_context_and_answer(example, pred, trace=None):
    span_index = next(attr['span_id'] for attr in trace[-1] if isinstance(attr, dict) and 'span_id' in attr)

I got the span into the evaluation that way. Then I just appended the span id and the result to another data structure, and processed that data structure to add the spans manually, like this:
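To illustrate the extraction with a mocked-up trace (I'm assuming here, as the snippet above implies, that trace[-1] is an iterable whose elements include a dict-like prediction carrying the span_id; the fake data is mine):

```python
def extract_span_id(trace):
    # Walk the last trace step and pull the first dict that carries a span_id.
    return next(
        attr["span_id"]
        for attr in trace[-1]
        if isinstance(attr, dict) and "span_id" in attr
    )

# Mocked trace: earlier steps, then a final step containing the prediction dict.
fake_trace = [
    ("predictor", {"question": "q1"}),
    ("predictor", {"question": "q2"}, {"answer": "a2", "span_id": "5262738036224594"}),
]
print(extract_span_id(fake_trace))  # 5262738036224594
```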
def log_evaluations(endpoint, eval_data):
    tracer_provider.force_flush()
    client = px.Client(endpoint=endpoint)
    eval_dataframe = pd.DataFrame(eval_data, columns=['span_id', 'score', 'label', 'explanation'])
    eval_dataframe.set_index('span_id', inplace=True)
    span_evaluations = SpanEvaluations(dataframe=eval_dataframe, eval_name="correctness_evaluations")
    client.log_evaluations(span_evaluations)
    return span_evaluations

If you are just trying to get evaluations to work, you can use Arize's own evaluations. If you want the DSPy evaluations that are made as part of the optimization process, that's where my reply might help you.
Thanks for the help! Where do you get the trace from?
def validate_context_and_answer(example, pred, trace=None):
    span_index = next(attr['span_id'] for attr in trace[-1] if isinstance(attr, dict) and 'span_id' in attr)

I want to log the DSPy evaluations that are made as part of the optimization process.
It's added at the forward step
def forward(self, question):
    current_span = trace_api.get_current_span()
    print(current_span.get_span_context().span_id)
    context = self.retrieve(question).passages
    prediction = self.generate_answer(context=context, question=question)
    return dspy.Prediction(context=context, answer=prediction.answer, span_id=current_span.get_span_context().span_id)

That's where data from the result comes back to the framework. Adding the span_id as part of the prediction is a bit of a hack, but you can just ignore it.
