I thought so, I just needed to check if there is any functionality available in phoenix that already expects that. I'll do it this way then, it shouldn't be too hard. Thanks!:)
I have another question that is not directly connected to the issue above, but it came out of testing these evals. The evals work great if the LLM judge only has to assess a single step with one or more parallel tool calls. However, if two sequential steps are needed, for instance tool A has to be called and then tool B, the second call is always labelled as incorrect. For example, I have a tool A (which gets the temperature of a chosen city) and a tool B which is a calculator. My question was: "Get me the temperature of NY and London and sum them up". For this, get_temperature should be called twice and then the calculator should sum up the temperatures. The row relating to the first LLM call correctly got the correct label:

[{'message': {'tool_calls': [{'tool_call': {'id': 'call_lfpBDHa', 'function': {'name': 'get_temperature', 'arguments': '{"city":"NY"}'}}}, {'tool_call': {'id': 'call_1oHhZs', 'function': {'name': 'get_temperature', 'arguments': '{"city":"London"}'}}}], 'role': 'assistant'}}]

However, the LLM call that is made after the one above is this one:

[{'message': {'tool_calls': [{'tool_call': {'id': 'call_LhbogWS', 'function': {'name': 'calculator_tool', 'arguments': '{"equation":"14 + 18"}'}}}], 'role': 'assistant'}}]

This is correct, but the LLM judge is going to label it as incorrect. The explanation is as follows:
"The tool called is the calculator_tool, which is intended for performing calculations. However, the question specifically asks for the current temperature in NY and London and their sum, which requires the use of the get_temperature to retrieve the temperature first. The correct tool to answer the question is not the calculator_tool, but rather the get_temperature, which was correctly called for both cities. Therefore, the choice of the calculator_tool is incorrect as it does not directly answer the question without the prior price retrieval."

So basically the question is: does evaluation with an LLM as a judge somehow allow evaluating agents that need multiple sequential steps to reach the final answer? And if it does allow for that, how do I do it? Thank you in advance for your answer again!
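One workaround I'm considering (just a sketch, not an official Phoenix feature; the column names "question", "tool_call", and "prior_tool_calls" are my own invention): give the judge the trajectory so far by adding a column with the tool calls from earlier steps, and interpolate that column into the eval template so step two is judged in context rather than against the bare question.

```python
# Sketch: enrich each eval row with the tool calls made in earlier steps of
# the same trajectory, so the judge sees "what was already done" and does not
# penalize a correct second step. Column names here are assumptions, not a
# Phoenix convention.
import json
import pandas as pd

rows = [
    {"question": "Get me the temperature of NY and London and sum them up",
     "tool_call": "get_temperature"},
    {"question": "Get me the temperature of NY and London and sum them up",
     "tool_call": "calculator_tool"},
]
df = pd.DataFrame(rows)

# For step i, collect the tool calls from steps 0..i-1.
df["prior_tool_calls"] = [
    json.dumps(df["tool_call"].iloc[:i].tolist()) for i in range(len(df))
]

# A judge template could then interpolate both columns, e.g.:
# "Given the question {question} and the tools already called
#  ({prior_tool_calls}), is the next tool call {tool_call} correct?"
print(df["prior_tool_calls"].tolist())
```

With this, the calculator_tool row carries the fact that get_temperature was already called, which is exactly the context the judge was missing.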
thank you so much John! I am on it! Thank you again ❤️
No worries, I appreciate your help so much! I have to present the Phoenix evals on the OpenAI SDK tomorrow afternoon, so I am still working on figuring this out :D I was guessing the same, but I can't figure out where that context.span_id is usually created. I am using the OpenAI SDK, so for the traces all I need to do is:

tracer_provider = register(
    project_name="project_name",
    auto_instrument=True,
)
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)

Later I just call agent_output = await Runner.run(triage_agent, prompt) and I can see the traces in the app. But the thing is that this apparently creates context.span_id somewhere in the background and I don't know how to access it. The context.span_ids you saw above are indeed wrong: I created them manually back when I didn't know that I actually have to "catch" them somewhere in the code while creating the pandas dataframe (let's call it X) from the agent_output. That X is then passed to the llm_classify function, and yeah, if context.span_id is not created as it should be (while running the agent), the evals are nowhere to be seen. Do you maybe have any suggestion how I could create/catch those span_ids?
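For illustration, here is a toy pandas sketch of what I think should happen: the span ids come from the instrumentor, so the eval dataframe X should inherit them as its index instead of me inventing them. (In a live setup I believe the spans can be fetched back with px.Client().get_spans_dataframe(); the span ids and the output column below are made up placeholders.)

```python
# Toy sketch, no Phoenix server needed. In reality you would fetch:
#   import phoenix as px
#   spans = px.Client().get_spans_dataframe()
# and that frame's index already holds the span ids the instrumentor created.
# Here we fake a tiny spans frame with the same shape.
import pandas as pd

spans = pd.DataFrame(
    {"output_messages": ["<tool call A>", "<tool call B>"]},  # placeholder
    index=pd.Index(["span-abc", "span-def"], name="context.span_id"),
)

# Build the eval input X *from* the spans frame, so every row already carries
# the real span id in its index instead of a hand-made one.
X = spans.copy()
X["question"] = "Get me the temperature of NY and London and sum them up"
print(X.index.name)  # the index llm_classify / log_evaluations keys on
```

The point of the sketch: never construct context.span_id by hand; derive X from the spans frame so the ids line up with what is stored in Phoenix.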
I imported my dataframe into the Jupyter notebook that I linked above to demonstrate that the dataframe tool_call_eval in my case and in the experiment provided by Phoenix have the same structure. I really don't know what else could go wrong here when calling this and trying to log the results back to Phoenix:

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)
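For reference, this is the minimal shape I believe that call expects (my assumption: SpanEvaluations matches rows to spans via the dataframe index, so the index must hold the real span ids and be named context.span_id, with a label and/or score column alongside; the ids below are placeholders):

```python
# Minimal shape check before px.Client().log_evaluations(SpanEvaluations(...)).
# Assumption: rows are matched to spans through the dataframe index, so the
# index must contain the actual span ids recorded by the instrumentor.
import pandas as pd

tool_call_eval = pd.DataFrame(
    {"label": ["correct", "incorrect"], "score": [1, 0]},
    index=pd.Index(["span-abc", "span-def"], name="context.span_id"),
)

assert tool_call_eval.index.name == "context.span_id"
assert {"label", "score"} <= set(tool_call_eval.columns)
```

If the index holds ids that don't match any span in the project, I'd expect the log call to succeed silently while nothing shows up in the UI, which matches what I'm seeing.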
I totally understand this balance between ease of use and customizability - makes sense. However, sometimes it's a bit confusing what Phoenix needs in order to work, like in my current case with the evals 🥲 Unfortunately, even when I switched the toggle to "all" I can't see the eval scores. Do you have any other idea what I could do? I also ran the example you Phoenix folks provided, just to see whether it works for me at all, and the evals from this Jupyter notebook (which I followed in my implementation: https://github.com/Arize-ai/phoenix/blob/4f4ea07adcbdfa21be1be855da273d55515f61c1/tutorials/evals/evaluate_agent.ipynb#L4) did work for me. So something is off in my case.
basically eval labels are nowhere to be seen:
Hi John, in the end I managed to do it. It was difficult to understand how to export the outputs of the OpenAI SDK into a dataframe that could then be run through the llm_classify method. Also, this part of the code is outdated in the Jupyter notebook:
template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
"{tool_definitions}",
"generate_visualization, lookup_sales_data, analyze_sales_data, run_python_code",
)

However, I did manage to run the llm_classify method, but it is not successfully logged into Phoenix and I don't know why. I use this code (the dataframe is pasted in the photo), but I don't see it in the app as I should. I see the traces from calling the agent, and there is also information in the Phoenix app about the llm evaluation run, as you can see in the pasted picture. However, I don't see anywhere the indicator that would show the percentage of how successful the evaluation was:
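In case it helps: my guess (an assumption, not something I've confirmed in the docs) is that the UI summary percentage needs a numeric score column next to the label that llm_classify returns, so I add one before logging ("correct"/"incorrect" are the rails from the tutorial's template):

```python
# Sketch: derive a numeric score from the judge's label so an aggregate
# (e.g. fraction correct) can be computed. The rail names are from the
# tutorial's TOOL_CALLING_PROMPT_TEMPLATE; adjust if yours differ.
import pandas as pd

tool_call_eval = pd.DataFrame({"label": ["correct", "incorrect", "correct"]})
tool_call_eval["score"] = (tool_call_eval["label"] == "correct").astype(int)

print(tool_call_eval["score"].mean())  # fraction labelled correct
```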
Hi there, I am trying to implement evals for my agent developed with the OpenAI SDK, in the same manner as it is implemented here: https://github.com/Arize-ai/phoenix/blob/4f4ea07adcbdfa21be1be855da273d55515f61c1/tutorials/evals/evaluate_agent.ipynb#L4 But I am not able to do so. I am trying to implement this part of the code: "Function Calling Evals using LLM as a Judge". I am using a triage agent defined like this:
triage_agent = Agent(
name="triage_agent",
instructions="Always use appropriate tool based on the request.",
model="gpt-4o-2024-08-06",
tools=[calculator_tool, some_other_tool],
)

and I am running it with

output = await Runner.run(triage_agent, "How much is 28763+8763?")

Any help would be greatly appreciated!
