I have another question that is not directly connected to the issue above, but it came up while testing the evals described there.
Those evals work great when the LLM judge only has to evaluate a single call or several parallel calls. However, when two sequential steps are needed, for instance tool A has to be called first and then tool B, the second call is always labelled as incorrect.
For instance, I have a tool A (which gets the temperature of a chosen city) and a tool B (a calculator).
My question was: "Get me the temperature of NY and London and sum them up". To answer it, get_temperature should be called twice and then calculator_tool should be called to sum the temperatures.
The row relating to this LLM call was correctly labelled as correct:
[{'message': {'tool_calls': [{'tool_call': {'id': 'call_lfpBDHa',
'function': {'name': 'get_temperature',
'arguments': '{"city":"NY"}'}}},
{'tool_call': {'id': 'call_1oHhZs',
'function': {'name': 'get_temperature',
'arguments': '{"city":"London"}'}}}],
'role': 'assistant'}}]
However, the next LLM call, made after the one above, is this:
[{'message': {'tool_calls': [{'tool_call': {'id': 'call_LhbogWS',
'function': {'name': 'calculator_tool',
'arguments': '{"equation":"14 + 18"}'}}}],
'role': 'assistant'}}]
This call is also correct, but the LLM judge labels it as incorrect, with the following explanation:
The tool called is the calculator_tool, which is intended for performing calculations. However, the question specifically asks for the current temperature in NY and London and their sum, which requires the use of the get_temperature to retrieve the temperature first. The correct tool to answer the question is not the calculator_tool, but rather the get_temperature, which was correctly called for both cities. Therefore, the choice of the calculator_tool is incorrect as it does not directly answer the question without the prior price retrieval.
So basically the question is: does evaluation with an LLM judge somehow allow evaluating agents that need multiple sequential steps to reach the final answer? And if it does, how do I do that?
Thank you in advance for your answer again!