I thought so, I just needed to check if there is any functionality available in phoenix that already expects that. I'll do it this way then, it shouldn't be too hard. Thanks!:)
I have another question that is not directly connected to the issue above, but it came out of testing these evals. The evals work great if the LLM judge only has to assess a single step with one or more parallel tool calls. However, if two sequential steps are needed, for instance tool A has to be called and then tool B, the second call is always labelled as incorrect. For example, I have a tool A (which gets the temperature of a chosen city) and a tool B which is a calculator. My question was: "Get me the temperature of NY and London and sum them up". For this, get_temperature should be called twice and then the calculator should sum up the temperatures. The row relating to the first LLM call correctly got the correct label:

[{'message': {'tool_calls': [{'tool_call': {'id': 'call_lfpBDHa', 'function': {'name': 'get_temperature', 'arguments': '{"city":"NY"}'}}}, {'tool_call': {'id': 'call_1oHhZs', 'function': {'name': 'get_temperature', 'arguments': '{"city":"London"}'}}}], 'role': 'assistant'}}]

However, the LLM call that is made after the one above is this one:

[{'message': {'tool_calls': [{'tool_call': {'id': 'call_LhbogWS', 'function': {'name': 'calculator_tool', 'arguments': '{"equation":"14 + 18"}'}}}], 'role': 'assistant'}}]

This is correct, but the LLM judge is going to label it as incorrect. The explanation is as follows:
"The tool called is the calculator_tool, which is intended for performing calculations. However, the question specifically asks for the current temperature in NY and London and their sum, which requires the use of the get_temperature to retrieve the temperature first. The correct tool to answer the question is not the calculator_tool, but rather the get_temperature, which was correctly called for both cities. Therefore, the choice of the calculator_tool is incorrect as it does not directly answer the question without the prior price retrieval."

So basically the question is: does evaluation with an LLM as a judge somehow allow evaluating agents that need multiple sequential steps to reach the final answer? And if it does allow for that, how do I do it? Thank you in advance for your answer again!
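One workaround I'm considering (just a sketch, not an official Phoenix feature; the column names "question", "tool_call", and "prior_tool_calls" are my own invention): give the judge the trajectory so far by adding a column with the tool calls from earlier steps, and interpolate that column into the eval template so step two is judged in context rather than against the bare question.

```python
# Sketch: enrich each eval row with the tool calls made in earlier steps of
# the same trajectory, so the judge sees "what was already done" and does not
# penalize a correct second step. Column names here are assumptions, not a
# Phoenix convention.
import json
import pandas as pd

rows = [
    {"question": "Get me the temperature of NY and London and sum them up",
     "tool_call": "get_temperature"},
    {"question": "Get me the temperature of NY and London and sum them up",
     "tool_call": "calculator_tool"},
]
df = pd.DataFrame(rows)

# For step i, collect the tool calls from steps 0..i-1.
df["prior_tool_calls"] = [
    json.dumps(df["tool_call"].iloc[:i].tolist()) for i in range(len(df))
]

# A judge template could then interpolate both columns, e.g.:
# "Given the question {question} and the tools already called
#  ({prior_tool_calls}), is the next tool call {tool_call} correct?"
print(df["prior_tool_calls"].tolist())
```

With this, the calculator_tool row carries the fact that get_temperature was already called, which is exactly the context the judge was missing.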
thank you so much John! I am on it! Thank you again ❤️
No worries, I appreciate your help so much! I have to present the Phoenix evals on the OpenAI SDK tomorrow afternoon, so I am still working on figuring this out :D I was guessing the same, but I can't figure out where that context.span_id is usually created. I am using the OpenAI SDK, so for the traces all I need to do is:

tracer_provider = register(
    project_name="project_name",
    auto_instrument=True,
)
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)

Later I just call agent_output = await Runner.run(triage_agent, prompt) and I can see the traces in the app. But the thing is that this apparently creates context.span_id somewhere in the background and I don't know how to access it. The context.span_ids you saw above are indeed wrong: I created them manually back when I didn't know that I actually have to "catch" them somewhere in the code while creating the pandas dataframe (let's call it X) from the agent_output. That X is then passed to the llm_classify function, and yeah, if context.span_id is not created as it should be (while running the agent), the evals are nowhere to be seen. Do you maybe have any suggestion how I could create/catch those span_ids?
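For illustration, here is a toy pandas sketch of what I think should happen: the span ids come from the instrumentor, so the eval dataframe X should inherit them as its index instead of me inventing them. (In a live setup I believe the spans can be fetched back with px.Client().get_spans_dataframe(); the span ids and the output column below are made up placeholders.)

```python
# Toy sketch, no Phoenix server needed. In reality you would fetch:
#   import phoenix as px
#   spans = px.Client().get_spans_dataframe()
# and that frame's index already holds the span ids the instrumentor created.
# Here we fake a tiny spans frame with the same shape.
import pandas as pd

spans = pd.DataFrame(
    {"output_messages": ["<tool call A>", "<tool call B>"]},  # placeholder
    index=pd.Index(["span-abc", "span-def"], name="context.span_id"),
)

# Build the eval input X *from* the spans frame, so every row already carries
# the real span id in its index instead of a hand-made one.
X = spans.copy()
X["question"] = "Get me the temperature of NY and London and sum them up"
print(X.index.name)  # the index llm_classify / log_evaluations keys on
```

The point of the sketch: never construct context.span_id by hand; derive X from the spans frame so the ids line up with what is stored in Phoenix.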
I imported my dataframe into the Jupyter notebook that I linked above to demonstrate that the dataframe tool_call_eval in my case and in the experiment provided by Phoenix have the same structure. I really don't know what else could go wrong here when calling this and trying to log the results back to Phoenix:

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)
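For reference, this is the minimal shape I believe that call expects (my assumption: SpanEvaluations matches rows to spans via the dataframe index, so the index must hold the real span ids and be named context.span_id, with a label and/or score column alongside; the ids below are placeholders):

```python
# Minimal shape check before px.Client().log_evaluations(SpanEvaluations(...)).
# Assumption: rows are matched to spans through the dataframe index, so the
# index must contain the actual span ids recorded by the instrumentor.
import pandas as pd

tool_call_eval = pd.DataFrame(
    {"label": ["correct", "incorrect"], "score": [1, 0]},
    index=pd.Index(["span-abc", "span-def"], name="context.span_id"),
)

assert tool_call_eval.index.name == "context.span_id"
assert {"label", "score"} <= set(tool_call_eval.columns)
```

If the index holds ids that don't match any span in the project, I'd expect the log call to succeed silently while nothing shows up in the UI, which matches what I'm seeing.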
I totally understand this balance between ease of use and customizability - makes sense. However, sometimes it's a bit confusing what Phoenix needs in order to work, like in my current case with the evals 🥲 Unfortunately, even when I switched the toggle to "all" I can't see the eval scores. Do you have any other idea what I could do? I also ran the example you Phoenix folks provided, just to see whether it works for me at all, and the evals from this Jupyter notebook (which I followed in my implementation: https://github.com/Arize-ai/phoenix/blob/4f4ea07adcbdfa21be1be855da273d55515f61c1/tutorials/evals/evaluate_agent.ipynb#L4) did work for me. So something is off in my case.
basically eval labels are nowhere to be seen:
Hi John, in the end I managed to do it. It was difficult to understand how to export the outputs of the OpenAI SDK into a dataframe that could then be run through the llm_classify method. Also, this part of the code is outdated in the Jupyter notebook:
template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
"{tool_definitions}",
"generate_visualization, lookup_sales_data, analyze_sales_data, run_python_code",
)

However, I did manage to run the llm_classify method, but it is not successfully logged into Phoenix and I don't know why. I use this code (the dataframe is pasted in the photo), but I don't see it in the app as I should. I see the traces from calling the agent, and there is also information in the Phoenix app about the llm evaluation run, as you can see in the pasted picture. However, I don't see anywhere the indicator that would show the percentage of how successful the evaluation was:
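In case it helps: my guess (an assumption, not something I've confirmed in the docs) is that the UI summary percentage needs a numeric score column next to the label that llm_classify returns, so I add one before logging ("correct"/"incorrect" are the rails from the tutorial's template):

```python
# Sketch: derive a numeric score from the judge's label so an aggregate
# (e.g. fraction correct) can be computed. The rail names are from the
# tutorial's TOOL_CALLING_PROMPT_TEMPLATE; adjust if yours differ.
import pandas as pd

tool_call_eval = pd.DataFrame({"label": ["correct", "incorrect", "correct"]})
tool_call_eval["score"] = (tool_call_eval["label"] == "correct").astype(int)

print(tool_call_eval["score"].mean())  # fraction labelled correct
```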
Hi there, I am trying to implement evals for my agent developed with the OpenAI SDK, in the same manner as it is implemented here: https://github.com/Arize-ai/phoenix/blob/4f4ea07adcbdfa21be1be855da273d55515f61c1/tutorials/evals/evaluate_agent.ipynb#L4 But I am not able to do so. I am trying to implement this part of the code: "Function Calling Evals using LLM as a Judge". I am using a triage agent defined like this:
triage_agent = Agent(
name="triage_agent",
instructions="Always use appropriate tool based on the request.",
model="gpt-4o-2024-08-06",
tools=[calculator_tool, some_other_tool],
)

and I am running it with

output = await Runner.run(triage_agent, "How much is 28763+8763?")

Any help would be greatly appreciated!
