I have another question that is not directly connected to the issue above, but it came up while testing the evals described there.
Those evals work great when the LLM judge only has to evaluate a single call or several parallel calls. However, when two sequential steps are needed, for instance tool A has to be called first and then tool B, the second call is always labelled as incorrect.
For instance, I have a tool A (which gets the temperature of a chosen city) and a tool B (a calculator).
My question was: "Get me the temperature of NY and London and sum them up". To answer it, get_temperature should be called twice and then calculator_tool should be called to sum the temperatures.
The row relating to this LLM call was correctly labelled as correct:
[{'message': {'tool_calls': [{'tool_call': {'id': 'call_lfpBDHa',
'function': {'name': 'get_temperature',
'arguments': '{"city":"NY"}'}}},
{'tool_call': {'id': 'call_1oHhZs',
'function': {'name': 'get_temperature',
'arguments': '{"city":"London"}'}}}],
'role': 'assistant'}}]
However, the next LLM call, made after the one above, is this:
[{'message': {'tool_calls': [{'tool_call': {'id': 'call_LhbogWS',
'function': {'name': 'calculator_tool',
'arguments': '{"equation":"14 + 18"}'}}}],
'role': 'assistant'}}]
This call is also correct, but the LLM judge labels it as incorrect, with the following explanation:
The tool called is the calculator_tool, which is intended for performing calculations. However, the question specifically asks for the current temperature in NY and London and their sum, which requires the use of the get_temperature to retrieve the temperature first. The correct tool to answer the question is not the calculator_tool, but rather the get_temperature, which was correctly called for both cities. Therefore, the choice of the calculator_tool is incorrect as it does not directly answer the question without the prior price retrieval.
So basically the question is: does evaluation with an LLM judge somehow allow evaluating agents that need multiple sequential steps to reach the final answer? And if it does, how do I do that?
Thank you in advance for your answer again!