Jason - This is exactly the problem I'm facing at work too. I'm working on an LLM chatbot that has to choose between multiple functions before making a final decision, and it almost always gets the choice wrong, even across multiple runs of the same function.
I'm exploring a technique called Reflexion that LangChain has an implementation of. The idea is to have a revisor that critiques the choice of tools and functions until a stopping criterion is met. It's a very GAN-like framework, with a generator and a discriminator, but it feels intuitive. They currently drive it with prompts like the one below, though I'm sure there's a more eval-driven way to do it:
"system",
"""You are expert researcher.
Current time: {time}
1. {first_instruction}
2. Reflect and critique your answer. Be severe to maximize improvement.
3. Recommend search queries to research information and improve your answer."""
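To make the loop concrete, here's a minimal sketch of what that generate/critique/revise cycle could look like in plain Python. This is not LangChain's actual API; `call_llm`, the `"CRITIC:"` prompt convention, the `"OK"` stopping signal, and the round limit are all stand-ins I made up for illustration:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; a real implementation would hit a model API.

    This stub just simulates one critique round so the loop below is runnable.
    """
    if prompt.startswith("CRITIC:"):
        # The critic approves once the answer has been revised, else complains.
        return "OK" if "revised" in prompt else "Too vague; add detail."
    if "Too vague" in prompt:
        # The generator incorporates the critique into a revised answer.
        return "revised answer with more detail"
    return "draft answer"


def reflexion_loop(question: str, max_rounds: int = 3) -> str:
    """Generate a draft, then alternate critique and revision until the
    critic signals the stopping criterion or the round budget runs out."""
    answer = call_llm(question)  # initial draft
    for _ in range(max_rounds):
        critique = call_llm(f"CRITIC: {answer}")
        if critique.strip() == "OK":  # stopping criterion met
            break
        # Feed the critique back to the generator as revision context.
        answer = call_llm(
            f"REVISE: {question}\nPrevious answer: {answer}\nCritique: {critique}"
        )
    return answer


print(reflexion_loop("Which function should the bot call?"))
# → revised answer with more detail
```

The nice thing about structuring it this way is that the stopping criterion becomes a real function you can swap out, so instead of a prompt saying "be severe," you could plug in an actual eval (a scorer, a test suite, a tool-choice checker) as the discriminator.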
Any thoughts?