I'm really feeling this point this week and wanted to put it to the community. I've been working on a use case that leans heavily on probabilistic function calling (a real function-calling use case). Even with GPT-4 and the best models out there, it's still very hard to control the flow. I add a line explicitly telling it to call the function when X occurs, and it's hit and miss. Instruction following feels very brittle when those instructions are about function callbacks. This doesn't bode well for anyone trying to build true general agents right now. My current take is the models just feel "not" there yet.
Jason - this is the exact same problem I'm facing at work. I'm working on an LLM chatbot that has to choose between multiple functions before making a final decision, and it almost always misses, even across multiple runs of the same function. I'm exploring a technique LangChain implements called Reflexion. The idea is to have a revisor that critiques the choice of tools and functions until a criterion is met. It's a very GAN-like setup, a discriminator and a generator, but it feels intuitive. They're currently using prompts like the one below, though I'm sure there's a more eval-driven way to do it.
"system",
"""You are expert researcher.
Current time: {time}
1. {first_instruction}
2. Reflect and critique your answer. Be severe to maximize improvement.
3. Recommend search queries to research information and improve your answer."""Any thoughts?
Feels like a potential use case for instruction-tuned models.
Are you using an orchestrator agent or anything? Typically prompting alone won't be enough to get an agent to call functions at the right time, but separating out the tool/function calling and explicitly adding a step where a separate model/prompt/agent runs to analyze the current "state" will often lead to much better results. Without knowing your exact use case, a general example would be to add a linear step at the end of your output where you make a new call to GPT-4 and pass in things like the objective, the current state ("chat history" so far), the proposed next step, etc., and in that step explicitly ask it to predict whether a call to a function/tool is necessary. That's a pretty boilerplate example using an LLM, but it could be as basic as a string match where some term always means use some tool. The real takeaway is that explicitly pulling this decision out into its own analysis step seems to work a lot better in my experience.
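To make that concrete, here's a rough sketch of the separated decision step. The keyword table and tool names are made up for illustration, and `llm_judge` is a hypothetical hook where the dedicated GPT-4 "do we need a tool?" call would plug in:

```python
def should_call_tool(chat_history, proposed_step, llm_judge=None):
    """Explicit, separate step that decides whether a tool call is needed.
    Tries a cheap keyword heuristic first, then falls back to an LLM
    judge (e.g. a dedicated GPT-4 prompt) when one is provided."""
    # Hypothetical keyword -> tool routing table; a term appearing in the
    # latest message or proposed step always means use that tool.
    KEYWORD_TOOLS = {
        "weather": "get_weather",
        "search": "web_search",
    }
    text = (chat_history[-1] + " " + proposed_step).lower()
    for keyword, tool in KEYWORD_TOOLS.items():
        if keyword in text:
            return tool
    if llm_judge is not None:
        # The judge sees the state and proposed step and returns a tool
        # name or None; in practice this is the extra GPT-4 call that
        # gets the objective and chat history in its prompt.
        return llm_judge(chat_history, proposed_step)
    return None
```

The point isn't the heuristic itself, it's that the agent's main generation never has to decide "should I call something?" inline; that question gets its own step with its own prompt.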
Oh also just saw a paper on this from Meta yesterday that you may find interesting. I haven’t read it but this twitter summary seems relevant - https://twitter.com/jaseweston/status/1760840038055579718
Vibhu S. it's an interesting idea; I think if we get to a spot where we hit a wall I might try it. It just might be cognitively asking too much of a single call to do, in one go, the analysis and the function call. The LLM suggests the function call ("you should call Y") but won't make it, without a please. Sometimes it's completely explicit, user says X, please call Y, and it still doesn't do it.
I saw that Facebook paper, need to read it.
It's just felt to me like the probabilistic decision to call the function is much flakier than most GPT-4 results.
Berkeley has a leaderboard; the best accuracy for function calling is still around 80%: https://gorilla.cs.berkeley.edu/leaderboard.html
