i have a question about the run_evals function - does it only accept built-in evaluators, or is it possible to use this and specify multiple custom evals?
The run_evals function in Phoenix can accept both built-in and custom evaluators. You can specify multiple custom evaluators by creating them as Python functions and passing them to run_evals. These evaluators can be designed to assess various aspects of your data, such as relevance, hallucinations, or toxicity, and they will return labels, scores, and optional explanations for each row in the input dataframe.

Would you like more detail on how to create and use custom evaluators with run_evals?

Sources:
- Phoenix API Evals
💡 Hint: Mention RunLLM in the thread for followups.
It's not clear to me how to create custom evaluators as python functions
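For what it's worth, the general shape of a function-style evaluator is roughly the following. This is a hand-rolled sketch, not the exact Phoenix API: the function name, the keyword check, and the label/score/explanation dict keys are all illustrative assumptions, so check the Phoenix evals docs for the real interface expected by run_evals.

```python
# Sketch of a custom evaluator written as a plain Python function.
# The exact signature and return type Phoenix expects may differ --
# the names below are illustrative only.

def contains_measurement_criteria(output: str) -> dict:
    """Toy evaluator: labels a row via a simple keyword check."""
    keywords = ("measure", "metric", "criteria")
    matched = any(k in output.lower() for k in keywords)
    return {
        "label": "correct" if matched else "incorrect",
        "score": 1.0 if matched else 0.0,
        "explanation": f"keyword match: {matched}",
    }

result = contains_measurement_criteria("We will measure success by retention.")
```

The idea is just that each row's output goes in and a label, score, and optional explanation come back out, mirroring what the built-in evaluators produce.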
thanks! i'll give it a try
hi Mikyo, wanted to get your input on the use cases for experiments and custom evals (phoenix.evals) and when you would use either or both. This might be a bit long-winded so I apologize if it's hard to follow.

I have a task I'd like an LLM to do which involves taking some text content and asking it to extract a specific part of the text (e.g. given the text of a SMART goal, give me back only the portion that describes the measurement criteria used). I have a starting prompt that can perform this task, as well as a sample of goal texts to use as input. What's the best way to then evaluate the performance of this LLM task using an LLM?

At first, I thought I needed to write a custom LLM eval that would be given the input goal text and the extracted excerpt, and then return "correct" or "incorrect". Per the docs, to create a custom LLM eval I need to create a groundtruth dataset, which in my case consists of the input goal text, an excerpt, and a groundtruth label of correct/incorrect. Then I run llm_classify with my eval prompt template on this dataset, and compare the output of the eval with the groundtruth labels to measure how good the custom eval is. Basically, to create a custom eval for my original task, I have to evaluate my custom eval using groundtruth data. Then assuming the custom eval performs well enough on the groundtruth data, I can use it as an eval for the original task.

Where would datasets and experiments come in this workflow? Would I need to run experiments for the custom eval prompt by using a dataset that consists of the input, excerpt, and groundtruth label, and write an eval function to see if the generated label matches the groundtruth label? Then once I run enough experiments to have a good eval prompt, I can use that eval prompt as an eval function for an experiment that runs the original task of extracting the excerpt from goal text?

Thanks for helping me think/walk through what this process looks like.
Hey Hain-Lee H., this is a very good question. If I understand your scenario correctly, I would definitely push you down the direction of using datasets and experiments, and would suggest starting without the use of an LLM for evaluation. This means your dataset would look like:
[
  { input: sometext, output: expected_text }
]

where the output section is the expected - e.g. the part of the text that you want the LLM to extract. You can then write a simple "code" evaluator like so:

def exact_match(output, expected):
    return output == expected

If you want to allow for some fault tolerance you can use libraries like autoevals or things like ROUGE score too.
Here are some examples of how you can use experiments https://docs.arize.com/phoenix/datasets-and-experiments/use-cases-datasets
You can certainly use an LLM as a judge too but sometimes starting without can get your data in the right shape
And as you mention, llm as a judge requires some benchmarking, which can take time to get past the "vibe" check level
got it, so sounds like for the task itself it's best to have groundtruth dataset and run experiments over that to start with, rather than to work backwards by trying to create a custom eval first. would the custom eval for the task be more for monitoring then? and to craft an llm as a judge eval itself we would leverage the experiment/dataset workflow like i described?
LLM as a judge is useful for both monitoring and experimentation - I think if you get a good judge setup that template or model can be re-used for both.
the judge can catch issues as you develop, and the judge can be used to monitor.
would you recommend using experiments and datasets to develop the llm judge itself?
or at this stage is it more like "you can, and we're still figuring out the easiest ways for creating an llm judge"? i know there's the documented conceptual steps of creating a judge, but it doesn't show what to actually do for step 5, so i've just been writing my own code to run llm_classify in a loop
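For reference, the comparison at the end of that kind of loop reduces to something like the sketch below. The judge's labels are hard-coded here because a real llm_classify call needs a model and API access, so treat the variable contents as placeholder data:

```python
# Sketch: benchmarking an LLM judge against a groundtruth dataset.
# In practice judge_labels would come from running llm_classify over the
# benchmark dataframe; they are hard-coded here so the logic is runnable.

groundtruth = ["correct", "incorrect", "correct", "correct"]
judge_labels = ["correct", "incorrect", "incorrect", "correct"]

# Row-by-row agreement between the judge and the groundtruth labels.
agreement = sum(g == j for g, j in zip(groundtruth, judge_labels))
accuracy = agreement / len(groundtruth)
print(f"judge agrees on {agreement}/{len(groundtruth)} rows "
      f"(accuracy {accuracy:.2f})")
```

Once the judge's accuracy on the groundtruth set is acceptable, the same template can be promoted to an evaluator for the original extraction task.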
Hain-Lee H. that's a good call out. Experiments are a great way to develop judges. Something we should highlight. Cc John G.
