HI!
I want to set up some custom evaluators to compare the outputs of different prompt versions. How can I set up evaluators in the UI itself? I'm using golden datasets for this.
Thanks!
To set up custom evaluators in Phoenix, you can use the Prompt Playground to test and compare different prompt versions. The Playground lets you run prompts over datasets and evaluate their performance. You can create custom evaluators by writing Python functions that return a boolean or numeric value, which is recorded as the evaluation score. These evaluators can then be used to assess the outputs of different prompt versions.

For more detailed guidance, refer to the Phoenix documentation on using evaluators and testing prompts in the Playground. Would you like more detail on setting up these evaluators or using the Playground?

Sources:
- Testing a Prompt in the Playground
- Phoenix documentation on using evaluators
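As a rough sketch of what such evaluator functions can look like: plain Python functions that return a boolean or numeric score, run against each row's output and the golden answer. The function names and the `output`/`expected` parameter names below are illustrative assumptions, not a documented Phoenix signature, so check them against the Phoenix evaluator docs before wiring them into an experiment.

```python
# Hypothetical custom evaluators for comparing prompt outputs against
# a golden dataset. Each returns a value that would be recorded as the
# evaluation score: a boolean for pass/fail, or a float for a graded score.

def contains_expected(output: str, expected: str) -> bool:
    # Boolean score: does the model output include the golden answer?
    return expected.lower() in output.lower()

def length_ratio(output: str, expected: str) -> float:
    # Numeric score in [0, 1]: how close the output length is to the
    # golden answer's length (a crude proxy for verbosity control).
    if not expected:
        return 0.0
    return min(len(output) / len(expected), 1.0)
```

Functions like these can be passed as evaluators when running an experiment over a dataset, so each prompt version gets comparable scores per row.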
💡 Hint: Mention RunLLM in the thread for followups.
Hey Stuti P., server-side evaluators are on the roadmap but not shipped yet. John G. has some talks that show how you can run evaluators in a notebook to update your Playground experiments in the meantime.