Hey, trying to use evals inside experiments in phoenix. Triggering experiment like first SS and the output of the eval looks like 2nd SS. Is there a way to add also add explanation for evaluations on experiment runs? If so, can anyone share an example?
Yes, you can include explanations in evaluations for experiment runs in Phoenix. When defining your evaluator, return a dictionary with keys like `label`, `score`, and `explanation`, or use the `EvaluationResult` object to include an explanation. This allows explanations to be attached and displayed with each evaluation result in the experiment output.

Example (Python):

```python
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(name="custom_eval", kind="CODE")
def my_eval(output, expected):
    matches = output == expected
    return {
        "label": "match" if matches else "no match",
        "score": 1 if matches else 0,
        "explanation": "Outputs match exactly." if matches else "Outputs differ.",
    }

experiment = run_experiment(dataset, task, evaluators=[my_eval])
```

This will add explanations to each evaluation in your experiment runs. For more, see Using Evaluators and the GitHub example. Would you like more detail or a step-by-step walkthrough?

Sources:
- GitHub example: evaluate_experiment.ts
- Using Evaluators
💡 Hint: Mention RunLLM in the thread for followups.
I'm getting this error when I return an evaluation response as defined by the bot:
If you replace all your `run_experiment` imports with `from phoenix.client.experiments import run_experiment` (and are on the latest version of arize-phoenix-client), this should work.
Cool, thanks. Am I no longer able to run it in a background thread? This worked when using `run_experiment` with arize-phoenix. Also, what's the difference between arize-phoenix and arize-phoenix-client?
Is this what happens when you try to Ctrl+C?
No, I am triggering an experiment run, which used to work fine. It is running as a background task; here is the function it is running:
what's weird is it works in the background if I switch back to the other import:
The biggest difference between the two is that we no longer try to start an event loop inside `run_experiment`. Instead we provide `run_experiment` and `async_run_experiment` if you want to run with concurrency, since maintaining a single entry point was inflexible for many use cases and had a lot of edge cases.
I'm not entirely sure why running in a thread isn't working for you, but if you need concurrency can you try using the async version instead?
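For reference, a minimal sketch of the async pattern being suggested. The task below is a stand-in; with arize-phoenix-client you would instead `await async_run_experiment(...)` from `phoenix.client.experiments` (name taken from the message above) inside your coroutine:

```python
import asyncio

# Stand-in async task. In a real setup you would replace this body with
# `await async_run_experiment(dataset=..., task=..., evaluators=[...])`.
async def run_task(example: str) -> str:
    await asyncio.sleep(0)  # simulate async work
    return example.upper()

async def main() -> list[str]:
    # Run several invocations concurrently, the way the async entry point
    # is meant to be driven from an existing event loop.
    return await asyncio.gather(*(run_task(x) for x in ["a", "b"]))

results = asyncio.run(main())
```

Because `async_run_experiment` is a coroutine, it composes with whatever event loop your background task already runs on, instead of trying to start its own.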
For now, if you need the original `run_experiment`: instead of returning a dictionary, return this object:

```python
from phoenix.client.resources.experiments.types import EvaluationResult

def evaluator(input, reference):
    ...
    return EvaluationResult(
        score=1.0,
        label="true",
        explanation="because I said so",
    )
```

I'll investigate why the new client version isn't working in a thread where the old one does.
Still no 🙁. Happy to hop on a call if it's easier for you.
The error you're getting is because the evaluator is returning a dict, so I think you might be running a script that hasn't picked up your evaluator changes
The `EvaluationResult` object you should be returning (as shown in the screenshot) is a dataclass.
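To illustrate the dict-vs-dataclass distinction, here is a minimal stdlib sketch. The `EvaluationResult` below is a stand-in dataclass mimicking the fields discussed above (score, label, explanation), not the real class from `phoenix.client.resources.experiments.types`:

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in mirroring the assumed shape of phoenix's EvaluationResult dataclass.
@dataclass
class EvaluationResult:
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None

def evaluator(output, expected):
    matches = output == expected
    # Return the dataclass instance, not a plain dict, so the client
    # recognizes the evaluation payload (per the thread above).
    return EvaluationResult(
        score=1.0 if matches else 0.0,
        label="match" if matches else "no match",
        explanation="Outputs match exactly." if matches else "Outputs differ.",
    )

result = evaluator("a", "a")
```

If the running script still returns a dict, make sure the process was restarted so it picks up the evaluator change.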
