Hey team, been testing Phoenix this week and loving it so far! ⭐ I’m running into some issues when trying to call evaluate_experiment for an experiment that was previously run. Code and exception in 🧵
The evaluate_experiment function is used to add additional evaluations to an already-run experiment. Ensure that the experiment object passed to evaluate_experiment is the same one returned by run_experiment. If you encounter an exception, it might be due to incorrect parameters or an issue with the evaluators. Would you like more detail on defining tasks and evaluators, or on troubleshooting the specific exception you're encountering? Sources:
- Phoenix API Reference
Here’s my code:
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.types import Experiment
exp = Experiment(
    id="RXhwZXJpbWVudDoxMA",  # retrieved from UI
    dataset_id="RGF0YXNldDoz",  # retrieved from UI
    dataset_version_id="RGF0YXNldFZlcnNpb246Mw==",  # retrieved from UI
    repetitions=1,
    project_name="",
)

evaluate_experiment(
    exp,
    evaluators=[evaluator],
    dry_run=True,
)

Exception:
KeyError: 'id'
File <command-1409731535359352>, line 11
      2 from phoenix.experiments.types import Experiment
      4 exp = Experiment(
      5     id="RXhwZXJpbWVudDoxMA",  # mistral large prompt 3
      6     dataset_id="RGF0YXNldDoz",
    (...)
      9     project_name=""
     10 )
---> 11 evaluate_experiment(
     12     exp,
     13     evaluators=[is_correct_direct],
     14     dry_run=True
     15 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-63346d2e-9ea7-4cb2-b982-36a044f67a4c/lib/python3.11/site-packages/phoenix/experiments/functions.py:442, in evaluate_experiment(experiment, evaluators, dry_run, print_summary, rate_limit_errors)
    440     ran_experiment: RanExperiment = experiment
    441 else:
--> 442     dataset = Dataset.from_dict(
    443         sync_client.get(
    444             f"/v1/datasets/{dataset_id}/examples",
    445             params={"version_id": str(dataset_version_id)},
    446         ).json()["data"]
    447     )
    448     if not dataset.examples:
    449         raise ValueError(f"Dataset has no examples: {dataset_id=}, {dataset_version_id=}")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-63346d2e-9ea7-4cb2-b982-36a044f67a4c/lib/python3.11/site-packages/phoenix/experiments/types.py:170, in Dataset.from_dict(cls, obj)
    166 @classmethod
    167 def from_dict(cls, obj: Mapping[str, Any]) -> Dataset:
    168     examples = tuple(map(Example.from_dict, obj.get("examples") or ()))
    169     return cls(
--> 170         id=obj["id"],
    171         version_id=obj["version_id"],
    172         examples={ex.id: ex for ex in examples},
    173     )
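For anyone skimming this thread later, here's a minimal local repro of that KeyError. The payload shape below is an assumption on my part; all the traceback actually proves is that the dict handed to Dataset.from_dict has no "id" key:

```python
# Hypothetical shape of the GET /v1/datasets/{id}/examples response body --
# the exact key names here are guesses, not verified against the server.
payload = {
    "data": {
        "dataset_id": "RGF0YXNldDoz",
        "version_id": "RGF0YXNldFZlcnNpb246Mw==",
        "examples": [],
    }
}

obj = payload["data"]

# Mimic the lookup that fails in Dataset.from_dict (types.py:170):
# obj["id"] raises KeyError because the dict only has "dataset_id".
try:
    obj["id"]
    failed = False
except KeyError:
    failed = True

print(failed)
```

So whatever the real response keys are, the dict reaching `from_dict` simply doesn't carry the `"id"` the constructor expects.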
Is it not possible to run evaluate_experiment for an experiment that wasn’t directly returned from run_experiment? I’ve already run many experiments in parallel and would like to evaluate them retroactively without rerunning the old experiments.
Hi Anton A., thanks so much for trying out Phoenix! At the moment it's probably going to be a little annoying to set that up, but I'm planning to add more convenient client methods to grab previously run experiments from the database in the next few days. For the time being you'd have to manually recreate the RanExperiment object that's returned from run_experiment, but we're planning on this becoming much more ergonomic very soon.
this is a bug on our end. we’ll get it fixed right away
Gotcha, thanks both! Excited to see that this is actively being worked on 🙂
Hey Anton, I also responded to your PR. Let me know if it'd be possible for you to write your task as a coroutine function to enable concurrency.
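In case it helps: making the task an `async def` is all that's needed on the task side; everything in this sketch (the task name, the example dict, the return shape) is made up, and the actual run_experiment call is omitted since it needs a live Phoenix server:

```python
import asyncio

# Hypothetical coroutine task. The single `example` argument mirrors how
# experiment tasks are commonly written; the body is a stand-in for real work.
async def my_task(example):
    # Await real I/O here (an LLM call, an HTTP request, ...);
    # asyncio.sleep(0) just yields control to the event loop in this sketch.
    await asyncio.sleep(0)
    return {"output": f"answer for {example['id']}"}

# Quick local smoke test without Phoenix:
result = asyncio.run(my_task({"id": "ex-1"}))
print(result)  # → {'output': 'answer for ex-1'}
```

Because the function is a coroutine, an experiment runner can await many task invocations concurrently instead of blocking on each one in turn.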
