Hey team, been testing Phoenix this week and loving it so far! ⭐ I’m running into some issues when trying to call evaluate_experiment for an experiment that was previously run. Code and exception in 🧵
The evaluate_experiment function is used to add additional evaluations to an already-run experiment. Ensure that the experiment object passed to evaluate_experiment is the same one returned by run_experiment. If you encounter an exception, it might be due to incorrect parameters or an issue with the evaluators. Would you like more detail on defining tasks and evaluators, or on troubleshooting the specific exception you're encountering? Sources:
- Phoenix API Reference
Here’s my code:
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.types import Experiment
exp = Experiment(
    id="RXhwZXJpbWVudDoxMA",  # retrieved from UI
    dataset_id="RGF0YXNldDoz",  # retrieved from UI
    dataset_version_id="RGF0YXNldFZlcnNpb246Mw==",  # retrieved from UI
    repetitions=1,
    project_name="",
)

evaluate_experiment(
    exp,
    evaluators=[evaluator],
    dry_run=True,
)

Exception:
KeyError: 'id'
File <command-1409731535359352>, line 11
      2 from phoenix.experiments.types import Experiment
      4 exp = Experiment(
      5     id="RXhwZXJpbWVudDoxMA",  # mistral large prompt 3
      6     dataset_id="RGF0YXNldDoz",
    (...)
      9     project_name=""
     10 )
---> 11 evaluate_experiment(
     12     exp,
     13     evaluators=[is_correct_direct],
     14     dry_run=True
     15 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-63346d2e-9ea7-4cb2-b982-36a044f67a4c/lib/python3.11/site-packages/phoenix/experiments/functions.py:442, in evaluate_experiment(experiment, evaluators, dry_run, print_summary, rate_limit_errors)
    440     ran_experiment: RanExperiment = experiment
    441 else:
--> 442     dataset = Dataset.from_dict(
    443         sync_client.get(
    444             f"/v1/datasets/{dataset_id}/examples",
    445             params={"version_id": str(dataset_version_id)},
    446         ).json()["data"]
    447     )
    448     if not dataset.examples:
    449         raise ValueError(f"Dataset has no examples: {dataset_id=}, {dataset_version_id=}")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-63346d2e-9ea7-4cb2-b982-36a044f67a4c/lib/python3.11/site-packages/phoenix/experiments/types.py:170, in Dataset.from_dict(cls, obj)
    166 @classmethod
    167 def from_dict(cls, obj: Mapping[str, Any]) -> Dataset:
    168     examples = tuple(map(Example.from_dict, obj.get("examples") or ()))
    169     return cls(
--> 170         id=obj["id"],
    171         version_id=obj["version_id"],
    172         examples={ex.id: ex for ex in examples},
    173     )
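For anyone skimming this thread later, here's a minimal local repro of that KeyError. The payload shape below is an assumption on my part; all the traceback actually proves is that the dict handed to Dataset.from_dict has no "id" key:

```python
# Hypothetical shape of the GET /v1/datasets/{id}/examples response body --
# the exact key names here are guesses, not verified against the server.
payload = {
    "data": {
        "dataset_id": "RGF0YXNldDoz",
        "version_id": "RGF0YXNldFZlcnNpb246Mw==",
        "examples": [],
    }
}

obj = payload["data"]

# Mimic the lookup that fails in Dataset.from_dict (types.py:170):
# obj["id"] raises KeyError because the dict only has "dataset_id".
try:
    obj["id"]
    failed = False
except KeyError:
    failed = True

print(failed)
```

So whatever the real response keys are, the dict reaching `from_dict` simply doesn't carry the `"id"` the constructor expects.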
Is it not possible to run evaluate_experiment for an experiment that wasn’t directly returned from run_experiment? I’ve already run many experiments in parallel and would like to evaluate them retroactively without rerunning the old experiments.
Hi Anton A., thanks so much for trying out Phoenix! At the moment it's probably going to be a little annoying to set that up, but I'm planning to add more convenient client methods to grab previously run experiments from the database in the next few days. For the time being you'd have to manually recreate the RanExperiment object that's returned from run_experiment, but we're planning on this becoming much more ergonomic very soon.
this is a bug on our end. we’ll get it fixed right away
Gotcha, thanks both! Excited to see that this is actively being worked on 🙂
Hey Anton, I also responded to your PR. Let me know if it'd be possible for you to write your task as a coroutine function to enable concurrency.
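In case it helps: making the task an `async def` is all that's needed on the task side; everything in this sketch (the task name, the example dict, the return shape) is made up, and the actual run_experiment call is omitted since it needs a live Phoenix server:

```python
import asyncio

# Hypothetical coroutine task. The single `example` argument mirrors how
# experiment tasks are commonly written; the body is a stand-in for real work.
async def my_task(example):
    # Await real I/O here (an LLM call, an HTTP request, ...);
    # asyncio.sleep(0) just yields control to the event loop in this sketch.
    await asyncio.sleep(0)
    return {"output": f"answer for {example['id']}"}

# Quick local smoke test without Phoenix:
result = asyncio.run(my_task({"id": "ex-1"}))
print(result)  # → {'output': 'answer for ex-1'}
```

Because the function is a coroutine, an experiment runner can await many task invocations concurrently instead of blocking on each one in turn.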
