Hi friends, is there a way to filter failed examples from a specific experiment, rerun them, and add the results to the same experiment? I have a large dataset with 10k+ examples and the LLM calls failed sporadically; it would be great to just iteratively rerun the failed examples until none are left with a non-None error.
It seems likely that Phoenix supports filtering failed experiment runs (e.g., where error is not None) and viewing them, but the data sources make no explicit mention of a built-in method to rerun only the failed examples and append the results to the same experiment. Filtering by error status is discussed as a desired feature and is possible via the UI or code, but rerunning and updating the same experiment is not directly documented: reruns typically create new experiments or runs rather than updating existing ones (github_issue #5681, github_issue #5015, experiment_runs_filters.ipynb). You may need to filter the failed runs, extract their examples, and rerun them as a new experiment or batch, rather than updating the original experiment. Would you like more detail on how to programmatically filter and rerun failed examples?
Sources: github_issue #5681, experiment_runs_filters.ipynb, github_issue #5015
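For reference, here is a rough sketch of what that programmatic filter-and-rerun could look like. This is not taken from the sources above: attribute names such as experiment.runs, run.error, run.dataset_example_id, and the upload_dataset keyword shape are assumptions, so check them against your installed arize-phoenix version.
```python
# Rough sketch, NOT verified against a specific Phoenix release.
# Assumptions (check against your version): client.get_experiment(...) returns an object
# with a `runs` mapping, each run exposes `error` and `dataset_example_id`, the dataset's
# `examples` is a mapping of id -> example, and upload_dataset accepts inputs/outputs lists.
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()

dataset = client.get_dataset(name="my-10k-dataset")                  # the original dataset
experiment = client.get_experiment(experiment_id="ORIGINAL_EXPERIMENT_ID")

# 1. Collect the ids of examples whose runs errored.
failed_example_ids = {
    run.dataset_example_id
    for run in experiment.runs.values()
    if run.error is not None
}

# 2. Re-upload just those examples as a smaller "retry" dataset.
failed_examples = [
    example for example in dataset.examples.values()
    if example.id in failed_example_ids
]
retry_dataset = client.upload_dataset(
    dataset_name="my-10k-dataset-retry",
    inputs=[example.input for example in failed_examples],
    outputs=[example.output for example in failed_examples],
)

# 3. Rerun the same task on the failed subset -- this creates a NEW experiment,
#    it does not append runs to the original one.
retry_experiment = run_experiment(
    retry_dataset,
    task=my_task,  # the same task function used for the original experiment
    experiment_name="retry-of-original",
)
```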
Yes, please show me how to programmatically filter and rerun failed examples. Understood that this would create a different experiment. But is there a workaround for me to "stitch" together two experiments' results depending on the error/success status of the individual examples?
hi Jo P., currently we don't have a great way to stitch together experiments, though I understand why you'd want this functionality. Could you give me a little more information about how the failures are happening? Experiments do have some built-in retry machinery that should alleviate most sporadic issues
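Since stitching isn't built in, one possible client-side workaround is to merge the two experiments' runs yourself, preferring the retry result wherever the original run failed. This is only a sketch under the same assumptions as above (that runs expose dataset_example_id, output, and error fields), and the merged view lives only on the client:
```python
# Client-side "stitch" sketch -- not a built-in Phoenix feature. Assumes both experiments
# expose a `runs` mapping whose entries carry `dataset_example_id`, `output`, and `error`.
import phoenix as px

client = px.Client()

def runs_by_example(experiment):
    """Index an experiment's runs by the dataset example they belong to."""
    return {run.dataset_example_id: run for run in experiment.runs.values()}

original = runs_by_example(client.get_experiment(experiment_id="ORIGINAL_EXPERIMENT_ID"))
retry = runs_by_example(client.get_experiment(experiment_id="RETRY_EXPERIMENT_ID"))

# Prefer the retry run wherever the original run failed and a retry exists.
stitched = {
    example_id: retry[example_id]
    if run.error is not None and example_id in retry
    else run
    for example_id, run in original.items()
}

# `stitched` maps each example id to its best available run; note that this merged view
# is not written back into either Phoenix experiment.
```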
About the built-in retry mechanism, how many times does it retry?
ah, that's super reasonable. by default it's set to requeue up to 10 times IIRC, but that's because occasionally requests will time out of our maximum wait time for a response
Got it. Should I put up a feature request for this?
Are there any other workarounds you'd recommend in the meantime?
yes, please file an issue and we'll try our best to prioritize it
moving experiments over to the thin phoenix client is top of mind for us so hopefully we can get to this soon
By moving experiments over to the thin phoenix client, do you mean exposing more methods / attributes on the experiment returned by:
experiment = px.Client().get_experiment(experiment_id=experiment_id)
If so, that'll be great! I naturally looked there but found nothing substantial or useful.
ah sorry for the confusion here! When you install phoenix you also install the arize-phoenix-client package, importable under phoenix.client. This is a newer version of our client with a minimal dependency footprint that we're hoping to bring to parity with all current phoenix features
the idea is we'll have a new interface for experiments under the phoenix.client package that should be (mostly) backwards compatible, lighter weight dependency-wise, and with some much-needed ergonomics updates
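For anyone reading later, the distinction is roughly between these two imports (only the imports are illustrated; the experiments interface under phoenix.client is still in progress per the message above):
```python
import phoenix as px               # classic client: px.Client().get_experiment(...)
from phoenix.client import Client  # thin client shipped by arize-phoenix-client

legacy_client = px.Client()
thin_client = Client()             # experiments support here is planned, per this thread
```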
