hey everyone! was just trying to upload a new version of an existing dataset to our Phoenix instance using the upload_dataset function. Here is an example of what I was doing.
dataset = px.Client().upload_dataset(
    dataset_name="test_dataset",
    dataframe=df,
    input_keys=["input"],
    output_keys=["output"],
)
DatasetUploadError: Dataset with the same name already exists: name='test_dataset'

I tried the function from the docs (https://docs.arize.com/phoenix/api/client#append_dataset), but append_dataset doesn't seem to exist. I'm probably just missing something in the documentation, so could someone point me to how to do this correctly? Thanks!
To upload a new version of an existing dataset in Phoenix, you should use the append_to_dataset method instead of upload_dataset. The upload_dataset method is intended for creating new datasets and will raise an error if a dataset with the same name already exists. The append_to_dataset method adds new examples to an existing dataset. Here's how you can use it:

dataset = px.Client().append_to_dataset(
    dataset_name="test_dataset",
    dataframe=df,
    input_keys=["input"],
    output_keys=["output"],
)

This will append the new data to the existing dataset named "test_dataset". If the dataset does not exist, it will be created. Would you like more details on this process?

Sources:
- Phoenix API Reference
💡 Hint: Mention RunLLM in the thread for followups.
looks like the docs have a typo. can you try append_to_dataset as suggested above?
hey Roger Y.! Is there any way to remove examples from the dataset instead of just appending? I ask because our actual use case is to keep the same examples in the current dataset but rename the columns to better match our internal processes. I was trying to avoid creating a new dataset with duplicate data when we could just use versioning and point to the new version.
Currently, examples can only be deleted in the UI, but as a workaround you can mimic the UI and issue a GraphQL mutation like the one below:
import requests

# Delete examples by their node IDs (the base64 IDs shown in the UI)
response = requests.post(
    "http://localhost:6006/graphql",
    json={
        "query": "mutation($input:DeleteDatasetExamplesInput!){deleteDatasetExamples(input:$input){dataset{id}}}",
        "variables": {
            "input": {
                "exampleIds": [
                    "RGF0YXNldEV4YW1wbGU6OTk=",
                    "RGF0YXNldEV4YW1wbGU6OTg=",
                ]
            }
        },
    },
)
print(response.json())
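If you have several IDs to delete, it can help to build the request body with a small helper. A minimal sketch, assuming the same mutation as above; build_delete_payload is an illustrative name, not a Phoenix API, and the actual POST is left commented out:

```python
DELETE_MUTATION = (
    "mutation($input:DeleteDatasetExamplesInput!)"
    "{deleteDatasetExamples(input:$input){dataset{id}}}"
)

def build_delete_payload(example_ids):
    """Build the JSON body for the deleteDatasetExamples mutation."""
    return {
        "query": DELETE_MUTATION,
        "variables": {"input": {"exampleIds": list(example_ids)}},
    }

# The IDs are the base64 node IDs shown in the UI
payload = build_delete_payload(
    ["RGF0YXNldEV4YW1wbGU6OTk=", "RGF0YXNldEV4YW1wbGU6OTg="]
)
# import requests
# print(requests.post("http://localhost:6006/graphql", json=payload).json())
```

Keeping the payload construction separate from the request makes it easy to inspect the body before sending it to the server.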
But it sounds like you want to modify the existing examples? If so, here is an example of the query the UI issues for that:
response = requests.post(
    "http://localhost:6006/graphql",
    json={
        "query": "mutation($input:PatchDatasetExamplesInput!) {patchDatasetExamples(input: $input) {__typename}}",
        "variables": {
            "input": {
                "patches": [
                    {
                        "exampleId": "RGF0YXNldEV4YW1wbGU6OTc=",
                        "input": {"test": 123},
                        "output": {"test": 123},
                        "metadata": {},
                    }
                ]
            }
        },
    },
)
print(response.json())
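Since the goal is renaming columns across every example, you could generate the patch list programmatically instead of writing it by hand. A minimal sketch, assuming you already have each example's ID and its current input/output dicts; rename_keys and build_patch_payload are illustrative helpers, not Phoenix APIs, and the mutation string is the same one used above:

```python
PATCH_MUTATION = (
    "mutation($input:PatchDatasetExamplesInput!)"
    " {patchDatasetExamples(input: $input) {__typename}}"
)

def rename_keys(d, mapping):
    """Return a copy of d with keys renamed per mapping; unmapped keys are kept."""
    return {mapping.get(k, k): v for k, v in d.items()}

def build_patch_payload(examples, mapping):
    """examples: list of dicts with 'exampleId', 'input', 'output', 'metadata'."""
    patches = [
        {
            "exampleId": ex["exampleId"],
            "input": rename_keys(ex["input"], mapping),
            "output": rename_keys(ex["output"], mapping),
            "metadata": ex.get("metadata", {}),
        }
        for ex in examples
    ]
    return {"query": PATCH_MUTATION, "variables": {"input": {"patches": patches}}}

# Hypothetical column rename: "question" -> "prompt", "answer" -> "response"
examples = [
    {
        "exampleId": "RGF0YXNldEV4YW1wbGU6OTc=",
        "input": {"question": "2+2?"},
        "output": {"answer": "4"},
    }
]
payload = build_patch_payload(examples, {"question": "prompt", "answer": "response"})
# import requests
# print(requests.post("http://localhost:6006/graphql", json=payload).json())
```

One POST containing the full patch list keeps all the changes in a single mutation, which matters for the versioning question below.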
okay great, this is directionally correct! Just want to confirm: it looks like I can add all the updated examples via that list of example dicts, and once I post it to the server, it will generate a single new version id that I can reference?
Yes, but you should try it on a toy dataset first.
I just generated a new toy dataset, built the list of patches with the updated column names, and sent off the request. It indeed produced one version_id for all example changes in the post request. Will try it now on our real data :)
Roger Y. Was able to migrate a real dataset to the new format and use it for our downstream tasks! Thank you very much for your quick response and assistance 🙂
You're welcome. Thanks for using Phoenix!
