what’s the best way to save a dataset and load it back up from disk? I saved it like so:
```python
dataset_path = "dataset.json"
dataset = client.get_dataset(name=dataset_name)
df = dataset.as_dataframe()
df.to_json(dataset_path, index=False)
```

but I can’t load it back up from disk:
```
-> dataset = Dataset.from_dict(dataset)
(Pdb) dataset = Dataset.from_dict(dataset)
*** KeyError: 'dataset_id'
(Pdb) dataset.keys()
dict_keys(['input', 'output', 'metadata'])
```

I get KeyError: 'dataset_id'
The best way to save and reload a Phoenix dataset is to use the built-in Phoenix dataset persistence methods, not just saving the DataFrame to JSON. Phoenix datasets include metadata (like `id`, `version_id`, etc.) that is not present in a plain DataFrame export, so using `df.to_json()` and then trying to load with `Dataset.from_dict()` will fail due to missing required fields such as `dataset_id` (Phoenix Client Dataset class).

Instead, use the Phoenix client’s dataset save/load methods, or re-upload the DataFrame using `client.upload_dataset` if you want to persist and reload datasets. If you want to persist traces, use `trace_dataset.save(directory=...)` and `TraceDataset.load(id, directory=...)` (Phoenix Tracing Docs, Phoenix Docs: Save All Traces). There is no supported method to reconstruct a full Phoenix Dataset object from a plain DataFrame or JSON file alone.

Would you like a code example for saving and loading datasets or traces using the Phoenix client?

Sources:
- Phoenix Client Dataset class
- Phoenix Tracing Docs
- Phoenix Docs: Save All Traces
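A quick way to see why `Dataset.from_dict()` fails here: a plain DataFrame export carries only the column data, never the server-side identifiers Phoenix expects. A minimal sketch with a stand-in DataFrame (the column names mirror the thread's `dict_keys` output):

```python
import json

import pandas as pd

# Stand-in for the exported Phoenix dataset: to_json keeps only the
# columns, so server-side fields like dataset_id are gone after export.
df = pd.DataFrame({"input": ["q"], "output": ["a"], "metadata": [{}]})

records = json.loads(df.to_json(orient="records"))
print(sorted(records[0]))  # ['input', 'metadata', 'output'] -- no 'dataset_id'
```

This is exactly the shape `Dataset.from_dict()` then receives, hence the `KeyError: 'dataset_id'`.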
💡 Hint: Mention RunLLM in the thread for followups.
the client (/phoenix/session/client.py) does not have a save method. client.upload_dataset does not persist the dataset locally; it uploads it to the Phoenix server. My question is about persisting the dataset locally and loading it back up locally. The Dataset object (phoenix/experiments/types.py) does not have a .save method either RunLLM
You are correct: the Phoenix Python client (/phoenix/session/client.py) and the Dataset object do not provide a .save method for local persistence, only for uploading to the server. For local persistence, Phoenix supports saving and loading trace datasets (not general datasets) using the `TraceDataset.save(directory=...)` and `TraceDataset.load(id, directory=...)` methods, but there is no built-in method to persist a general Dataset object locally and reload it as a Phoenix Dataset (source, source).

If you need to persist a general dataset locally, you must serialize the DataFrame and schema yourself (e.g., using `df.to_parquet()` and saving the schema as JSON), then reconstruct the Dataset object manually by loading both and calling `Dataset(df, schema, name)`. There is no official Phoenix API for local dataset persistence and rehydration beyond this pattern.

Would you like a code example for this manual approach?

Sources:
- https://github.com/Arize-ai/phoenix/pull/2082
- https://github.com/Arize-ai/phoenix/issues/2196
You are correct: there is currently no public or official Phoenix API for saving and loading general Dataset objects locally; only trace datasets have this feature (source). For general datasets, the recommended manual approach is:

1. Save the DataFrame (e.g., as Parquet or CSV) and the schema (as JSON).
2. When reloading, reconstruct the DataFrame and schema, then create the Dataset object.

Example:

```python
import json

import pandas as pd
import phoenix as px

# Save
df.to_parquet("dataset.parquet")
with open("schema.json", "w") as f:
    json.dump(schema.__dict__, f)

# Load
df = pd.read_parquet("dataset.parquet")
with open("schema.json") as f:
    schema_dict = json.load(f)
schema = px.Schema(**schema_dict)
dataset = px.Dataset(df, schema, name="my_dataset")
```

This is a workaround due to the lack of a built-in API for local dataset persistence.

Sources:
- https://github.com/Arize-ai/phoenix/issues/2196
I didn’t try pickling but I certainly could. Thanks for the suggestion!
Sounds good! 🙏🏻 it will be handy
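For reference, the pickling route mentioned above can be as simple as the round-trip below. Whether the Phoenix Dataset object pickles cleanly across Phoenix and Python versions is an assumption, so this sketch uses a stand-in object and checks the round-trip before trusting the file:

```python
import pickle

# Stand-in for the Phoenix Dataset object from the thread; substitute the
# real object, but verify a round-trip first (pickles are sensitive to
# Python and library versions).
dataset = {"input": "q", "output": "a", "metadata": {}}

with open("dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)

with open("dataset.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == dataset  # round-trip sanity check
```

Unlike the Parquet-plus-schema approach, this keeps the whole object in one file, at the cost of portability across versions.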
hi Andrew S. this functionality is out for PR here: https://github.com/Arize-ai/phoenix/pull/8453
beautiful. I will use this for sure
