Hi Kenan, what kind of workflow do you have in mind? According to Gemini's documentation, it’s not possible to retrieve or view cached content. What is visible to the instrumentor is the server-generated cache resource name and the cached token count per request. See the attached screenshots.
and yes, large spans will be TOASTed
You can try deploying a collector that transforms the span, e.g. deleting the attribute when the span is not a root span, or when the span matches a certain name, and then re-exporting the span.
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - delete_key(span.attributes, "input.value") where IsRootSpan() == false

Are you using Postgres? If you know the specific key you’re searching for, you can try creating a specialized index to improve performance. For example:
CREATE INDEX CONCURRENTLY ix_spans_attributes_metadata_conversation_id
ON public.spans ((attributes #>> '{metadata,conversation_id}'::text[]))
WHERE (attributes #>> '{metadata,conversation_id}'::text[]) IS NOT NULL;

Sounds good. Please feel free to reach out if you have more questions.
Hi Kaito,

Thanks for the follow-up. Glad to see the retention policy is working correctly (54% of spans deleted, low dead tuple %). To answer your questions:

1. Recommended approach to reclaim the 147GB

It depends on your goal. If you just want to prevent the database from growing further, you don’t need to do anything: the deleted TOAST space is already marked as reusable in PostgreSQL’s Free Space Map, and new spans will fill it instead of extending the table. If you want to actually reduce storage costs, you’d need to run VACUUM FULL to shrink the table files, then migrate to a new Cloud SQL instance with a smaller disk (using Database Migration Service) or contact Google Support to request a disk shrink. Cloud SQL doesn’t automatically shrink provisioned storage when data is deleted.

2. Maintenance windows for self-hosted Phoenix

If you do run VACUUM FULL:
Schedule it right after your Sunday 7PM retention cleanup, when the table is smallest
Stop Phoenix ingestion before running it — the table is fully locked during the operation, so incoming traces will fail
You’ll need ~145GB of free disk space — VACUUM FULL creates a new copy of the table before dropping the old one
Expect 30-90 minutes for ~145GB of live data, possibly longer depending on disk I/O
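The steps above might look roughly like this in practice (a sketch only; the `public.spans` table name is assumed from your schema):

```sql
-- Sketch: run inside the maintenance window, AFTER stopping Phoenix ingestion.
-- VACUUM FULL takes an ACCESS EXCLUSIVE lock and writes a full new copy of the
-- table, so it needs roughly as much free disk as the live data (~145GB here).
VACUUM (FULL, VERBOSE) public.spans;

-- Refresh planner statistics after the rewrite.
ANALYZE public.spans;
```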
3. Alternatives that don’t require full table locks

Unfortunately not on Cloud SQL. pg_repack would be the standard solution for online table rewrites, but it’s not in Cloud SQL’s supported extensions list.

4. Preventing this from recurring

This will happen after every retention cleanup; PostgreSQL doesn’t automatically shrink files. Your options:
Run retention more frequently (daily or weekly) so deletions are smaller and space gets reused faster
Schedule periodic VACUUM FULL (e.g., quarterly) if you need to keep the table compact
Accept the bloat if the cost is manageable — as long as ingestion < deletion, the table won’t grow beyond its current size
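If you want to verify that the freed space really is being reused rather than the table growing, one option (assuming the contrib extension pg_freespacemap is available on your platform) is to sum what the Free Space Map reports as reusable:

```sql
-- Requires the contrib extension pg_freespacemap; availability varies by platform.
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Total space the Free Space Map reports as reusable in the spans heap.
SELECT pg_size_pretty(sum(avail)::bigint) AS reusable_space
FROM pg_freespace('spans');
```

If this number shrinks over time while total table size stays flat, new spans are filling the reclaimed space as expected.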
Thanks for running those queries, Kaito — this confirms what we suspected.
1. Reduce the retention period?
Reducing it to 30 days would cut it roughly in half. How far back do you typically need to look when debugging?
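To put a number on that estimate, a sampled count of spans older than 30 days would give a rough idea of the savings. This assumes the spans table has a `start_time` timestamp column; adjust the column name to your actual schema:

```sql
-- Rough, sampled estimate of how much a 30-day retention would remove.
-- Column name start_time is an assumption; adjust to your schema.
SELECT
  count(*) FILTER (WHERE start_time < now() - interval '30 days') AS spans_older_than_30d,
  count(*) AS total_spans
FROM spans TABLESAMPLE SYSTEM (1);
```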
2. Implement span attribute filtering before ingestion?
It might help to look at a few of your larger spans to see what’s taking up space. Common culprits are embedding vectors and large RAG documents in the input or output strings. OpenInference has options to hide these via environment variables — see https://arize.com/docs/phoenix/tracing/how-to-tracing/advanced/masking-span-attributes for the full list. If you need more fine-grained control (e.g., truncation), you can also implement a custom OpenTelemetry SpanProcessor.
3. Are there any compression options for TOAST storage?
PostgreSQL already compresses TOAST data (your 11KB is compressed). Switching from `pglz` to `lz4` might help slightly on new data, but unlikely to be a big win.
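If you do want to experiment with lz4, PostgreSQL 14+ lets you set the compression method per column. Note this only affects newly written values; existing rows keep pglz until they are rewritten:

```sql
-- PostgreSQL 14+ only; affects newly written values, not existing rows.
ALTER TABLE spans ALTER COLUMN attributes SET COMPRESSION lz4;
ALTER TABLE spans ALTER COLUMN events SET COMPRESSION lz4;

-- Check which compression method sampled values actually use.
SELECT pg_column_compression(attributes) AS method, count(*)
FROM spans TABLESAMPLE SYSTEM (1)
GROUP BY 1;
```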
Hi Kaito,

Thanks for the detailed report — this is really helpful context.

Initial Hypothesis

Looking at your numbers, one possibility is that the 277GB of “external storage” might actually be TOAST table data (where PostgreSQL stores large JSONB columns) rather than reclaimable dead space or index bloat. Let me explain why, and share some queries that can help us confirm.

Your spans table has two JSONB columns:
• attributes – span attributes (LLM prompts, completions, tool definitions, RAG context, etc.)
• events – span events

With ~1.4M traces and 2-10 spans per trace, that’s roughly 3-14 million spans. Depending on your use case, LLM spans can vary widely in size — a simple completion might be a few KB, while RAG applications with large context windows could reach 50-100KB per span.

If this is what’s happening, the 13GB you’re seeing as “table size” would only be the fixed-width columns (id, timestamps, integers) plus pointers to the TOAST data, with the actual attributes and events content stored separately. The diagnostic queries below should help us figure out if this is the case.

---

Diagnostic Queries

Please run these and share the results:

1. Complete Size Breakdown

This will show us exactly where the 277GB is — heap vs indexes vs TOAST. This is the key question.
SELECT
c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
pg_size_pretty(pg_indexes_size(c.oid)) AS indexes_size,
pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_heap_size,
pg_size_pretty(pg_indexes_size(c.reltoastrelid)) AS toast_index_size,
pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
ROUND(pg_relation_size(c.oid)::numeric / pg_total_relation_size(c.oid) * 100, 1) AS heap_pct,
ROUND(pg_indexes_size(c.oid)::numeric / pg_total_relation_size(c.oid) * 100, 1) AS indexes_pct,
ROUND((pg_relation_size(c.reltoastrelid) + pg_indexes_size(c.reltoastrelid))::numeric
/ pg_total_relation_size(c.oid) * 100, 1) AS toast_pct
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'spans'
AND n.nspname NOT IN ('pg_catalog', 'information_schema');

2. Span Count (Fast)

Tells us how many spans exist, so we can calculate average size per span.
SELECT
relname,
n_live_tup AS estimated_row_count
FROM pg_stat_user_tables
WHERE relname = 'spans';

3. Average JSONB Size (Sampled)

Shows how large your span attributes/events are on average. This samples ~1% of pages — safe to run on large tables.
SELECT
COUNT(*) AS sampled_rows,
pg_size_pretty(AVG(pg_column_size(attributes))::bigint) AS avg_attributes_size,
pg_size_pretty(AVG(pg_column_size(events))::bigint) AS avg_events_size,
pg_size_pretty((AVG(pg_column_size(attributes) + pg_column_size(events)) *
(SELECT n_live_tup FROM pg_stat_user_tables WHERE relname = 'spans'))::bigint) AS estimated_total_jsonb
FROM spans TABLESAMPLE SYSTEM (1);

4. Dead Tuples Check

Shows if there are dead rows that haven’t been vacuumed. High dead_pct (>10%) would mean something is blocking vacuum.
SELECT
relname,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'spans';

5. Replication Slots

Inactive replication slots can prevent vacuum from cleaning up — common on Cloud SQL.
SELECT
slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag
FROM pg_replication_slots;

---

Some Thoughts on Your Questions
Why would VACUUM FULL fail to reclaim the 277GB?
If the 277GB is live TOAST data (your actual span attributes/events), VACUUM FULL would have worked correctly — there’s just nothing to reclaim because it’s not dead space. The diagnostic queries will help us confirm.
Does Phoenix store trace data in a way that causes TOAST bloat?
Phoenix stores span attributes and events as JSONB, which PostgreSQL automatically moves to TOAST tables when large. This is standard PostgreSQL behavior. If your spans contain full LLM prompts/completions, the TOAST storage could legitimately be large.
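To see which columns of the spans table are even eligible for TOAST, you can inspect their storage mode in pg_attribute (storage mode 'x', EXTENDED, means values may be compressed and moved out of line):

```sql
-- 'x' = EXTENDED: values may be compressed and moved out of line into TOAST.
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'spans'::regclass
  AND attnum > 0
  AND NOT attisdropped
  AND attstorage = 'x';
```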
Is there a known issue with how Phoenix deletes old spans?
Not that we’re aware of. Phoenix uses standard ON DELETE CASCADE — when the retention policy deletes a trace, PostgreSQL automatically deletes all associated spans.
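If you want to double-check the cascade behavior on your own schema, the foreign-key delete rules are visible in pg_constraint (a confdeltype of 'c' means ON DELETE CASCADE):

```sql
-- Foreign keys defined on spans; confdeltype 'c' = ON DELETE CASCADE.
SELECT conname,
       confrelid::regclass AS referenced_table,
       confdeltype
FROM pg_constraint
WHERE conrelid = 'spans'::regclass
  AND contype = 'f';
```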
Could this be related to long-running transactions or replication slots?
It’s worth checking. The replication slots query above will show if there are any inactive slots with large WAL lag that could be preventing cleanup.

---

Next Steps

Would you be able to run these diagnostic queries and share the results? Also, what’s your current retention policy configuration? Once we see the breakdown, we’ll have a much clearer picture of what’s going on.
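On the long-running-transaction side, one more query worth running is a check against pg_stat_activity (the one-hour threshold here is arbitrary — pick whatever is unusual for your workload):

```sql
-- Transactions open longer than an hour can hold back vacuum's cleanup horizon.
SELECT pid, state, xact_start,
       now() - xact_start AS duration,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 hour'
ORDER BY xact_start;
```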
To help us develop more concrete ideas for improving the UI, could you share screenshots (with redactions if necessary) of the specific areas you’re currently having trouble with? Having visual examples will enable us to move forward more efficiently.
