Thanks for running those queries, Kaito — this confirms what we suspected.
1. Reduce the retention period?
Reducing the retention period to 30 days would cut the storage roughly in half. How far back do you typically need to look when debugging?
2. Implement span attribute filtering before ingestion?
It might help to look at a few of your larger spans to see what’s taking up space. Common culprits are embedding vectors and large RAG documents in the input or output strings. OpenInference has options to hide these via environment variables — see https://arize.com/docs/phoenix/tracing/how-to-tracing/advanced/masking-span-attributes for the full list. If you need more fine-grained control (e.g., truncation), you can also implement a custom OpenTelemetry SpanProcessor.
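If you end up needing truncation, here is a minimal sketch of what a custom SpanProcessor could look like. Treat it as an illustration under assumptions rather than a drop-in solution: the `TruncatingSpanProcessor` name and the 4,000-character cutoff are invented for the example, and it mutates the SDK’s internal `_attributes` mapping, which is not a public API and may change between releases.

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

MAX_ATTR_LEN = 4_000  # arbitrary cutoff chosen for this sketch


class TruncatingSpanProcessor(SpanProcessor):
    """Truncates oversized string attributes before spans are exported."""

    def on_start(self, span, parent_context=None) -> None:
        pass

    def on_end(self, span: ReadableSpan) -> None:
        # ReadableSpan attributes are nominally read-only; this reaches into the
        # underlying mapping, which works with current SDK versions but is not
        # part of the public API.
        attrs = getattr(span, "_attributes", None)
        if not attrs:
            return
        for key, value in list(attrs.items()):
            if isinstance(value, str) and len(value) > MAX_ATTR_LEN:
                attrs[key] = value[:MAX_ATTR_LEN] + "...[truncated]"

    def shutdown(self) -> None:
        pass

    def force_flush(self, timeout_millis: int = 30_000) -> bool:
        return True

Register it on the tracer provider before your BatchSpanProcessor, e.g. `tracer_provider.add_span_processor(TruncatingSpanProcessor())`, so truncation runs before spans are handed to the exporter.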
3. Are there any compression options for TOAST storage?
PostgreSQL already compresses TOAST data; the 11KB you measured is the compressed size. Switching from `pglz` to `lz4` might help slightly on new data, but it’s unlikely to be a big win.
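If you do want to try it, here is roughly what that looks like, assuming PostgreSQL 14 or newer (where per-column `lz4` compression and the `default_toast_compression` setting were added) and the spans table’s attributes/events columns. It only affects newly written values; existing TOAST data stays `pglz`-compressed unless rewritten.

-- Use lz4 for new values written to the JSONB columns (existing data is unchanged).
ALTER TABLE spans ALTER COLUMN attributes SET COMPRESSION lz4;
ALTER TABLE spans ALTER COLUMN events SET COMPRESSION lz4;

-- Alternatively, change the server-wide default applied when a column has no
-- explicit COMPRESSION setting. On managed services such as Cloud SQL,
-- ALTER SYSTEM is typically unavailable, so this would be set through the
-- provider's database flags instead.
-- ALTER SYSTEM SET default_toast_compression = 'lz4';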
Hi Kaito,

Thanks for the detailed report — this is really helpful context.

Initial Hypothesis

Looking at your numbers, one possibility is that the 277GB of “external storage” might actually be TOAST table data (where PostgreSQL stores large JSONB columns) rather than reclaimable dead space or index bloat. Let me explain why, and share some queries that can help us confirm.

Your spans table has two JSONB columns:
• attributes – span attributes (LLM prompts, completions, tool definitions, RAG context, etc.)
• events – span events

With ~1.4M traces and 2-10 spans per trace, that’s roughly 3-14 million spans. Depending on your use case, LLM spans can vary widely in size — a simple completion might be a few KB, while RAG applications with large context windows could reach 50-100KB per span.

If this is what’s happening, the 13GB you’re seeing as “table size” would only be the fixed-width columns (id, timestamps, integers) plus pointers to the TOAST data, with the actual attributes and events content stored separately. The diagnostic queries below should help us figure out if this is the case.

---

Diagnostic Queries

Please run these and share the results:

1. Complete Size Breakdown

This will show us exactly where the 277GB is — heap vs indexes vs TOAST. This is the key question.
SELECT
c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
pg_size_pretty(pg_indexes_size(c.oid)) AS indexes_size,
pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_heap_size,
pg_size_pretty(pg_indexes_size(c.reltoastrelid)) AS toast_index_size,
pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
ROUND(pg_relation_size(c.oid)::numeric / pg_total_relation_size(c.oid) * 100, 1) AS heap_pct,
ROUND(pg_indexes_size(c.oid)::numeric / pg_total_relation_size(c.oid) * 100, 1) AS indexes_pct,
ROUND((pg_relation_size(c.reltoastrelid) + pg_indexes_size(c.reltoastrelid))::numeric
/ pg_total_relation_size(c.oid) * 100, 1) AS toast_pct
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'spans'
AND n.nspname NOT IN ('pg_catalog', 'information_schema');

2. Span Count (Fast)

Tells us how many spans exist, so we can calculate average size per span.
SELECT
relname,
n_live_tup AS estimated_row_count
FROM pg_stat_user_tables
WHERE relname = 'spans';

3. Average JSONB Size (Sampled)

Shows how large your span attributes/events are on average. This samples ~1% of pages — safe to run on large tables.
SELECT
COUNT(*) AS sampled_rows,
pg_size_pretty(AVG(pg_column_size(attributes))::bigint) AS avg_attributes_size,
pg_size_pretty(AVG(pg_column_size(events))::bigint) AS avg_events_size,
pg_size_pretty((AVG(pg_column_size(attributes) + pg_column_size(events)) *
(SELECT n_live_tup FROM pg_stat_user_tables WHERE relname = 'spans'))::bigint) AS estimated_total_jsonb
FROM spans TABLESAMPLE SYSTEM (1);

4. Dead Tuples Check

Shows if there are dead rows that haven’t been vacuumed. High dead_pct (>10%) would mean something is blocking vacuum.
SELECT
relname,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'spans';

5. Replication Slots

Inactive replication slots can prevent vacuum from cleaning up — common on Cloud SQL.
SELECT
slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag
FROM pg_replication_slots;

---

Some Thoughts on Your Questions
Why would VACUUM FULL fail to reclaim the 277GB?
If the 277GB is live TOAST data (your actual span attributes/events), VACUUM FULL would have worked correctly — there’s just nothing to reclaim because it’s not dead space. The diagnostic queries will help us confirm.
Does Phoenix store trace data in a way that causes TOAST bloat?
Phoenix stores span attributes and events as JSONB, which PostgreSQL automatically moves to TOAST tables when large. This is standard PostgreSQL behavior. If your spans contain full LLM prompts/completions, the TOAST storage could legitimately be large.
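If it helps to see this directly, the query below shows which TOAST relation backs the spans table; its size should line up with the toast_heap_size from the first diagnostic query.

-- Look up the TOAST table that stores the out-of-line JSONB data for spans.
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'spans';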
Is there a known issue with how Phoenix deletes old spans?
Not that we’re aware of. Phoenix uses standard ON DELETE CASCADE — when the retention policy deletes a trace, PostgreSQL automatically deletes all associated spans.
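If you’d like to double-check that on your own database, the query below lists the foreign keys defined on spans and their delete behavior; a confdeltype of 'c' means ON DELETE CASCADE. It assumes the standard Phoenix schema where spans references traces.

-- List foreign keys on the spans table and how deletes propagate.
SELECT conname,
       confrelid::regclass AS referenced_table,
       confdeltype  -- 'c' = CASCADE, 'a' = NO ACTION, 'r' = RESTRICT
FROM pg_constraint
WHERE conrelid = 'spans'::regclass
  AND contype = 'f';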
Could this be related to long-running transactions or replication slots?
It’s worth checking. The replication slots query above will show if there are any inactive slots with large WAL lag that could be preventing cleanup.

---

Next Steps

Would you be able to run these diagnostic queries and share the results? Also, what’s your current retention policy configuration? Once we see the breakdown, we’ll have a much clearer picture of what’s going on.
To help us develop more concrete ideas for improving the UI, could you share screenshots (with redactions if necessary) of the specific areas you’re currently having trouble with? Having visual examples will enable us to move forward more efficiently.
The UI doesn’t have a timeout per se; it will wait as long as it needs to, so the problem may be something else. Can you tell us which page you were trying to access when you got the “Something went wrong” screen in the screenshot above? The screenshot points to a parsing issue, which means the server returned something for the UI to parse; if it were a timeout, the server would have had nothing to return. (We’ve seen UI timeouts reported before, but in those cases a proxy or load balancer was timing out the connection.)
Have you tried adding the instrumentor in your __init__.py file? For example:
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from openinference.instrumentation.google_adk import GoogleADKInstrumentor

from . import agent

# Send spans to your collector's OTLP HTTP traces endpoint.
endpoint = "http://YOUR_ENDPOINT/v1/traces"
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint)))

# Instrument Google ADK so agent activity is captured as spans.
GoogleADKInstrumentor().instrument(tracer_provider=tracer_provider)
This is a really cool idea, and we appreciate you taking the initiative. Before moving ahead with a PR, could you open a GitHub issue outlining the proposed image testing functionality and how you envision the workflow? That’ll help us align on scope, roadmap fit, and design considerations up front.
Thank you for bringing this to our attention. We really appreciate it!
OK. If the old data is deleted, it could just be that the new data is very large, or that PostgreSQL hasn’t vacuumed the dead tuples yet. You can check how many dead tuples there are and when autovacuum last ran with the following query.
SELECT
n.nspname AS schema,
c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS main_table_size,
pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_heap_size,
pg_size_pretty(pg_indexes_size(c.reltoastrelid)) AS toast_index_size,
pg_size_pretty(
pg_relation_size(c.reltoastrelid) + pg_indexes_size(c.reltoastrelid)
) AS toast_total_size,
ROUND(
(pg_relation_size(c.reltoastrelid) + pg_indexes_size(c.reltoastrelid))::numeric
/ NULLIF(pg_total_relation_size(c.oid), 0) * 100,
2
) AS toast_pct_of_total,
ct.relname AS toast_table_name,
COALESCE(st.n_live_tup, 0) AS live_toast_chunks,
COALESCE(st.n_dead_tup, 0) AS dead_toast_chunks,
COALESCE(
ROUND(st.n_dead_tup::numeric / NULLIF(st.n_live_tup + st.n_dead_tup, 0) * 100, 2),
0
) AS toast_dead_pct,
st.last_autovacuum AS toast_last_vacuum
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_class ct ON ct.oid = c.reltoastrelid
LEFT JOIN pg_stat_all_tables st ON st.relid = ct.oid
WHERE c.relkind = 'r'
AND c.reltoastrelid > 0
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
AND pg_total_relation_size(c.oid) > 10000000
ORDER BY pg_relation_size(c.reltoastrelid) + pg_indexes_size(c.reltoastrelid) DESC
LIMIT 20;