We regularly get timeouts querying traces in the UI when using metadata filters. We've turned data retention down to 15 days, but we still hit timeouts as soon as we expand beyond a 4-hour time window. We're doing about 2-4 million traces a week, so the load is fairly heavy, and the db is running on a t4g.large. We're planning to scale the db up further to see if that helps, but is there anything else we can do? It seems like it'd be ideal to partition the spans table somehow (ours is 407 GB). Or maybe something else?
Scaling up your database instance may help, but Phoenix UI timeouts with metadata filters and large spans tables are a known issue, especially at your data volume. Key bottlenecks include full table scans and inefficient queries on the spans table, which can reach hundreds of GB and tens of millions of rows. While Phoenix does not currently support table partitioning out-of-the-box, adding targeted indexes (e.g., on frequently filtered metadata fields, start_time, and project/trace IDs) has been shown to significantly improve query performance. For example, a compound index on (trace_rowid, parent_id, start_time DESC, id DESC) reduced query times from 15s to 1s in some cases, though results may vary depending on your query patterns and Postgres version. See this discussion for details and example index DDL.

Other recommended actions:
(1) Reduce the amount of data loaded in the UI by using root span filters, lazy loading, and smaller page sizes.
(2) Mask or truncate large span attributes to reduce row size.
(3) Regularly vacuum/analyze your database to reclaim space and update statistics.
(4) Consider sharding or partitioning at the Postgres level if your DB supports it, but this requires manual setup and is not natively managed by Phoenix.

For more, see the performance tuning suggestions, production guide, and indexing discussion. Would you like a step-by-step guide for adding indexes or optimizing your queries?

Sources:
- GitHub Issue #8420: Slow database queries in spans/traces overview
- GitHub Issue #4006: graphql queries become slow as data increases
- Phoenix Production Guide
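As a rough sketch of the compound index mentioned above (the column list comes from that discussion; the index name is arbitrary, so verify everything against your actual spans schema before running it):

-- Hypothetical DDL for the compound index described above; adjust names to your schema.
CREATE INDEX CONCURRENTLY ix_spans_trace_rowid_parent_start_time
ON public.spans (trace_rowid, parent_id, start_time DESC, id DESC);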
🔒[private user] help here?
Wondering if the query is loading everything into memory as well. Haven't tracked down the query yet, but this is what our memory does when we visit the dashboard
Metadata filtering hasn’t been optimized for querying on a database this large. If you have the ability to modify the database, you can try creating a specialized index to improve performance, provided you know the specific key you’re searching for. For example:
CREATE INDEX CONCURRENTLY ix_spans_attributes_metadata_conversation_id
ON public.spans ((attributes #>> '{metadata,conversation_id}'::text[]))
WHERE (attributes #>> '{metadata,conversation_id}'::text[]) IS NOT NULL;

The dashboard simultaneously executes several separate time-series queries grouped by time and other dimensions. The query planner is likely choosing a hash join on the trace table, which can lead to higher memory consumption during execution.
But i don’t have a precise explanation for the amount of memory that it is using.
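One way to dig into it (assuming you have direct psql access to the instance) is to grab the long-running query from pg_stat_activity while the dashboard is loading and then look at its plan, e.g.:

-- Find the queries the dashboard is currently running (standard pg_stat_activity fields):
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;

-- Then prefix the captured query with EXPLAIN (ANALYZE, BUFFERS) and run it manually;
-- a Hash Join node with many batches / high "Memory Usage" would support the theory above.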
Ok awesome, thanks! We'll try adding some indexes. Is there an environment variable to increase the timeouts in the dashboard? When we load data using phoenix-client, we set a higher timeout to avoid issues. Can't seem to find one for the UI
the UI doesn’t have a timeout per se. i think the timeout probably comes from your proxy or load balancer
Yep sorry. To clarify, query timeouts for the queries the dashboard is making
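One thing worth checking on the database side (just a guess about your setup, not something Phoenix configures for you) is whether a Postgres statement_timeout is what's cutting those queries off; it can be inspected and raised per role:

-- Check whether a statement_timeout applies to the current session/role (0 means disabled):
SHOW statement_timeout;

-- If one is set too low, raise it for the role Phoenix connects as
-- ('phoenix' is a placeholder role name; new connections pick up the change):
ALTER ROLE phoenix SET statement_timeout = '120s';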
