Scaling Phoenix for High Volume: Strategies to Handle 200M+ Spans and Large Data Sets
Hey! We've been using Phoenix in production for about six months. Is there anything folks are doing to enable larger scale? We peaked at about 200M spans and a little over 2TB, and at that point our Phoenix deployment was more or less non-functional. We're running 13.18 after a recent upgrade. As an attempt to resolve this, I pruned ~100M spans that were low value, and we already have data retention policies in place. Are there common approaches folks are using to run Phoenix at this kind of scale?

As an aside, I saw above that others also hit issues on larger spans tables during the 12->13 upgrade, with the migration taking out a lock on the spans table. I got through that by finding the CREATE INDEX statement that was attempting to run, scaling in all of the service tasks to remove write contention, and then scaling up maintenance_work_mem and the parallel worker count and executing the index creation manually.
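For anyone hitting the same migration lock, a rough sketch of that manual index build (assuming a Postgres backend; the index name and column here are hypothetical, so substitute the exact CREATE INDEX statement from the failed migration):

```sql
-- Session-level tuning for a one-off index build; values are illustrative
-- and should be sized to the instance's available memory and cores.
SET maintenance_work_mem = '4GB';
SET max_parallel_maintenance_workers = 8;

-- Hypothetical index definition: replace with the statement the
-- migration was attempting. CONCURRENTLY avoids a long write lock on
-- the spans table, though with all writers scaled in a plain build
-- works too. Note that CREATE INDEX CONCURRENTLY cannot run inside a
-- transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_spans_example
    ON spans (trace_rowid);
```

Once the index exists, re-running the migration should see it already in place (or you can mark the migration as applied, depending on how the migration tool tracks state).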
