Critical memory management 🧵
We are experiencing critical memory management issues with Arize Phoenix v11.24.1 in our production Kubernetes environment that are making it extremely challenging to run Phoenix reliably. Ever since we moved our product into production, our Phoenix deployment has performed badly at times. I have resized the pods many times and added more replicas, but I ended up rolling back to a single instance because two or more replicas produced duplicate database transactions and database locking issues. I don't know why, but sometimes memory just keeps accumulating until the pod is OOM-killed. Phoenix doesn't output any logs, and I don't see how to fix it. The pod has an 8000MiB memory request and a 12000MiB limit, but memory consumption keeps growing until OOM.

Our Phoenix instance exhibits an unusual memory consumption pattern: it maintains stable memory usage around 270-300MB for extended periods, then suddenly experiences massive spikes that never recover. Today we observed memory usage jump from 312MB to 11.7GB in just 90 minutes, with no corresponding increase in traffic or workload. This has forced us to repeatedly adjust our resource allocations. We initially ran with a 2GB memory limit but experienced 21 OOM-triggered restarts. We increased the limit to 8GB, which still proved insufficient. Today we had to increase it again to 12GB, and the pod is already consuming 11.5GB of that allocation. The memory growth appears to be a leak rather than cache or working memory, as the consumed memory is never released even during periods of low activity. The pod keeps consuming memory until it hits the limit and gets OOM-killed, requiring manual intervention and progressively larger memory allocations that are becoming unsustainable for our infrastructure.

We are running Phoenix as a StatefulSet in Kubernetes with a PostgreSQL backend on AWS RDS. The deployment is single-replica with both HTTP and gRPC endpoints exposed. Our environment: Phoenix v11.24.1 running on EKS with 2 CPU cores allocated.

This issue is blocking our ability to maintain Phoenix in production, as the ever-increasing memory requirements are not sustainable, and the lack of meaningful logs from Phoenix makes troubleshooting particularly challenging. We would appreciate any guidance on configuration changes, known memory leak issues in this version, recommended debugging steps, or ways to enable more detailed logging to help identify the root cause. Has anyone else experienced similar memory consumption patterns with Phoenix in production Kubernetes environments? Are there any known memory management issues with v11.24.1, database connection pooling problems, or recommended configurations for long-running production deployments that could address both the memory leak and the multi-replica database issues?
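For reference, the spike described above implies a very steep accumulation rate. A quick back-of-envelope in Python, using only the numbers quoted in this report (an approximation, not a measurement):

```python
# Back-of-envelope from the figures above: 312 MB -> 11.7 GB in 90 minutes.
spike_start_mb = 312
spike_end_mb = 11.7 * 1024          # ~11,980 MB
duration_min = 90

growth_rate = (spike_end_mb - spike_start_mb) / duration_min
print(f"growth rate: ~{growth_rate:.0f} MB/min")   # ~130 MB/min

# At that rate, each of the limits tried so far is exhausted quickly once a
# spike starts, which is consistent with the repeated OOM kills described above.
for limit_gb in (2, 8, 12):
    headroom_mb = limit_gb * 1024 - spike_start_mb
    print(f"{limit_gb} GB limit lasts ~{headroom_mb / growth_rate:.0f} min after a spike starts")
```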
Phoenix supports several strategies for managing memory in production. You can configure automatic trace retention and cleanup using the PHOENIX_DEFAULT_RETENTION_POLICY_DAYS environment variable, which sets how long traces are kept before being purged; this helps control memory and storage usage. The default is infinite retention, but you can set it to a specific number of days for automatic cleanup. Cleanup runs weekly, and you can also create custom retention policies per project via the UI. Manual deletion of traces and data is also supported through the UI and REST API.

For scaling, Phoenix is stateless and can be scaled out horizontally with multiple containers backed by a single Postgres database. Internally, 2GB RAM and 1 CPU are typical for hosted Phoenix, but you should monitor and scale resources as needed for higher workloads. Memory requirements depend on ingestion volume, retention settings, and the variety of labels/attributes in your data.

For advanced scenarios, you can disable external resource loading in air-gapped deployments, and Phoenix supports database backups and batch processing for efficient ingestion. There is no built-in syntax to save only the last X traces, but you can filter by time range when exporting traces. If you need to clear all data, use px.close_app(delete_data=True) in Python. For persistent high-volume workloads, monitor memory and disk usage, and adjust resources or retention policies accordingly.

Would you like more detail on any specific aspect?

Sources:
- Phoenix Production Guide
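To make the "filter by time range when exporting" suggestion concrete, here is a minimal sketch assuming the px.Client().query_spans(...) API and that PHOENIX_COLLECTOR_ENDPOINT points at the self-hosted instance; exact signatures can vary between Phoenix versions:

```python
# Minimal sketch: export only a bounded time window of spans rather than the
# whole table. Assumes PHOENIX_COLLECTOR_ENDPOINT points at the self-hosted
# Phoenix instance; API details may vary by version.
from datetime import datetime, timedelta, timezone

import phoenix as px
from phoenix.trace.dsl import SpanQuery

client = px.Client()  # reads PHOENIX_COLLECTOR_ENDPOINT from the environment

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)  # last 24 hours only

# Pull the spans for the window into a DataFrame and persist them before a
# retention policy (e.g. PHOENIX_DEFAULT_RETENTION_POLICY_DAYS=14) purges them.
spans_df = client.query_spans(SpanQuery(), start_time=start, end_time=end)
spans_df.to_parquet("phoenix_spans_last_24h.parquet")
```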
I don't know how to reproduce this, and I also don't know if we're doing anything wrong with our self-hosted setup. Let me know if you need any tests from my side. Thanks, Roger!
Are you using our Helm chart for deploying Phoenix? Could I see the configuration or pod definition?
Hey team, sorry for the delay in getting back to you! I took some time to run a thorough analysis with an AI assistant that helped me execute various commands against our Grafana monitoring and Aurora database to get a clearer picture of what's happening.

Regarding your question about data transfer rates: during the memory spike period (15:30-17:00 on Sept 8th), Aurora was receiving an average of 15-16 MB/minute from Phoenix, with peaks reaching 32-33 MB/minute. Write IOPS during the same period averaged 250-265 operations per second, with peaks over 550 IOPS. So we have about 15+ MB/minute of data flowing from Phoenix to the database during the memory leak periods, while Phoenix memory was growing at roughly 125 MB/minute. The data flowing into Phoenix comes from our main application sending traces and spans via gRPC.

We also noticed that Phoenix logs are extremely quiet - no errors, warnings, or memory-related messages even during the severe memory pressure periods. The container continues responding normally to requests even when consuming 11+ GB of memory. Currently our Phoenix pod is sitting at about 6.7GB memory usage out of our 12GB limit. The database connection patterns show Phoenix maintains a minimal footprint - typically just 1-3 active connections to Aurora at any given time. The database itself is around 171GB in size and has been growing very slowly. We have a 14-day retention policy set on Phoenix that runs every 2 hours.

cc Dustin N.
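As a side note on measuring this without screenshots: if the Grafana dashboards are backed by Prometheus/cAdvisor (an assumption about the monitoring stack), the per-pod memory growth rate can be pulled directly from the standard container_memory_working_set_bytes metric. A rough sketch, with the Prometheus URL and pod name pattern as placeholders:

```python
# Rough sketch: query a Prometheus backend (assumed to sit behind the Grafana
# dashboards) for the Phoenix pod's memory growth rate in MB/min, using the
# standard cAdvisor metric container_memory_working_set_bytes.
# PROM_URL and the pod regex are placeholders for your environment.
import requests

PROM_URL = "http://prometheus.monitoring:9090"  # placeholder
QUERY = (
    'deriv(container_memory_working_set_bytes{pod=~"phoenix-.*", container!=""}[30m])'
    " * 60 / 1024 / 1024"  # bytes/sec -> MB/min
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    pod = series["metric"].get("pod", "?")
    _, value = series["value"]
    print(f"{pod}: {float(value):.1f} MB/min")
```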
Screenshot of our dash. App health is good, memory is controlled, CPU too.
Thanks for sharing the update—good to know it’s stable for now. When you get a chance, can you share an updated screenshot of the Grafana dashboard tomorrow? I’m interested to see whether the memory has just plateaued and stayed there or if it has come down eventually. In your screenshot there’s a chart called “Throughput by Status Code (rpm)” which roughly corresponds to the shape of the CPU usage. I wonder if this is ingestion traffic. Since the increase in memory coincides with the doubling of requests, one hypothesis is that the amount of data being pushed into Phoenix exceeds the insertion rate for Postgres. Currently all that excess data would just be backing up inside Phoenix (since you mentioned “about 15+ MB/minute of data flowing from Phoenix to the database during the memory leak periods, while Phoenix memory was growing at roughly 125 MB/minute”), but eventually would be pushed into Postgres if the ingestion subsides.
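To put rough numbers on that hypothesis, here is a toy calculation using only the figures quoted in this thread, assuming a 1:1 byte accounting between buffered and flushed data (in-memory overhead will inflate this in practice):

```python
# Toy model of the backlog hypothesis: if spans arrive faster than Postgres can
# absorb them, the excess sits in Phoenix's process memory.
# Figures are the ones quoted in this thread; the 1:1 byte accounting is a
# simplifying assumption (in-memory representations are usually larger).

db_flush_mb_per_min = 15.5        # observed Aurora write throughput
memory_growth_mb_per_min = 125.0  # observed Phoenix memory growth

# Under the simple model: growth = ingestion - flush
implied_ingestion = memory_growth_mb_per_min + db_flush_mb_per_min
print(f"implied ingestion rate: ~{implied_ingestion:.0f} MB/min")  # ~140 MB/min

# If ingestion drops below what Postgres can absorb, the backlog stops growing
# and memory should plateau rather than keep climbing.
```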
Here's what happened overnight and this morning:

Night period (10pm-8am): Phoenix memory stabilized at ~235MB with very low traffic (4-20 requests/minute). Memory stayed completely flat during this low-traffic period - no growth at all.

Morning traffic spike (8am onwards): As soon as traffic increased (from ~20 to 60+ requests/minute at 8:30am), memory started climbing again. By 10am we had:
- Traffic: 82 requests/minute
- Memory: 8,497MB (from Grafana data)
- Current pod usage: 10,129MB (just checked kubectl)

Your hypothesis about data backing up inside Phoenix makes total sense. The pattern is crystal clear:
- Low traffic: memory stable at 235MB
- High traffic: memory grows continuously until OOM

The throughput pattern exactly matches the memory growth pattern. During peak hours yesterday (when we saw the 15+ MB/minute database writes), requests were around 60-85 RPM. When traffic dropped overnight, memory stabilized completely.
Thank you for the detailed update. This is a challenge with high-volume ingestion, and we understand the operational impact it creates. On our end we'll need to evaluate various strategies for preventing the OOM scenario.
