Hi guys, my application logs get a lot of these:
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to phoenix-collector.my-company.svc.cluster.local:4317, retrying in 1.84s.
Is this due to the changes made to improve memory management? Even though this happens, the health of the Phoenix container seems to be OK.
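For context, that warning looks like the one the OpenTelemetry OTLP gRPC exporter prints when the collector answers RESOURCE_EXHAUSTED and the SDK backs off and retries. A minimal sketch of how the exporter is wired on the application side, assuming the standard OpenTelemetry Python SDK (the endpoint is from the log above; the batching numbers are illustrative, not our exact config):

```python
# Minimal sketch: OTLP gRPC trace exporter pointed at the Phoenix collector.
# Batching values are illustrative; smaller batches put less pressure on the collector.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="phoenix-collector.my-company.svc.cluster.local:4317",
    insecure=True,   # in-cluster traffic, no TLS in this sketch
    timeout=10,      # seconds before an export attempt is abandoned
)

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        exporter,
        max_queue_size=2048,        # spans buffered in the app before dropping
        max_export_batch_size=512,  # smaller export batches per gRPC call
        schedule_delay_millis=5000,
    )
)
trace.set_tracer_provider(provider)
```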
Hey guys, sorry for the delay in getting back to you. I've just updated Phoenix to v11.35.0 and memory is fixed (Image). Thank you!!
Yesterday I separated our Phoenix into 2 instances: one for us to access the UI, and another acting solely as a collector. I thought our UI usage could have an impact, but checking now, that's not the case. Bringing this here just to add to our findings.
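For anyone curious how the split looks from the application side: the app only ever talks to the collector-only instance, while humans browse the UI instance. A rough sketch using the standard OTel environment variables (hostnames here are hypothetical, not our exact service names):

```python
# Sketch: point trace export at the collector-only Phoenix instance via
# standard OTel env vars. Hostnames are hypothetical placeholders.
import os

os.environ.setdefault(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "http://phoenix-collector.my-company.svc.cluster.local:4317",
)
os.environ.setdefault("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc")
# The UI instance (e.g. phoenix-ui.my-company.svc.cluster.local) receives no
# trace traffic; it only reads from the shared database.
```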
Here's what happened overnight and this morning:

Night period (10pm-8am): Phoenix memory stabilized at ~235MB with very low traffic (4-20 requests/minute). Memory stayed completely flat during this low-traffic period, no growth at all.

Morning traffic spike (8am onwards): As soon as traffic increased (from ~20 to 60+ requests/minute at 8:30am), memory started climbing again. By 10am we had:
- Traffic: 82 requests/minute
- Memory: 8,497MB (from Grafana data)
- Current pod usage: 10,129MB (just checked with kubectl)

Your hypothesis about data backing up inside Phoenix makes total sense. The pattern is crystal clear:
- Low traffic: memory stable at 235MB
- High traffic: memory grows continuously until OOM

The throughput pattern exactly matches the memory growth pattern. During peak hours yesterday (when we saw the 15+ MB/minute database writes), requests were around 60-85 RPM. When traffic dropped overnight, memory stabilized completely.
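In case it helps anyone line up the same correlation, this is roughly how I'm sampling pod memory so it can be compared against the request rate in Grafana. Pod name, namespace, and interval are placeholders, not our actual values:

```python
# Rough sketch: sample Phoenix pod memory every minute via `kubectl top pod`
# so it can be plotted next to request rate. Names below are placeholders.
import subprocess
import time
from datetime import datetime

POD = "phoenix-0"            # hypothetical pod name
NAMESPACE = "observability"  # hypothetical namespace
INTERVAL_SECONDS = 60

while True:
    out = subprocess.run(
        ["kubectl", "top", "pod", POD, "-n", NAMESPACE, "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    # `kubectl top pod --no-headers` prints: NAME CPU(cores) MEMORY(bytes)
    cpu, memory = out[1], out[2]
    print(f"{datetime.now().isoformat()} cpu={cpu} memory={memory}")
    time.sleep(INTERVAL_SECONDS)
```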
Screenshot of our dashboard. The app health is good, memory is under control, CPU too.
Hey team, sorry for the delay in getting back to you! I took some time to run a thorough analysis with an AI assistant that helped me execute various commands against our Grafana monitoring and Aurora database to get a clearer picture of what's happening.

Regarding your question about data transfer rates: during the memory spike period (15:30-17:00 on Sept 8th), Aurora was receiving an average of 15-16 MB/minute from Phoenix, with peaks reaching 32-33 MB/minute. Write IOPS during the same period averaged 250-265 operations per second, with peaks over 550 IOPS. So about 15+ MB/minute of data was flowing from Phoenix to the database during the memory leak periods, while Phoenix memory was growing at roughly 125 MB/minute. The data flowing into Phoenix comes from our main application sending traces and spans via gRPC.

We also noticed that the Phoenix logs are extremely quiet: no errors, warnings, or memory-related messages even during the severe memory pressure periods. The container keeps responding normally to requests even when consuming 11+ GB of memory. Currently our Phoenix pod is sitting at about 6.7GB of memory usage out of our 12GB limit.

The database connection patterns show Phoenix maintains a minimal footprint, typically just 1-3 active connections to Aurora at any given time. The database itself is around 171GB in size and has been growing very slowly. We have a 14-day retention policy set on Phoenix that runs every 2 hours. cc Dustin N.
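For reference, the connection-count and database-size numbers above came from straightforward Postgres queries against Aurora, roughly like this (connection details and database name are placeholders; any Postgres client would do):

```python
# Sketch of the checks behind the connection-count and size numbers above.
# Host, credentials, and database name are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="phoenix-aurora.example.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="phoenix",
    user="readonly_user",
    password="...",
)
with conn.cursor() as cur:
    # Active connections opened against the Phoenix database
    cur.execute(
        "SELECT count(*) FROM pg_stat_activity "
        "WHERE datname = %s AND state = 'active'",
        ("phoenix",),
    )
    print("active connections:", cur.fetchone()[0])

    # Total size of the Phoenix database
    cur.execute("SELECT pg_size_pretty(pg_database_size(%s))", ("phoenix",))
    print("database size:", cur.fetchone()[0])
conn.close()
```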
I don't know how to reproduce this, and I also don't know if we're doing anything wrong with our self-hosted setup. Let me know if you need any tests from my side. Thanks, Roger!
We are experiencing critical memory management issues with Arize Phoenix v11.24.1 in our production Kubernetes environment that are making it extremely challenging to run Phoenix reliably. Ever since we put our product into production, our Phoenix deployment has performed badly at times. I have resized the pods many times and added more replicas, but I ended up rolling back to a single instance because running 2 or more caused a lot of duplicate database transactions and database locking issues. I don't know why, but sometimes memory just keeps accumulating until OOM. The Phoenix application doesn't output logs, and I don't see how to fix it. The pod has an 8000MiB memory request and a 12000MiB limit, but memory consumption keeps growing until OOM.

Our Phoenix instance exhibits an unusual memory consumption pattern: it maintains stable memory usage around 270-300MB for extended periods, then suddenly experiences massive spikes that never recover. Today we observed memory usage jump from 312MB to 11.7GB in just 90 minutes, with no corresponding increase in traffic or workload.

This has forced us to repeatedly adjust our resource allocations. We initially ran with a 2GB memory limit but experienced 21 OOM-triggered restarts. We increased the limit to 8GB, which still proved insufficient. Today we had to increase it again to 12GB, and the pod is already consuming 11.5GB of that allocation. The memory growth appears to be a leak rather than cache or working memory, as the consumed memory is never released even during periods of low activity. The pod continues consuming memory until it hits the limit and gets OOM-killed, requiring manual intervention and progressively larger memory allocations that are becoming unsustainable for our infrastructure.

We are running Phoenix as a StatefulSet in Kubernetes with a PostgreSQL backend on AWS RDS. The deployment is single-replica with both HTTP and gRPC endpoints exposed. Our environment details include Phoenix v11.24.1 running on EKS with 2 CPU cores allocated.

This issue is blocking our ability to keep Phoenix in production, as the ever-increasing memory requirements are not sustainable. The lack of meaningful logs from Phoenix makes troubleshooting particularly challenging. We would appreciate any guidance on configuration changes, known memory leak issues in this version, recommended debugging steps, or ways to enable more detailed logging to help identify the root cause.

Has anyone else experienced similar memory consumption patterns with Phoenix in production Kubernetes environments? Are there any known memory management issues with v11.24.1, database connection pooling problems, or recommended configurations for long-running production deployments that could address both the memory leaks and the multi-replica database issues?
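In case it's useful to anyone debugging the same pattern, this is roughly how we confirm that the restarts really are OOM kills rather than crashes, using the Kubernetes Python client (namespace and pod name are placeholders):

```python
# Rough sketch: confirm restarts are OOM kills by reading the container's
# last terminated state. Namespace and pod name are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="phoenix-0", namespace="observability")
for status in pod.status.container_statuses or []:
    last = status.last_state.terminated
    if last is not None:
        print(
            f"{status.name}: restarts={status.restart_count}, "
            f"last termination reason={last.reason}, exit code={last.exit_code}"
        )
```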
