Hello everyone. We are using Phoenix with Postgres Aurora RDS, but we are experiencing slowness when querying a long period in the project. Is there anything we can use to analyze query performance? Tks
To analyze query performance in Phoenix, especially when using Postgres Aurora RDS, you can consider the following steps:
1. Database profiling: Enable query logging in Postgres to capture slow queries. This can help identify which queries are taking longest to execute. You can also use the pg_stat_statements extension to track query performance and identify bottlenecks.
2. Phoenix performance improvements: According to GitHub Issue #4006, Phoenix has implemented improvements such as lazy loading of spans and reduced page sizes, which can help with performance issues. Ensure you are using the latest version of Phoenix to benefit from these enhancements.
Would you like more details on how to implement these improvements or further optimize your setup?
Sources:
- GitHub Issue #4006
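For step 1, a minimal sketch of what that looks like from psql. This assumes connectivity via the standard `PG*` environment variables; on RDS/Aurora you may also need `pg_stat_statements` in `shared_preload_libraries` via your parameter group, and the `total_exec_time` column name assumes pg_stat_statements 1.8+ (older versions call it `total_time`):

```shell
# Hedged sketch: register pg_stat_statements and list the slowest statements.
top_queries_sql() {
  cat <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by total execution time.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
SQL
}

if command -v psql >/dev/null 2>&1 && [ -n "${PGHOST:-}" ]; then
  top_queries_sql | psql   # run against the live database
else
  top_queries_sql          # no DB configured: print the SQL to run by hand
fi
```

Sorting by `mean_exec_time` instead can surface individually slow queries that don't run often enough to dominate the totals.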
The version is 8.14.1. I'm still setting up OTel and Prometheus for analysis. If anyone has a Grafana dashboard to import, it would be helpful for error analysis as well.
Error 500 gateway timeout
Hmm, sorry to hear that. Is it possible for you to increase the idle timeout on your ALB/ELB a bit? It looks like it's terminating the connection.
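For reference, the ALB idle timeout defaults to 60 seconds and can be raised via the AWS CLI. A hedged sketch: the `LB_ARN` value is a placeholder for your load balancer's ARN, 300 seconds is an arbitrary example, and `AWS_EXECUTE` is just a dry-run guard for this snippet:

```shell
# Hedged sketch: raise the ALB idle timeout so long-running Phoenix queries
# are not cut off mid-response. LB_ARN below is a placeholder.
LB_ARN="${LB_ARN:-arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/phoenix/EXAMPLE}"
cmd=(aws elbv2 modify-load-balancer-attributes
     --load-balancer-arn "$LB_ARN"
     --attributes Key=idle_timeout.timeout_seconds,Value=300)

if [ "${AWS_EXECUTE:-0}" = "1" ]; then
  "${cmd[@]}"                  # actually apply the change
else
  echo "DRY RUN: ${cmd[*]}"    # set AWS_EXECUTE=1 to run for real
fi
```

Note this only moves the ceiling; if queries regularly run for minutes, the query itself is the thing to fix.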
Any sort of details would be very helpful: trace count per hour trace depth span payload size using langgraph
We're working on retention policies, but if you don't care about your old data you could also purge some of it to make the queries less expensive.
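A hedged sketch of what batched purging could look like, so the delete itself doesn't overload the DB. The table name `spans`, the column `start_time`, the `id` key, and the 30-day cutoff are all assumptions here; check Phoenix's actual schema (and any foreign keys or cascades) before running anything like this, and take a backup first:

```shell
# Hedged sketch: delete old rows in small batches rather than one huge DELETE.
# All identifiers below are assumed, not Phoenix's confirmed schema.
purge_sql() {
  cat <<'SQL'
WITH doomed AS (
  SELECT id
  FROM spans                                   -- assumed table name
  WHERE start_time < now() - interval '30 days'
  LIMIT 10000                                  -- small batch per pass
)
DELETE FROM spans WHERE id IN (SELECT id FROM doomed);
SQL
}

if command -v psql >/dev/null 2>&1 && [ -n "${PGHOST:-}" ]; then
  purge_sql | psql   # re-run until 0 rows are deleted
else
  purge_sql          # no DB configured: just print the statement
fi
```

Batching keeps each transaction short, which avoids long lock holds and lets autovacuum keep up between passes.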
I'm also experiencing performance issues with our self-hosted Phoenix instance when querying data beyond a 6-hour interval. I would like to analyze more historical data (e.g., the last 12 hours), but query execution becomes unresponsive when the time window exceeds approximately 6 hours. To investigate, I created a script (test_phoenix_db_perf.sh, attached) that incrementally queries intervals from 1 to 12 hours, capturing metrics like execution time, rows returned, and planning time. The query consistently executes quickly up to 6 hours but completely stalls at 7 hours. I captured this query in the AWS Top SQL tab.

Example results:
- 1 hour interval: execution ~358 ms
- 3 hour interval: execution ~787 ms
- 6 hour interval: execution ~1602 ms
- 7 hour interval: query does not complete

Is there a known reason or configuration limiting query performance beyond a certain time interval? Any suggestions or insights to resolve this issue would be greatly appreciated. Thanks!
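For anyone who wants to reproduce this, here is a rough reconstruction of what such a sweep script might look like. The table `spans`, the column `start_time`, and the overall query shape are guesses on my part, not the actual attached script; substitute the exact statement captured from the AWS Top SQL view:

```shell
# Hedged reconstruction of an interval-sweep timing script in the spirit of
# test_phoenix_db_perf.sh. Table/column names below are assumptions.
run_sweep() {
  local hours sql
  for hours in $(seq 1 12); do
    echo "----------------------------------------"
    echo "Testing query with ${hours} hour(s) interval..."
    sql="EXPLAIN (ANALYZE) SELECT * FROM spans WHERE start_time > now() - interval '${hours} hours' ORDER BY start_time DESC LIMIT 100;"
    if command -v psql >/dev/null 2>&1 && [ -n "${PGHOST:-}" ]; then
      psql -c "${sql}"         # real run: prints planning and execution time
    else
      echo "DRY RUN: ${sql}"   # no DB configured: just show the statement
    fi
  done
}

run_sweep
```

Running `EXPLAIN (ANALYZE)` on the exact captured query at 6 vs. 7 hours is the key diagnostic: a sudden cliff like this often means the planner switched plans (e.g., from an index scan to a sequential scan) once the estimated row count crossed a threshold.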
As a quick follow-up, I managed to obtain the query performance results for the 7- and 8-hour intervals, but the execution time increased drastically: several orders of magnitude higher compared to the 5-6 hour intervals. I'm sharing the specific results below for reference. Clearly, there's significant performance degradation after the 6-hour mark. Could you please help us investigate why performance deteriorates so severely with these slightly longer intervals, and how we might optimize or resolve this issue? Thank you!
----------------------------------------
Testing query with 5 hour(s) interval...
Hours: 5
Total script execution time: 1409.886950000 ms
Rows returned: 100
Planning time: 1.852 ms
Execution time: 1328.960 ms
----------------------------------------
Testing query with 6 hour(s) interval...
Hours: 6
Total script execution time: 1676.035184000 ms
Rows returned: 100
Planning time: 1.878 ms
Execution time: 1602.235 ms
----------------------------------------
Testing query with 7 hour(s) interval...
Hours: 7
Total script execution time: 262886.238704000 ms
Rows returned: 100
Planning time: 1.899 ms
Execution time: 262796.189 ms
----------------------------------------
Testing query with 8 hour(s) interval...
Hours: 8
Total script execution time: 298920.754487000 ms
Rows returned: 100
Planning time: 1.851 ms
Execution time: 298714.283 ms
----------------------------------------
Testing query with 9 hour(s) interval...

I'm also using version 8.14.1
Hey Arthur M., sorry you are experiencing slowness. I'd definitely love to help you alleviate some of these bottlenecks. However, I'm not sure I quite follow your script. It looks like you are querying the DB directly, and depending on your access patterns, the volume scanned will grow with the window you query. If you are trying to pull large volumes directly from the DB, you would definitely want to do this in chunks so as not to overload it; the app itself paginates to alleviate the load.

Was your investigation rooted in slowness on a specific view in the platform? Unfortunately this is a multi-dimensional problem that depends on your payload sizes and data volume. If you could provide me with some more clarifying details, such as trace count per hour, trace depth, and span payload size, it would give me and the team a better sense of the workload you are dealing with and whether your system is well provisioned to handle it. Happy to jump on a call (maybe next week) if you are open to us taking a closer look as well. Thanks for the report, as always.
Oh, I see you have the LIMIT 100 on there. Sorry, it's been a busy day. I might need to digest this a little longer.
Good evening. I increased the timeout to 20 minutes within the ingress in K8s, which fixed the timeouts for queries going back up to 12 hours. However, it's still very slow due to the connection and data volume. I believe that creating separate projects and segregating the data would reduce this slowness. Using GraphQL with an async response would also help. Thank you for your support, Mikyo.
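For others hitting the same 504s, a sketch of that ingress change. It assumes an ingress-nginx controller and an Ingress resource named `phoenix` (both assumptions; other controllers use different annotation keys):

```shell
# Hedged sketch: raise proxy read/send timeouts on an ingress-nginx Ingress
# to 20 minutes (1200 s). The Ingress name "phoenix" is an assumption.
annotations=(
  "nginx.ingress.kubernetes.io/proxy-read-timeout=1200"
  "nginx.ingress.kubernetes.io/proxy-send-timeout=1200"
)

if command -v kubectl >/dev/null 2>&1; then
  kubectl annotate ingress phoenix "${annotations[@]}" --overwrite
else
  echo "Would run: kubectl annotate ingress phoenix ${annotations[*]} --overwrite"
fi
```

Setting the annotations in the Ingress manifest itself is the more durable option, since `kubectl annotate` changes can be overwritten by the next deploy.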
Hey, thanks for the quick response! I'll gather the requested numbers tomorrow and get back to you shortly after extracting these metrics directly from our DB. To clarify my investigation further: I'm executing this query directly because we initially experienced slowness within the Phoenix platform itself, especially when selecting data beyond 12 hours in the interface. This prompted me to dive deeper to understand the performance bottleneck. Also worth mentioning: the query in my script is exactly the one Phoenix is running internally. I captured it directly from the AWS Top SQL view, so it accurately represents what the Phoenix app executes against our database. Having a call next week sounds great; let's coordinate that once I have the metrics. I'll follow up soon. Thanks again for looking into this!
