Ok awesome, thanks! We'll try adding some indexes.
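For reference, this is roughly the first index we're planning to try, on the JSONB attributes backing our metadata filters. The table and column names are my assumptions about the Phoenix Postgres schema, so we'll verify against the actual tables before running anything:

```python
# Sketch of the index we're planning to try. "spans" / "attributes" are
# assumptions about the Phoenix schema; verify before running. CONCURRENTLY
# avoids locking ingest while the index builds, but it can't run inside a
# transaction, hence autocommit.
import psycopg

DSN = "postgresql://phoenix:***@db-host:5432/phoenix"  # placeholder

with psycopg.connect(DSN, autocommit=True) as conn:
    conn.execute(
        """
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_spans_attributes_gin
        ON spans USING GIN (attributes jsonb_path_ops)
        """
    )
```

jsonb_path_ops only helps containment (@>) lookups; if the metadata filter ends up extracting a specific key with ->>, an expression index on that key would probably be a better fit.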
Is there an environment variable to increase the timeouts in the dashboard? When we load data using phoenix-client, we set a higher timeout to avoid issues. Can't seem to find one for the UI
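For context, this is roughly what we do with phoenix-client today. The exact timeout kwarg is how our installed version exposes it, so treat the argument name as an assumption rather than a documented guarantee:

```python
# Bump the per-request timeout when pulling spans with phoenix-client.
# The `timeout` kwarg is an assumption about our installed client version;
# check the query_spans signature in your version.
import phoenix as px
from phoenix.trace.dsl import SpanQuery

client = px.Client()  # picks up PHOENIX_COLLECTOR_ENDPOINT if it's set
query = SpanQuery().where("span_kind == 'LLM'")
spans_df = client.query_spans(query, timeout=120)  # seconds, well above the default
```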
Wondering if the query is loading everything into memory as well. Haven't tracked down the exact query yet, but this is what our memory usage does when we visit the dashboard
We regularly get timeouts querying traces in the UI when using metadata filters. We've turned data retention down to 15 days, but if we expand the time range beyond about 4 hours we still get timeouts. We're doing about 2-4 million traces a week, so it's a fairly heavy load. The DB is running on a t4g.large
We're planning to scale up the DB further to see if it helps, but is there anything else we can do? It seems like it would be ideal to partition the spans table somehow. Our spans table is 407 GB
Or maybe something else?
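For what it's worth, this is the kind of time-based partitioning I was imagining, purely as a sketch. The names (spans, start_time) are assumptions about the Phoenix schema, an existing 407 GB table can't be converted in place (it would need a create/backfill/swap migration), and Phoenix owns its own migrations, so this may not be something we can safely do ourselves:

```python
# Illustrative only: native Postgres range partitioning by span start time.
# "spans" / "start_time" are assumptions about the Phoenix schema. Constraints
# and indexes aren't copied onto the parent here because any primary key or
# unique index on a partitioned table must include the partition key.
import psycopg

DSN = "postgresql://phoenix:***@db-host:5432/phoenix"  # placeholder

STATEMENTS = [
    """
    CREATE TABLE spans_partitioned (LIKE spans INCLUDING DEFAULTS)
    PARTITION BY RANGE (start_time)
    """,
    """
    CREATE TABLE spans_2025_06 PARTITION OF spans_partitioned
        FOR VALUES FROM ('2025-06-01') TO ('2025-07-01')
    """,
    """
    CREATE TABLE spans_2025_07 PARTITION OF spans_partitioned
        FOR VALUES FROM ('2025-07-01') TO ('2025-08-01')
    """,
]

with psycopg.connect(DSN) as conn:
    for ddl in STATEMENTS:
        conn.execute(ddl)
```

The appeal is that old partitions could be dropped outright instead of relying on row-level retention deletes, which would be much cheaper at this volume.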
My assumption is that incoming batches are queued in memory for processing and aren't being processed fast enough, leading to this backup. But maybe the problem is elsewhere?
Ah ok, interesting. We're having ingest problems where large volumes of spans cause our instance to become unavailable. We see memory max out when this happens (4 GB / 2 vCPU)
The production guide mentions horizontal scaling, which is good, but without much other detail. Are there any gotchas with running more than one instance at a time? Assuming it's safe, but hoping there's a pre-existing answer to this
Yea pretty much for grouping.
The way we run some of our evals is that we load all the LLM inputs/outputs and tool inputs/outputs and concat them into one large string for evaluation. So typically we'll find the trace and grab all the spans under it, which naturally ends up grouping them.
For now, I've just updated our instrumentation to add a metadata field I can grab all the spans with, and then I group them by session ID. This seems to work, but it would be awesome to do it straight off the root trace ID.
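For reference, the workaround looks roughly like this. The metadata key and the session attribute name are our own conventions, and the filter/select syntax should be checked against your phoenix version:

```python
# Sketch of the workaround: pull the spans tagged with our custom metadata
# key, group them by session id, and concat inputs/outputs into one string
# per group for the eval. "eval_run" and "session.id" are our own naming
# assumptions, not Phoenix built-ins.
import phoenix as px
from phoenix.trace.dsl import SpanQuery

client = px.Client()

query = (
    SpanQuery()
    .where("metadata['eval_run'] == 'nightly'")  # our custom tag; check filter syntax
    .select(
        session_id="session.id",
        input="input.value",
        output="output.value",
    )
)
spans_df = client.query_spans(query)

# One large string per session, fed into the eval.
eval_inputs = (
    spans_df.fillna("")
    .groupby("session_id")
    .apply(lambda g: "\n".join(g["input"] + "\n" + g["output"]))
)
```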