Hey Arize team! We encountered an error today that tripped us up for a while. Our Postgres connection was failing, but the /healthz endpoint was still returning OK, so we didn't report that the DB connection was the underlying issue. Would it be possible to verify DB connectivity on the healthz endpoint? Perhaps just run a SELECT 1 against the database?
Sorry to hear you were hung up on this for a while. Roger Y. or Dustin N., do you think there are any negative implications for something like this? On the one hand, it may somewhat blur the line between service responsibilities, but on the other, you do need a database connection for Phoenix to work properly.
latency might be an issue, maybe we can make it an optional check or supply a more comprehensive health check
Right now, the only thing the health endpoint checks is that Python has not died, not that the app is functional, and to my understanding the latter is what /healthz is typically meant for. I’d like to see the DB check enabled by default, since the app can't work without a database, but whichever direction it goes, having an option is crucial for us as we deploy more instances of Phoenix 🙂
we could add a readiness endpoint
/readyz?
In previous roles deploying to k8s, we had 2 endpoints: one for readiness (is initial startup done?) and one for health (is the app still functional?)
These connections can fall apart after app startup, so teams I worked with made a point of testing the db connection in both endpoints, so that our apps could be automatically bounced in an attempt to correct underlying issues. I’m curious what is driving the hesitation for a very lightweight call like SELECT 1 in the hot path here.
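For reference, here is a rough sketch of the kind of SELECT 1 probe being discussed. It is not Phoenix's actual code: `check_database` is a made-up name, and the stdlib `sqlite3` module stands in for Postgres so the snippet runs anywhere. It also times the call, which speaks to the latency concern above.

```python
import sqlite3
import time


def check_database(conn):
    """Hypothetical probe: run SELECT 1 and return (ok, latency_ms)."""
    start = time.perf_counter()
    try:
        row = conn.execute("SELECT 1").fetchone()
        ok = row == (1,)
    except sqlite3.Error:
        # Any driver-level failure (closed or broken connection) counts as down.
        ok = False
    latency_ms = round((time.perf_counter() - start) * 1000, 2)
    return ok, latency_ms


# sqlite3 is only a stand-in here; a real deployment would use its Postgres driver.
conn = sqlite3.connect(":memory:")
healthy = check_database(conn)   # connection is open
conn.close()
broken = check_database(conn)    # connection is closed, so the probe fails
```

With a real Postgres driver the exception type and the connection handling would differ, so treat the `except` clause as a placeholder.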
ok. maybe something like
HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "ready",
  "checks": {
    "database": { "status": "up", "latency_ms": 12 }
  }
}
I think the most important piece is that the status code is not a 2xx status code when the database is down
ok. maybe something like 503
A 503 would be very fitting 🙂
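To make the 200 vs 503 idea concrete, here is a small sketch of how a readiness handler could map the database check result to a status code and a JSON body shaped like the one proposed above. The function name `readiness_response` and the "not_ready" / "down" values are assumptions for illustration, not anything Phoenix defines.

```python
import json
from http import HTTPStatus


def readiness_response(db_ok, latency_ms=None):
    """Hypothetical helper: return (status_code, json_body) for a readiness check."""
    database = {"status": "up" if db_ok else "down"}
    if latency_ms is not None:
        database["latency_ms"] = latency_ms
    body = {
        "status": "ready" if db_ok else "not_ready",
        "checks": {"database": database},
    }
    # 200 only when the database answered; 503 Service Unavailable otherwise.
    status = HTTPStatus.OK if db_ok else HTTPStatus.SERVICE_UNAVAILABLE
    return int(status), json.dumps(body)


up_code, up_body = readiness_response(True, latency_ms=12)
down_code, down_body = readiness_response(False)
```

A k8s probe only looks at the status code, so the body is mainly there for humans and dashboards.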
Writing an issue, will post momentarily
I’m even happy to help contribute if the team doesn’t have the bandwidth for this
