graph-node: Probe doesn't detect provider failure

## Issue
Graph node was down after Graph node database server upgraded. During the upgrade, db connection was lost for a minute in the graph node.
I was able to fetch following error logs from graph node.
```sh
Jul 12 05:58:20.416 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:20.441 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: store_events, component: NotificationListener
Jul 12 05:58:20.623 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: store_events, component: NotificationListener
Jul 12 05:58:20.639 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:21.634 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: store_events, component: NotificationListener
Jul 12 05:58:21.647 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:59:02.699 ERRO Postgres connection error, error: terminating connection due to administrator command, pool: main, shard: primary, component: ConnectionPool
```
The problem is it didn't retry to connect db and remained failed status for over 10 mins. So we had to reboot graph node  manuallly.

## Expectation
We have an alerting channel which fires alert if `eth_rpc_status{provider="xxx"}` metric is not equal to 0. So we received the alerts immediately but expected it to be resolved auto after a few mins of db upgrade finish.
If the liveneesprobe and readiness probe can detect this kind of issue properly so the pod can be recreated automatically, it will be perfect.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

graph-node: Probe doesn't detect provider failure #417

Issue

Expectation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

graph-node: Probe doesn't detect provider failure #417

Description

Issue

Expectation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions