This repository was archived by the owner on Mar 3, 2026. It is now read-only.

graph-node: Probe doesn't detect provider failure #417

@josedev-union

Issue

Graph node went down after its database server was upgraded. During the upgrade, graph node lost its db connection for about a minute.
I was able to fetch the following error logs from graph node.

Jul 12 05:58:20.416 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:20.441 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: store_events, component: NotificationListener
Jul 12 05:58:20.623 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: store_events, component: NotificationListener
Jul 12 05:58:20.639 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:21.634 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: store_events, component: NotificationListener
Jul 12 05:58:21.647 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:59:02.699 ERRO Postgres connection error, error: terminating connection due to administrator command, pool: main, shard: primary, component: ConnectionPool

The problem is that it did not retry connecting to the db and remained in a failed state for over 10 minutes, so we had to restart graph node manually.

Expectation

We have an alerting channel that fires an alert if the eth_rpc_status{provider="xxx"} metric is not equal to 0. We received the alerts immediately, but expected them to resolve automatically within a few minutes after the db upgrade finished.
If the liveness probe and readiness probe could detect this kind of issue so that the pod is recreated automatically, that would be perfect.
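
To illustrate the kind of check being asked for, here is a minimal sketch of a script that an exec-style liveness probe could run. It applies the same rule as our alert: any eth_rpc_status sample that is not 0 counts as a failure. The metrics URL, the port, and the function names are assumptions made for this example and are not taken from the chart or from graph-node itself. Whether this metric is the right signal for a lost db connection is also only based on what we observed during this incident.

```python
import re
import sys
import urllib.request

# Assumed location of the Prometheus metrics endpoint; adjust for your deployment.
METRICS_URL = "http://localhost:8040/metrics"

# Matches exposition lines such as: eth_rpc_status{provider="xxx"} 1
_SAMPLE = re.compile(r'^eth_rpc_status\{([^}]*)\}\s+(\S+)')
_PROVIDER = re.compile(r'provider="([^"]*)"')


def failing_providers(metrics_text):
    """Return the providers whose eth_rpc_status sample is not 0."""
    failing = []
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comment line or a different metric
        labels, value = match.groups()
        if float(value) != 0.0:
            provider = _PROVIDER.search(labels)
            failing.append(provider.group(1) if provider else "unknown")
    return failing


def main():
    """Return 0 if every provider reports 0, otherwise 1 (probe failure)."""
    try:
        with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
            text = resp.read().decode("utf-8")
    except OSError as exc:
        # An unreachable metrics endpoint is treated as a failed probe.
        print(f"metrics endpoint unreachable: {exc}", file=sys.stderr)
        return 1
    bad = failing_providers(text)
    if bad:
        print("eth_rpc_status non-zero for: " + ", ".join(bad), file=sys.stderr)
        return 1
    return 0

# In a probe script, finish with: sys.exit(main())
```

A probe built on this would need a failure threshold long enough to ride out a short db upgrade, otherwise the pod could be restarted while graph node is still able to recover on its own.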
