
Manual recovery after "gcs connection failed" with one statefulset pod gone #31

Open

@solsson

On preemptible nodes we had one instance of manual recovery, after #30. There was no mariadb-1 pod, and -0 and -2 stayed crashlooping. They were past init, but the mariadb containers exited after:

2020-05-27  6:10:26 140050350966464 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50176S), skipping check
2020-05-27  6:10:55 140050350966464 [Note] WSREP: view((empty))
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
	 at gcomm/src/pc.cpp:connect():158
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'my_wsrep_cluster' at 'gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb': -110 (Connection timed out)
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: gcs connect failed: Connection timed out
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: wsrep::connect(gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb) failed: 7
2020-05-27  6:10:55 140050350966464 [ERROR] Aborting

This could be a case for switching the podManagementPolicy from OrderedReady to Parallel.
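
A minimal sketch of what that change would look like in the StatefulSet spec; podManagementPolicy and its Parallel/OrderedReady values are standard Kubernetes, while the surrounding names are assumed from the pod names in the log above:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mariadb
spec:
  serviceName: mariadb
  replicas: 3
  # Parallel launches and terminates all pods at once instead of one by one,
  # so crashlooping peers could come up together and form a primary view.
  podManagementPolicy: Parallel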

The solution was to scale down to zero and then back up to three. Oddly, the pods wouldn't go away at scale-to-zero, so I had to delete mariadb-2 manually. Is that the expected behavior for OrderedReady?
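
For reference, the recovery sequence was roughly the following, assuming the StatefulSet is named mariadb in the current namespace:

kubectl scale statefulset mariadb --replicas=0
# mariadb-2 did not terminate on its own, so it had to be deleted by hand
kubectl delete pod mariadb-2
kubectl scale statefulset mariadb --replicas=3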
