Conversation

@0ffer
Contributor

@0ffer 0ffer commented May 30, 2025

Changes introduced with this PR

Allow the engine to attempt to select a new SPM host if the old one has network problems.

In the current behavior, the SPM role may not be transferred when the SPM host switches to the NonResponsive status, because a separate check prevents the transfer in the case of network problems.

However, the role does get transferred occasionally, because a failover may be triggered when a specific SPM command is called (where there is no such check), or when the ovirt-engine service is restarted.
We conducted research to figure out which behavior is correct and safe.

We came to the conclusion that some checks are redundant, because:

  1. During the selection of a new SPM host, the spmStatus command is run on one of the available hosts in the data center. Based on information from sanlock, that host returns the current SPM host ID to ovirt-engine. If a value is returned, additional checks are carried out before the role can be transferred to another host (including the check for the NonResponsive status).
  2. If the SPM host has lost contact only with the engine but is still connected to the master storage domain, the sanlock lease remains active, and ovirt-engine will get an error when it tries to select a new host for the SPM role.
  3. If the SPM host has lost contact with both the engine and the master storage domain, the sanlock lease is automatically released and vdsm itself is stopped because of the lost lease (i.e., dangerous storage operations cannot occur). The spmStatus command on any available host in the data center will then report that the SPM role is free. In this case, assigning the SPM role to another host is safe.
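The three points above can be condensed into a small decision sketch. This is hypothetical Python, not actual ovirt-engine or vdsm code; `SpmHostState`, `sanlock_lease_held`, and `takeover_is_safe` are illustrative names for the reasoning, under the assumption that the sanlock lease is renewed against the master storage domain:

```python
from dataclasses import dataclass


@dataclass
class SpmHostState:
    """Connectivity of the old SPM host (illustrative model only)."""
    engine_reachable: bool
    storage_reachable: bool


def sanlock_lease_held(state: SpmHostState) -> bool:
    # The sanlock lease lives on the master storage domain: the SPM host
    # keeps renewing it as long as it can reach storage, regardless of
    # engine connectivity (point 2); the lease expires once storage is
    # lost as well (point 3).
    return state.storage_reachable


def takeover_is_safe(old_spm: SpmHostState) -> bool:
    # spmStatus, run from any reachable host, effectively reports whether
    # the lease is still held (point 1); takeover may proceed only when
    # the lease is free.
    return not sanlock_lease_held(old_spm)
```

In this model an engine-only outage (`engine_reachable=False, storage_reachable=True`) keeps the lease held and blocks the takeover, while losing storage as well releases the lease and makes the takeover safe, which is exactly why the extra NonResponsive check is redundant.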

Are you the owner of the code you are sending in, or do you have permission of the owner?

[y/n] -> yes

@0ffer 0ffer force-pushed the allow-new-spm-when-network-error branch from 25cabd3 to 52fe15e on May 30, 2025 07:22
…nt SPM

There are cases when the SPM role can be safely transferred to another host while the current SPM has network problems,
but some checks do not allow the selection process to start.
With this change we allow the new-SPM selection process to start.
If the process is not safe, the selection will be stopped at a later step.

Signed-off-by: Stanislav Melnichuk <[email protected]>
@0ffer 0ffer force-pushed the allow-new-spm-when-network-error branch from 52fe15e to a7a4fb7 on June 4, 2025 07:42
@dupondje
Member

dupondje commented Jul 7, 2025

What if a command has been sent to the SPM, and before the SPM executes the command on the storage, it loses the connection with the engine? You cannot safely choose another SPM then, afaik.
Because if you then pick another SPM, there is still a chance the original command will get executed? Or is that not possible because no other SPM can be chosen, since sanlock (which is always active in all cases?) will not grant the lock?

@0ffer
Contributor Author

0ffer commented Jul 17, 2025

On the SPM host, we have a watchdog that frequently checks the sanlock status. If, for some reason, the host loses connection to the lock, the vdsm service will be forcibly stopped, and no operations can be executed. That is one aspect.

On the other hand, if we try to promote another host to SPM while sanlock is already active on the current SPM, the attempt will fail.

In your case, when the SPM host loses connection only to the engine, sanlock remains active, and no other host can be selected as SPM.
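The watchdog behavior described above can be sketched as a small loop. This is a minimal illustration, not vdsm's actual mechanism (which relies on sanlock's own lease renewal and host fencing rather than a Python loop); `lease_renewed` and `stop_vdsm` are hypothetical callbacks:

```python
import time
from typing import Callable


def watchdog_loop(lease_renewed: Callable[[], bool],
                  stop_vdsm: Callable[[], None],
                  interval: float = 10.0) -> None:
    # Poll the sanlock lease status; as soon as renewal against the
    # master storage domain fails, stop the service so that no further
    # storage operations can be executed on this host.
    while lease_renewed():
        time.sleep(interval)
    stop_vdsm()
```

The key property is ordering: the service is stopped before any command that arrived earlier can touch the storage without a valid lease, which is what makes promoting another host safe once the lease is gone.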
