Conversation

@0ffer
Contributor

@0ffer 0ffer commented May 30, 2025

Changes introduced with this PR

Allow the engine to attempt to select a new SPM host if the old one has network problems.

In the current behavior, the SPM role may not be transferred when the SPM host switches to the NonResponsive status, because a separate check prevents the transfer in the case of network problems.

However, the role does get transferred occasionally, because a failover may be triggered when a specific SPM command is called (where there is no such check), or when the ovirt-engine service is restarted.
We conducted research to figure out which behavior is correct and safe.

We came to the conclusion that some checks are redundant, because:

  1. During the selection of a new SPM host, the spmStatus command is run on one of the available hosts in the data center. Based on information from sanlock, that host returns the current SPM host ID to ovirt-engine. If a value is returned, additional checks are carried out before the role can be transferred to another host (including the check for the NonResponsive status).
  2. If the SPM host has lost contact only with the engine but is still connected to the master storage domain, the sanlock lease remains active, and ovirt-engine will get an error when it tries to select a new host for the SPM role.
  3. If the SPM host has lost contact with both the engine and the master storage domain, the sanlock lease is automatically released and vdsm itself is stopped because of the lost lease (i.e., dangerous storage operations cannot occur). The spmStatus command on any available host in the data center will then report that the SPM role is free. In this case, assigning the SPM role to another host is safe.
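The three points above can be condensed into a small decision sketch. This is hypothetical Python, not actual ovirt-engine or vdsm code; `SpmHostState`, `sanlock_lease_held`, and `takeover_is_safe` are illustrative names for the reasoning, under the assumption that the sanlock lease is renewed against the master storage domain:

```python
from dataclasses import dataclass


@dataclass
class SpmHostState:
    """Connectivity of the old SPM host (illustrative model only)."""
    engine_reachable: bool
    storage_reachable: bool


def sanlock_lease_held(state: SpmHostState) -> bool:
    # The sanlock lease lives on the master storage domain: the SPM host
    # keeps renewing it as long as it can reach storage, regardless of
    # engine connectivity (point 2); the lease expires once storage is
    # lost as well (point 3).
    return state.storage_reachable


def takeover_is_safe(old_spm: SpmHostState) -> bool:
    # spmStatus, run from any reachable host, effectively reports whether
    # the lease is still held (point 1); takeover may proceed only when
    # the lease is free.
    return not sanlock_lease_held(old_spm)
```

In this model an engine-only outage (`engine_reachable=False, storage_reachable=True`) keeps the lease held and blocks the takeover, while losing storage as well releases the lease and makes the takeover safe, which is exactly why the extra NonResponsive check is redundant.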

Are you the owner of the code you are sending in, or do you have permission of the owner?

[y/n] -> yes

@0ffer 0ffer force-pushed the allow-new-spm-when-network-error branch from 25cabd3 to 52fe15e on May 30, 2025 07:22
…nt SPM

There are cases when the SPM role can be safely transferred to another host while the current SPM has network problems,
but some checks do not allow the selection process to start.
With this change we allow the new-SPM selection process to start.
If the process is not safe, the selection will be stopped at a later step.

Signed-off-by: Stanislav Melnichuk <[email protected]>
@0ffer 0ffer force-pushed the allow-new-spm-when-network-error branch from 52fe15e to a7a4fb7 on June 4, 2025 07:42
@dupondje
Member

dupondje commented Jul 7, 2025

What if a command has been sent to the SPM, and before the SPM executes the command on the storage, it loses the connection with the engine? You cannot safely choose another SPM then, afaik.
Because if you then pick another SPM, there is still a chance the original command will get executed? Or is that not possible because no other SPM can be chosen, since sanlock (which is always active in all cases?) will not grant the lock?

@0ffer
Contributor Author

0ffer commented Jul 17, 2025

On the SPM host, we have a watchdog that frequently checks the sanlock status. If, for some reason, the host loses connection to the lock, the vdsm service will be forcibly stopped, and no operations can be executed. That is one aspect.

On the other hand, if we try to promote another host to SPM while sanlock is already active on the current SPM, the attempt will fail.

In your case, when the SPM host loses connection only to the engine, sanlock remains active, and no other host can be selected as SPM.
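The watchdog behavior described above can be sketched as a small loop. This is a minimal illustration, not vdsm's actual mechanism (which relies on sanlock's own lease renewal and host fencing rather than a Python loop); `lease_renewed` and `stop_vdsm` are hypothetical callbacks:

```python
import time
from typing import Callable


def watchdog_loop(lease_renewed: Callable[[], bool],
                  stop_vdsm: Callable[[], None],
                  interval: float = 10.0) -> None:
    # Poll the sanlock lease status; as soon as renewal against the
    # master storage domain fails, stop the service so that no further
    # storage operations can be executed on this host.
    while lease_renewed():
        time.sleep(interval)
    stop_vdsm()
```

The key property is ordering: the service is stopped before any command that arrived earlier can touch the storage without a valid lease, which is what makes promoting another host safe once the lease is gone.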
