CA-412983: HA doesn't keep trying to start best-effort VM #6619


Open
wants to merge 3 commits into master

Conversation

minglumlu
Member

The issue occurs in an HA-enabled pool. A VM with VM.ha_restart_priority set to best-effort is running on a host, and its disk resides on that host's local storage. When the host goes down, the VM cannot be restarted on other hosts because of the local-storage dependency. However, after the host recovers and comes back online, the VM still does not start automatically on the original host.

Expected behavior: the VM should automatically start on the original host once it has recovered. More generally, this behavior should apply to all non-agile VMs.


Signed-off-by: Ming Lu <[email protected]>
(* Count restart attempts for this best-effort VM: 1 on the first try,
   2 on the second, and the entry is dropped on the next update. *)
tried_best_eff_vms :=
  VMMap.update vm
    (Option.fold ~none:(Some 1) ~some:(fun n ->
         if n < 2 then Some (n + 1) else None
     )
    )
    !tried_best_eff_vms
Contributor


Instead of hardcoding the constant, it would be better to make this a xapi_globs/xapi.conf entry. That way, if we notice that the constant is not right in a particular scenario, we can tweak it without rebuilding XAPI.
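For illustration, a minimal sketch of such an entry, assuming the usual xapi_globs pattern of a mutable ref plus a (name, Arg spec, getter, description) tuple registered in the option list; the name ha_best_effort_retry_attempts and its default are hypothetical, not from the PR:

(* Hypothetical xapi_globs-style entry; the name and default are illustrative. *)
let ha_best_effort_retry_attempts = ref 2

let ha_best_effort_retry_attempts_option =
  ( "ha-best-effort-retry-attempts"
  , Arg.Set_int ha_best_effort_retry_attempts
  , (fun () -> string_of_int !ha_best_effort_retry_attempts)
  , "How many times HA keeps trying to start a non-agile best-effort VM" )

With an entry like this, the limit could be overridden from xapi.conf without rebuilding XAPI.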

Contributor

@edwintorok left a comment


Looks good; one minor improvement would be to make the number of retries configurable.

Also, how long is the delay between restart attempts? Exponential backoff might be useful.

@minglumlu
Member Author

Also, how long is the delay between restart attempts? Exponential backoff might be useful.

The delay is just the periodic constant delay of ha_monitor. Currently it is 20 seconds.

@lindig
Contributor

lindig commented Aug 8, 2025

I would consider an exponential backoff: retry after 1, 2, 4, 8, ..., 64 minutes, then keep trying every 64 minutes if it has not succeeded earlier. This upper limit could be something else, of course, and the initial wait could be larger than 1 minute.

  • maintain a list of VMs that should be restarted but failed; remember the time of the last attempt and the time to wait before the next attempt (which gets increased)
  • have a thread (or other periodic mechanism) that inspects the list periodically and tries to restart VMs that have waited long enough; remove them on success, or keep them with an increased wait time (see the sketch below)
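A minimal sketch of that bookkeeping, assuming a periodic caller and a hypothetical try_start function; the VmMap name and the 64-minute cap are illustrative, not taken from the PR:

(* Sketch only: per-VM retry state with an exponentially growing wait. *)
module VmMap = Map.Make (String) (* keyed by VM ref *)

type retry_state = {
    last_attempt: float (* Unix time of the last attempt *)
  ; wait: float (* seconds to wait before the next attempt *)
}

let max_wait = 64. *. 60. (* cap the backoff at 64 minutes *)

let pending : retry_state VmMap.t ref = ref VmMap.empty

(* Run periodically: retry VMs whose wait has elapsed; drop them on success,
   otherwise double the wait up to the cap. *)
let retry_pending ~now ~try_start =
  pending :=
    VmMap.filter_map
      (fun vm st ->
        if now -. st.last_attempt < st.wait then Some st
        else if try_start vm then None
        else Some {last_attempt= now; wait= min (st.wait *. 2.) max_wait}
      )
      !pending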

@minglumlu
Member Author

I would consider an exponential backoff: ...

This retry depends on changes in the HA live set, so it is embedded within the main loop of Monitor.ha_monitor(), which sleeps 20 seconds per iteration and refreshes the live set on every pass. A backoff mechanism with a separate thread alongside that loop looks too heavy.
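As a rough illustration of that design choice, a sketch of bounded retries folded into such a periodic loop; the Attempts module, try_start, and the candidates list are stand-ins, not the PR's actual code:

(* Sketch: retry each candidate at most max_attempts times, once per monitor
   iteration; the ~20-second loop delay provides the pacing, so no separate
   backoff thread is needed. *)
module Attempts = Map.Make (String)

let attempts : int Attempts.t ref = ref Attempts.empty

let max_attempts = 2 (* illustrative; could come from xapi.conf as discussed above *)

let retry_best_effort ~try_start candidates =
  List.iter
    (fun vm ->
      let n = Option.value ~default:0 (Attempts.find_opt vm !attempts) in
      if n < max_attempts then (
        attempts := Attempts.add vm (n + 1) !attempts ;
        try_start vm
      )
    )
    candidates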

@@ -508,24 +508,26 @@ module Monitor = struct
let liveset_uuids =
  List.sort compare (uuids_of_liveset liveset)
in
let to_refs uuids =
Contributor


Refs are just strings; should we not use a Set for efficiency?

Member Author


This is actually a sorted list. But yes, Set.mem would be more efficient than List.mem.
I would like to get this fix merged first and raise a separate change for that.
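For reference, a small sketch of the Set-based membership check being discussed; RefSet, live_ref_set, and is_live are illustrative names, not the PR's code:

(* Refs are plain strings, so an ordered string set gives O(log n)
   membership instead of List.mem's O(n) scan. *)
module RefSet = Set.Make (String)

(* Build the set once per monitor iteration from the list produced by to_refs. *)
let live_ref_set refs = RefSet.of_list refs

let is_live set vm_ref = RefSet.mem vm_ref set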

let best_effort_vms =
  (* Carefully decide which best-effort VMs should attempt to start. *)
  let all_prot_is_ok = List.for_all (fun (_, r) -> r = Ok ()) started in
  let is_better = List.length live_set > List.length last_live_set in
Contributor


Unfortunately, Set.cardinal would also be O(n).

Member


A small optimization here could be:

Suggested change
- let is_better = List.length live_set > List.length last_live_set in
+ let is_better = List.compare_lengths live_set last_live_set > 0 in

https://ocaml.org/manual/5.3/api/List.html#VALcompare_lengths

match Db.VM.get_record ~__context ~self with
| r ->
    Left (self, r)
| exception _ ->
Contributor


This could use an explanation. The reason we can see an exception is that the VM was deleted but we still have a reference to it in our VMMap?

Member Author


Yes. This is to ensure that we don't attempt to start an invalid VM.
