CA-412983: HA doesn't keep trying to start best-effort VM #6619


Open
wants to merge 3 commits into master

Conversation

minglumlu
Member

The issue occurs in an HA-enabled pool. A VM with VM.ha_restart_priority set to best-effort is running on a host, and its disk resides on that host's local storage. When the host goes down, the VM cannot be restarted on other hosts because of the local-storage dependency. However, after the host recovers and comes back online, the VM still does not start automatically on the original host.

Expected behavior: the VM should automatically start on the original host once it has recovered. More generally, this behavior should apply to all non-agile VMs.


Signed-off-by: Ming Lu <[email protected]>
(* Count restart attempts for this best-effort VM: 1 on the first try,
   2 on the second, and the entry is dropped on the next update. *)
tried_best_eff_vms :=
  VMMap.update vm
    (Option.fold ~none:(Some 1) ~some:(fun n ->
         if n < 2 then Some (n + 1) else None
     )
    )
    !tried_best_eff_vms
Contributor


Instead of hardcoding the constant, it would be better to make this a xapi_globs/xapi.conf entry. That way, if we notice that the constant is not right in a particular scenario, we can tweak it without rebuilding XAPI.
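For illustration, a minimal sketch of such an entry, assuming the usual xapi_globs pattern of a mutable ref plus a (name, Arg spec, getter, description) tuple registered in the option list; the name ha_best_effort_retry_attempts and its default are hypothetical, not from the PR:

(* Hypothetical xapi_globs-style entry; the name and default are illustrative. *)
let ha_best_effort_retry_attempts = ref 2

let ha_best_effort_retry_attempts_option =
  ( "ha-best-effort-retry-attempts"
  , Arg.Set_int ha_best_effort_retry_attempts
  , (fun () -> string_of_int !ha_best_effort_retry_attempts)
  , "How many times HA keeps trying to start a non-agile best-effort VM" )

With an entry like this, the limit could be overridden from xapi.conf without rebuilding XAPI.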

Contributor

@edwintorok left a comment


Looks good; one minor improvement would be to make the number of retries configurable.

Also, how long is the delay between restart attempts? Exponential backoff might be useful.

@minglumlu
Member Author

Also, how long is the delay between restart attempts? Exponential backoff might be useful.

The delay is just the periodic constant delay of ha_monitor. Currently it is 20 seconds.

@lindig
Contributor

lindig commented Aug 8, 2025

I would consider an exponential backoff: retry after 1, 2, 4, 8, ..., 64 minutes, then keep trying every 64 minutes if it has not succeeded earlier. This upper limit could be something else, of course, and the initial wait could be larger than 1 minute.

  • maintain a list of VMs that should be restarted but failed; remember the time of the last attempt and the time to wait before the next attempt (which gets increased)
  • have a thread (or other periodic mechanism) that inspects the list periodically and tries to restart VMs that have waited long enough; remove them on success, or keep them with an increased wait time (see the sketch below)
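A minimal sketch of that bookkeeping, assuming a periodic caller and a hypothetical try_start function; the VmMap name and the 64-minute cap are illustrative, not taken from the PR:

(* Sketch only: per-VM retry state with an exponentially growing wait. *)
module VmMap = Map.Make (String) (* keyed by VM ref *)

type retry_state = {
    last_attempt: float (* Unix time of the last attempt *)
  ; wait: float (* seconds to wait before the next attempt *)
}

let max_wait = 64. *. 60. (* cap the backoff at 64 minutes *)

let pending : retry_state VmMap.t ref = ref VmMap.empty

(* Run periodically: retry VMs whose wait has elapsed; drop them on success,
   otherwise double the wait up to the cap. *)
let retry_pending ~now ~try_start =
  pending :=
    VmMap.filter_map
      (fun vm st ->
        if now -. st.last_attempt < st.wait then Some st
        else if try_start vm then None
        else Some {last_attempt= now; wait= min (st.wait *. 2.) max_wait}
      )
      !pending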

@minglumlu
Member Author

I would consider an exponential backoff: ...

This retry depends on changes in the HA live set, so it is embedded within the main loop of Monitor.ha_monitor(), which sleeps 20 seconds per iteration and refreshes the live set on every pass. A backoff mechanism with a separate thread alongside that loop looks too heavy.
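As a rough illustration of that design choice, a sketch of bounded retries folded into such a periodic loop; the Attempts module, try_start, and the candidates list are stand-ins, not the PR's actual code:

(* Sketch: retry each candidate at most max_attempts times, once per monitor
   iteration; the ~20-second loop delay provides the pacing, so no separate
   backoff thread is needed. *)
module Attempts = Map.Make (String)

let attempts : int Attempts.t ref = ref Attempts.empty

let max_attempts = 2 (* illustrative; could come from xapi.conf as discussed above *)

let retry_best_effort ~try_start candidates =
  List.iter
    (fun vm ->
      let n = Option.value ~default:0 (Attempts.find_opt vm !attempts) in
      if n < max_attempts then (
        attempts := Attempts.add vm (n + 1) !attempts ;
        try_start vm
      )
    )
    candidates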

@@ -508,24 +508,26 @@ module Monitor = struct
let liveset_uuids =
  List.sort compare (uuids_of_liveset liveset)
in
let to_refs uuids =
Contributor


Refs are just strings; should we not use a Set for efficiency?

Member Author


This is actually a sorted list. But yes, Set.mem would be more efficient than List.mem.
I would like to get this fix merged first and raise a separate change for that.
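For reference, a small sketch of the Set-based membership check being discussed; RefSet, live_ref_set, and is_live are illustrative names, not the PR's code:

(* Refs are plain strings, so an ordered string set gives O(log n)
   membership instead of List.mem's O(n) scan. *)
module RefSet = Set.Make (String)

(* Build the set once per monitor iteration from the list produced by to_refs. *)
let live_ref_set refs = RefSet.of_list refs

let is_live set vm_ref = RefSet.mem vm_ref set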

let best_effort_vms =
  (* Carefully decide which best-effort VMs should attempt to start. *)
  let all_prot_is_ok = List.for_all (fun (_, r) -> r = Ok ()) started in
  let is_better = List.length live_set > List.length last_live_set in
Contributor


Unfortunately, Set.cardinal would also be O(n).

Member


A small optimization here could be:

Suggested change
- let is_better = List.length live_set > List.length last_live_set in
+ let is_better = List.compare_lengths live_set last_live_set > 0 in

https://ocaml.org/manual/5.3/api/List.html#VALcompare_lengths

match Db.VM.get_record ~__context ~self with
| r ->
    Left (self, r)
| exception _ ->
Contributor


This could use an explanation. The reason we can see an exception is that the VM was deleted but we still have a reference to it in our VMMap?

Member Author


Yes. This is to ensure that we don't attempt to start an invalid VM.
