CA-412983: HA doesn't keep trying to start best-effort VM #6619
Conversation
The issue occurs in a scenario involving an HA-enabled pool. A VM with its VM.ha_restart_priority set to best-effort is running on a host, and the VM's disk resides on that host's local storage. When the host goes down, the VM cannot be restarted on other hosts because of the local-storage dependency. However, after the host recovers and comes back online, the VM still does not automatically start on the original host. Expected behavior: the VM should automatically start on the original host once it has recovered. Generally, this behavior should apply to all non-agile VMs. Signed-off-by: Ming Lu <[email protected]>
ocaml/xapi/xapi_ha_vm_failover.ml
Outdated
tried_best_eff_vms :=
  VMMap.update vm
    (Option.fold ~none:(Some 1) ~some:(fun n ->
         if n < 2 then Some (n + 1) else None
Instead of hardcoding the constant, it would be better to make this a xapi_globs/xapi.conf entry. That way, if we notice that the constant is not right in a particular scenario, we can tweak it without rebuilding XAPI.
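To illustrate the suggestion, a minimal self-contained sketch of a configurable retry limit; the name ha_best_effort_retry_limit and its wiring into xapi.conf are assumptions for illustration, not the actual xapi_globs API:

```ocaml
(* Minimal sketch: a per-VM retry counter kept in a map, with the limit read
   from a mutable value that xapi.conf could override at startup instead of a
   hardcoded constant. The name ha_best_effort_retry_limit is hypothetical. *)
module VMMap = Map.Make (String)

let ha_best_effort_retry_limit = ref 2

(* Bump the retry count for [vm]; drop the entry once the limit is reached. *)
let record_attempt vm counts =
  VMMap.update vm
    (function
      | None -> Some 1
      | Some n when n < !ha_best_effort_retry_limit -> Some (n + 1)
      | Some _ -> None)
    counts

let () =
  let counts = record_attempt "vm-uuid" VMMap.empty in
  let counts = record_attempt "vm-uuid" counts in
  let counts = record_attempt "vm-uuid" counts in
  (* After the limit (2) is exhausted, the VM is no longer tracked. *)
  assert (not (VMMap.mem "vm-uuid" counts))
```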
Looks good; one minor improvement: make the number of retries configurable.
Also, how long is the delay between restart attempts? Exponential backoff might be useful.
The delay is just the periodic constant delay of ha_monitor. Currently it is 20 seconds.
I would consider an exponential backoff: retry after 1, 2, 4, 8, ..., 64 minutes, then keep trying every 64 minutes when it has not succeeded earlier. This upper limit could be something else, of course. We could also start with a larger initial wait than 1 minute.
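A sketch of the capped exponential backoff described here; the 1-minute base and 64-minute cap are only the values mentioned above:

```ocaml
(* Delay before retry number [attempt] (0-based), in minutes:
   1, 2, 4, 8, ..., capped at 64. Base and cap are the values suggested
   in the comment above, not settled constants. *)
let backoff_minutes ~attempt = min 64 (1 lsl min attempt 6)

let () =
  [0; 1; 2; 3; 6; 10]
  |> List.iter (fun n ->
         Printf.printf "attempt %d: wait %d min\n" n (backoff_minutes ~attempt:n))
```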
This retry depends on the HA liveset changes, so it is embedded within the main loop of Monitor.ha_monitor(), which has a 20-second delay on each iteration in which the live set is updated. A backoff (with a separate thread) inside the main loop looks too heavy.
@@ -508,24 +508,26 @@ module Monitor = struct
      let liveset_uuids =
        List.sort compare (uuids_of_liveset liveset)
      in
      let to_refs uuids =
Refs are just strings; should we not use a Set for efficiency?
This is actually a sorted list. But yes, Set.mem would be more efficient than List.mem.
I would like to get this fix done first and raise another PR for that.
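For reference, a sketch of what the Set-based lookup discussed here could look like in the follow-up change; the UUID strings are placeholders:

```ocaml
(* Membership tests on a set of UUID strings are O(log n),
   versus O(n) for List.mem on the sorted list. *)
module StringSet = Set.Make (String)

let () =
  let liveset_uuids = ["uuid-a"; "uuid-b"; "uuid-c"] in
  let live = StringSet.of_list liveset_uuids in
  assert (StringSet.mem "uuid-b" live) ;
  assert (not (StringSet.mem "uuid-z" live))
```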
ocaml/xapi/xapi_ha_vm_failover.ml
Outdated
let best_effort_vms =
  (* Carefully decide which best-effort VMs should attempt to start. *)
  let all_prot_is_ok = List.for_all (fun (_, r) -> r = Ok ()) started in
  let is_better = List.length live_set > List.length last_live_set in
Unfortunately, Set.cardinal would also be O(n).
A small optimization here can be:
- let is_better = List.length live_set > List.length last_live_set in
+ let is_better = List.compare_lengths live_set last_live_set > 0 in
https://ocaml.org/manual/5.3/api/List.html#VALcompare_lengths
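A quick self-contained illustration of the suggested change; List.compare_lengths stops as soon as one list runs out instead of traversing both lists fully:

```ocaml
let () =
  let live_set = ["host1"; "host2"; "host3"] and last_live_set = ["host1"] in
  (* Positive result: live_set is longer, i.e. more hosts are live than before. *)
  let is_better = List.compare_lengths live_set last_live_set > 0 in
  assert is_better
```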
match Db.VM.get_record ~__context ~self with
| r ->
    Left (self, r)
| exception _ ->
This could use an explanation. The reason we can see an exception is that the VM was deleted but we still have a reference in our VMMap?
Yes. This is to ensure that we don't attempt to start an invalid VM.
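A minimal sketch of the pattern in the hunk above: skip any reference whose record lookup raises, e.g. because the VM was deleted while we still tracked it. get_record below is only a stand-in for Db.VM.get_record:

```ocaml
(* Keep only references whose record lookup succeeds; a raised exception
   (e.g. the VM was deleted while we still held its reference) drops the
   entry. [get_record] stands in for Db.VM.get_record. *)
let valid_vms ~get_record refs =
  List.filter_map
    (fun self ->
      match get_record self with
      | r -> Some (self, r)
      | exception _ -> None)
    refs

let () =
  let records = [("vm1", "rec1"); ("vm3", "rec3")] in
  let get_record self = List.assoc self records (* raises Not_found for vm2 *) in
  assert (
    valid_vms ~get_record ["vm1"; "vm2"; "vm3"]
    = [("vm1", "rec1"); ("vm3", "rec3")] )
```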