
CD kubelet plugin may prepare channel0 multiple times #641

@jgehrcke

Description


Observed in a two-node nvbandwidth fault injection test.

Force-deletion of a worker pod can result in a replacement pod being started without an underlying node-local CD daemon pod (and hence w/o a backing IMEX daemon). CUDA memory sharing API calls in the workload then fail with CUDA_ERROR_NOT_SUPPORTED. Example:

[CUDA_ERROR_NOT_SUPPORTED] operation not supported in expression cuMemImportFromShareableHandle(&handle, (void *)&fh, handleType) on nvbandwidth-test-2-worker-1, rank = 3 in MultinodeMemoryAllocationUnicast::MultinodeMemoryAllocationUnicast(size_t, int)() : /bandwidthtest/nvbandwidth/multinode_memcpy.cpp:66

I found this to be racy, but rather frequently reproducible.

At a high level, this is what happens: the CD daemon local to the killed worker pod disappears permanently, yet the replacement worker pod still transitions to Running (on the same node as before).

Inspection showed that the CD daemon in question was torn down because the CD node label was removed from that node and never re-applied.

I tried to understand why that happened and found that the cause is the combination of the following two conditions:

  1. A resource claim delete+create happening in quick succession in the API server may result in the kubelet emitting the corresponding prepare and unprepare requests to the kubelet plugin in flipped order (prepare for the new claim before unprepare for the old one).
  2. The CD kubelet plugin currently may allocate channel0 more than once.

In one specific scenario, this was the timeline of events inferred from the CD kubelet plugin log:

40.462 NEW claim prepare request
40.464 NEW claim prepare response (prepare done)
41.014 OLD claim unprepare request
41.029 OLD claim unprepare response (unprepare done)

(first column is relative time in seconds)

Both claims (old, new) are competing for the same device (IMEX channel 0, in our current design). From the timeline, it's clear that the new claim was prepared before the old claim was unprepared.

It's easy to say now that this prepare-before-unprepare should not be allowed.

The fallout in detail is the following sequence of events:

  1. New claim prepare checks whether the node is READY in the CD status: yes.
  2. New claim prepare adds the CD node label: a no-op, since the label is already there.
  3. New claim prepare checks node readiness in the CD status, looks good -- the workload pod gets released.
  4. Old claim unprepare removes the CD node label and triggers CD daemon teardown -- the node does, after all, get removed from the CD status.

(4) results in (permanent) CD daemon teardown underneath the workload pod, and removal of the node from CD status.
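
To make the no-op in step (2) and the teardown in step (4) concrete, here is a minimal sketch of the label-handling pattern, assuming a hypothetical label key and plain client-go node patching (illustrative only, not the actual driver code): the add is idempotent, and the remove is unconditional, so an out-of-order unprepare deletes the very label the new claim's pod relies on.

```go
package cdlabels

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical label key; the real CD node label name differs.
const cdNodeLabel = "example.com/compute-domain"

// AddCDNodeLabel is what the new claim's prepare does (step 2): the patch is
// idempotent, so if the label is already present it is a silent no-op and
// prepare proceeds as if it had set up the CD daemon itself.
func AddCDNodeLabel(ctx context.Context, c kubernetes.Interface, node, cdUID string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:%q}}}`, cdNodeLabel, cdUID))
	_, err := c.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

// RemoveCDNodeLabel is what the old claim's unprepare does (step 4): it removes
// the label unconditionally, which triggers CD daemon teardown underneath the
// workload pod that the new claim's prepare just released.
func RemoveCDNodeLabel(ctx context.Context, c kubernetes.Interface, node string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:null}}}`, cdNodeLabel))
	_, err := c.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```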

Conclusion + solution:

  • No assumptions should be made about the order of prepare and unprepare requests emitted by the kubelet to a kubelet plugin.
  • Before letting a prepare() call succeed, we need to actually verify that the requested device isn't currently allocated by/for a different ResourceClaim. The source of truth for answering that is node-local state only (as represented by the kubelet checkpoint data); a sketch of such a guard follows below. I can see that we knew this all along and simply forgot.
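
A minimal sketch of the kind of guard the second bullet implies, using a hypothetical in-memory view of the node-local checkpoint data (names, types, and the ownership check on unprepare are illustrative assumptions, not the actual driver code):

```go
package checkpointguard

import (
	"fmt"
	"sync"
)

// PreparedDevices is a hypothetical, simplified view of the node-local
// checkpoint data: which ResourceClaim currently holds which device.
type PreparedDevices struct {
	mu      sync.Mutex
	ownerOf map[string]string // e.g. "imex-channel-0" -> ResourceClaim UID
}

func NewPreparedDevices() *PreparedDevices {
	return &PreparedDevices{ownerOf: make(map[string]string)}
}

// Prepare succeeds only if the requested device is free or already held by the
// same claim (idempotent retry). A prepare that races ahead of the previous
// claim's unprepare is rejected instead of silently allocating channel0 twice.
func (p *PreparedDevices) Prepare(claimUID, device string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if owner, ok := p.ownerOf[device]; ok && owner != claimUID {
		return fmt.Errorf("device %q is still prepared for claim %q; it must be unprepared first", device, owner)
	}
	p.ownerOf[device] = claimUID
	return nil
}

// Unprepare releases the device only if it is still owned by the given claim,
// so a late, out-of-order unprepare cannot tear down state behind a newer claim.
func (p *PreparedDevices) Unprepare(claimUID, device string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.ownerOf[device] == claimUID {
		delete(p.ownerOf, device)
	}
}
```

With a guard like this, the prepare at 40.462 in the timeline above would have failed and presumably been retried by the kubelet, succeeding only once the old claim's unprepare at 41.014 had released channel0.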
