Identify permanent errors upon reading checkpoint #375

jgehrcke · 2025-05-21T14:30:48Z

A tiny patch to collect feedback on an idea before I continue.

Currently, an unexpected JSON document in the checkpoint file may result in lots of retrying. Logs will show the following type of message repeatedly:

E0520 19:57:59.310850       1 workqueue.go:99] Failed to reconcile work item: 
    error unpreparing devices for claim 'default/imex-<snip>e5':
    unable to get checkpoint: json: cannot unmarshal array into Go struct field
        CheckpointV1.v1.preparedClaims of type main.PreparedClaim

(reformatted to be less wide)

The concept of permanentError is in the code base already. Here, we can make use of it.

The complete set of possible errors that can be returned by GetCheckpoint() is ~unknown (it could be known, but it is not small: there's I/O, there's JSON decoding, there's checksum verification, and maybe more).

Current assumption: it is worse to not retry a retryable error than it is to retry a permanent error. That assumption clearly wants to be tested. Thoughts? The kubelet might make us retry something that we have previously declared as "permanent error" anyway, right?

Maybe one obvious question: should we treat any error returned by GetCheckpoint() as permanent? Currently, I think the answer should be "no" because i) I/O is involved and -- even if rare -- I/O may retryably fail, and ii) I just don't know at this point if there are other retryable errors happening here, and I better assume so.

Another question: do we want to distinguish different permanent errors (so that we can generate a dedicated error message prefix for each), or do we want to honor them with a single code path and err msg prefix? I don't care too much as of now. I have picked one option for now and appreciate feedback.

Should we use errors.Is() or errors.As() here? I think Is() is safe because I didn't see any wrapping. But maybe we want to use As() to be on the safe side.

Noteworthy: there are four locations in code where we call GetCheckpoint() (prepare+unprepare, gpu+CD). I'd like to work towards a common implementation for all places, and appreciate feedback towards that (should we start a new "library", consumed by both plugins)?

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

copy-pr-bot · 2025-05-21T14:30:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jgehrcke · 2025-05-21T14:34:21Z

cmd/compute-domain-kubelet-plugin/device_state.go

+		// may be retryable. For now, handle an explicit set of errors as
+		// permanent, and treat every other error as retryable.
+
+		if errors.Is(err, &json.UnmarshalTypeError{}) {


Note: there is no overarching error type representing any error during Unmarshal().

As of https://pkg.go.dev/encoding/json#UnmarshalTypeError, Unmarshal() can actually return three different types of error:

Unmarshal returns an InvalidUnmarshalError.

Unmarshal returns a SyntaxError.

Unmarshal returns an UnmarshalTypeError

SyntaxError means "bad JSON", we also want to treat that as permanent.

What would be a concise way to test "is this error type any of these...?"

klueska · 2025-05-21T14:41:58Z

The kubelet might make us retry something that we have previously declared as "permanent error" anyway, right?

The kubelet will definitely make us retry. There is currently no way to distinguish between transient / permanent errors passed back to the kubelet. Having the notion in our driver is there to future proof us once the kubelet does support this / not keep us looping in the driver unnecessarily on a permanent error -- we return immediately back to the kubelet who has a longer timeout than we in the driver do.

klueska · 2025-05-21T14:44:05Z

Unless we are absolutely certain about which one it should be, I honestly wouldn't worry about the distinction between transient / permanent errors at this granularity yet. We can slowly add them in preparation for the kubelet even supporting them at all.

jgehrcke · 2025-05-21T15:01:10Z

The kubelet will definitely make us retry

👍

[...] once the kubelet does support this [...]

I see. So for now, a false-positive permanent error isn't the end of the world yet. But it will be a rather painful thing in the future.

not keep us looping in the driver unnecessarily on a permanent error

Yes, so that's the motivation here.

Screenshot of a log bomb here, for the record (note the timestamps, too):
https://github.com/user-attachments/assets/e31ca03d-bdd2-4cf6-9f93-c9deaff82732

That's what we should address (elsewhere, we noted that we probably want to tune the retry parameters, too).

Unless we are absolutely certain about which one it should be

We are certain that json.UnmarshalTypeError is reflecting a generally valid JSON document with an unexpected schema. That one we can safely categorize as a permanent error.

jgehrcke · 2025-07-01T11:14:25Z

Closing this for now.

Identify permanent errors upon reading checkpoint

e827c18

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke commented May 21, 2025

View reviewed changes

jgehrcke mentioned this pull request Jun 10, 2025

Migrate checkpoint format from the format in 25.3.0-rc.2 to the latest #394

Merged

klueska added this to Planning Board: k8s-dra-driver-gpu Jun 16, 2025

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Jun 16, 2025

jgehrcke closed this Jul 1, 2025

github-project-automation bot moved this from Backlog to Closed in Planning Board: k8s-dra-driver-gpu Jul 1, 2025

This was referenced Sep 18, 2025

Return strict-decode errors in NodePrepareResources() as permanentError #599

Closed

Test for strict opaque config-decoding in prepare path #598

Merged

jgehrcke mentioned this pull request Oct 8, 2025

Use custom rate limiter for kubelet plugin (un)prepare retries #655

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Identify permanent errors upon reading checkpoint #375

Identify permanent errors upon reading checkpoint #375

Uh oh!

jgehrcke commented May 21, 2025 •

edited

Loading

Uh oh!

copy-pr-bot bot commented May 21, 2025

Uh oh!

jgehrcke May 21, 2025

Uh oh!

klueska commented May 21, 2025

Uh oh!

klueska commented May 21, 2025

Uh oh!

jgehrcke commented May 21, 2025 •

edited

Loading

Uh oh!

jgehrcke commented Jul 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Identify permanent errors upon reading checkpoint #375

Identify permanent errors upon reading checkpoint #375

Uh oh!

Conversation

jgehrcke commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

copy-pr-bot bot commented May 21, 2025

Uh oh!

jgehrcke May 21, 2025

Choose a reason for hiding this comment

Uh oh!

klueska commented May 21, 2025

Uh oh!

klueska commented May 21, 2025

Uh oh!

jgehrcke commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgehrcke commented Jul 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jgehrcke commented May 21, 2025 •

edited

Loading

jgehrcke commented May 21, 2025 •

edited

Loading