Skip to content

Conversation

@jgehrcke
Copy link
Collaborator

@jgehrcke jgehrcke commented May 21, 2025

A tiny patch to collect feedback on an idea before I continue.

Currently, an unexpected JSON document in the checkpoint file may result in lots of retrying. Logs will show the following type of message repeatedly:

E0520 19:57:59.310850       1 workqueue.go:99] Failed to reconcile work item: 
    error unpreparing devices for claim 'default/imex-<snip>e5':
    unable to get checkpoint: json: cannot unmarshal array into Go struct field
        CheckpointV1.v1.preparedClaims of type main.PreparedClaim

(reformatted to be less wide)

The concept of permanentError is in the code base already. Here, we can make use of it.

The complete set of possible errors that can be returned by GetCheckpoint() is ~unknown (it could be known, but it is not small: there's I/O, there's JSON decoding, there's checksum verification, and maybe more).

Current assumption: it is worse to not retry a retryable error than it is to retry a permanent error. That assumption clearly wants to be tested. Thoughts? The kubelet might make us retry something that we have previously declared as "permanent error" anyway, right?

Maybe one obvious question: should we treat any error returned by GetCheckpoint() as permanent? Currently, I think the answer should be "no" because i) I/O is involved and -- even if rare -- I/O may retryably fail, and ii) I just don't know at this point if there are other retryable errors happening here, and I better assume so.

Another question: do we want to distinguish different permanent errors (so that we can generate a dedicated error message prefix for each), or do we want to honor them with a single code path and err msg prefix? I don't care too much as of now. I have picked one option for now and appreciate feedback.

Should we use errors.Is() or errors.As() here? I think Is() is safe because I didn't see any wrapping. But maybe we want to use As() to be on the safe side.

Noteworthy: there are four locations in code where we call GetCheckpoint() (prepare+unprepare, gpu+CD). I'd like to work towards a common implementation for all places, and appreciate feedback towards that (should we start a new "library", consumed by both plugins)?

@copy-pr-bot
Copy link

copy-pr-bot bot commented May 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

// may be retryable. For now, handle an explicit set of errors as
// permanent, and treat every other error as retryable.

if errors.Is(err, &json.UnmarshalTypeError{}) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: there is no overarching error type representing any error during Unmarshal().

As of https://pkg.go.dev/encoding/json#UnmarshalTypeError, Unmarshal() can actually return three different types of error:

Unmarshal returns an InvalidUnmarshalError.

Unmarshal returns a SyntaxError.

Unmarshal returns an UnmarshalTypeError

SyntaxError means "bad JSON", we also want to treat that as permanent.

What would be a concise way to test "is this error type any of these...?"

@klueska
Copy link
Collaborator

klueska commented May 21, 2025

The kubelet might make us retry something that we have previously declared as "permanent error" anyway, right?

The kubelet will definitely make us retry. There is currently no way to distinguish between transient / permanent errors passed back to the kubelet. Having the notion in our driver is there to future proof us once the kubelet does support this / not keep us looping in the driver unnecessarily on a permanent error -- we return immediately back to the kubelet who has a longer timeout than we in the driver do.

@klueska
Copy link
Collaborator

klueska commented May 21, 2025

Unless we are absolutely certain about which one it should be, I honestly wouldn't worry about the distinction between transient / permanent errors at this granularity yet. We can slowly add them in preparation for the kubelet even supporting them at all.

@jgehrcke
Copy link
Collaborator Author

jgehrcke commented May 21, 2025

The kubelet will definitely make us retry

👍

[...] once the kubelet does support this [...]

I see. So for now, a false-positive permanent error isn't the end of the world yet. But it will be a rather painful thing in the future.

not keep us looping in the driver unnecessarily on a permanent error

Yes, so that's the motivation here.

Screenshot of a log bomb here, for the record (note the timestamps, too):
https://github.com/user-attachments/assets/e31ca03d-bdd2-4cf6-9f93-c9deaff82732

That's what we should address (elsewhere, we noted that we probably want to tune the retry parameters, too).

Unless we are absolutely certain about which one it should be

We are certain that json.UnmarshalTypeError is reflecting a generally valid JSON document with an unexpected schema. That one we can safely categorize as a permanent error.

@jgehrcke
Copy link
Collaborator Author

jgehrcke commented Jul 1, 2025

Closing this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants