-
Notifications
You must be signed in to change notification settings - Fork 98
Identify permanent errors upon reading checkpoint #375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
| // may be retryable. For now, handle an explicit set of errors as | ||
| // permanent, and treat every other error as retryable. | ||
|
|
||
| if errors.Is(err, &json.UnmarshalTypeError{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: there is no overarching error type representing any error during Unmarshal().
As of https://pkg.go.dev/encoding/json#UnmarshalTypeError, Unmarshal() can actually return three different types of error:
Unmarshal returns an InvalidUnmarshalError.
Unmarshal returns a SyntaxError.
Unmarshal returns an UnmarshalTypeError
SyntaxError means "bad JSON", we also want to treat that as permanent.
What would be a concise way to test "is this error type any of these...?"
The kubelet will definitely make us retry. There is currently no way to distinguish between transient / permanent errors passed back to the kubelet. Having the notion in our driver is there to future proof us once the kubelet does support this / not keep us looping in the driver unnecessarily on a permanent error -- we return immediately back to the kubelet who has a longer timeout than we in the driver do. |
|
Unless we are absolutely certain about which one it should be, I honestly wouldn't worry about the distinction between transient / permanent errors at this granularity yet. We can slowly add them in preparation for the kubelet even supporting them at all. |
👍
I see. So for now, a false-positive permanent error isn't the end of the world yet. But it will be a rather painful thing in the future.
Yes, so that's the motivation here. Screenshot of a log bomb here, for the record (note the timestamps, too): That's what we should address (elsewhere, we noted that we probably want to tune the retry parameters, too).
We are certain that |
|
Closing this for now. |
A tiny patch to collect feedback on an idea before I continue.
Currently, an unexpected JSON document in the checkpoint file may result in lots of retrying. Logs will show the following type of message repeatedly:
(reformatted to be less wide)
The concept of
permanentErroris in the code base already. Here, we can make use of it.The complete set of possible errors that can be returned by
GetCheckpoint()is ~unknown (it could be known, but it is not small: there's I/O, there's JSON decoding, there's checksum verification, and maybe more).Current assumption: it is worse to not retry a retryable error than it is to retry a permanent error. That assumption clearly wants to be tested. Thoughts? The kubelet might make us retry something that we have previously declared as "permanent error" anyway, right?
Maybe one obvious question: should we treat any error returned by
GetCheckpoint()as permanent? Currently, I think the answer should be "no" because i) I/O is involved and -- even if rare -- I/O may retryably fail, and ii) I just don't know at this point if there are other retryable errors happening here, and I better assume so.Another question: do we want to distinguish different permanent errors (so that we can generate a dedicated error message prefix for each), or do we want to honor them with a single code path and err msg prefix? I don't care too much as of now. I have picked one option for now and appreciate feedback.
Should we use
errors.Is()orerrors.As()here? I thinkIs()is safe because I didn't see any wrapping. But maybe we want to useAs()to be on the safe side.Noteworthy: there are four locations in code where we call
GetCheckpoint()(prepare+unprepare, gpu+CD). I'd like to work towards a common implementation for all places, and appreciate feedback towards that (should we start a new "library", consumed by both plugins)?