Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls #39
Conversation
defaultThrottlingKey = "default-throttling-key"
defaultThrottlingMsg = "account throttling"
defaultThrottledRequestCacheSize = 10
defaultActiveRequestRetries = 1
Why only 1 retry per request?
It is a combination of wait time & number of retries: defaultActiveRequestRetries * 1 sec = 1 sec, which is the default timeout for the AWS credential provider: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-credential-providers/ (search "a default value of 10001 (one second)").
Actually, it would be better if it were a bit shy of 1 sec.
I took a look at the metrics of the EKS Auth service: most requests are under 200ms, and with the 1s default credential provider timeout, it makes more sense to retry 4 times with a 200ms wait each time. I made a commit for this.
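For illustration, the tuned constants would look roughly like this (defaultActiveRequestWaitTime appears in the later commit message; the package name here is an assumption, not the PR's actual file):

```go
// Package name is illustrative only.
package handler

import "time"

// Retry tuning discussed above: the worst-case wait (4 * 200ms = 800ms) stays just
// under the ~1 s container credential provider timeout.
const (
	defaultActiveRequestRetries  = 4
	defaultActiveRequestWaitTime = 200 * time.Millisecond
)
```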
Can we add jitter and backoff instead of constant repetition on retries?
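For illustration, a retry loop with exponential backoff and full jitter might look like this (function and parameter names are hypothetical, not from this PR):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter sketches exponential backoff with full jitter: each failed attempt
// doubles the backoff ceiling, and the actual sleep is a random value below that ceiling.
func retryWithJitter(fetch func() error, maxRetries int, baseDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err = fetch(); err == nil {
			return nil
		}
		backoff := baseDelay << attempt                         // base, 2*base, 4*base, ...
		sleep := time.Duration(rand.Int63n(int64(backoff) + 1)) // full jitter in [0, backoff]
		time.Sleep(sleep)
	}
	return fmt.Errorf("all %d attempts failed: %w", maxRetries, err)
}

func main() {
	err := retryWithJitter(func() error { return errors.New("transient failure") }, 4, 200*time.Millisecond)
	fmt.Println(err)
}
```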
// Check if the request is in the cache, if it is, return it
if val, ok := r.internalCache.Get(request.ServiceAccountToken); ok {
	if _, withinTtl := r.credentialsInEntryWithinValidTtl(val); withinTtl {
		log.WithField("cache-hit", 1).Tracef("Using cached credentials")
Do we want to log the caller too? Do we have a pod IP we can log for differentiation?
This has been handled in
| "client-addr": r.RemoteAddr, |
if err != nil {
	_, statusCode := errors.HandleCredentialFetchingError(ctx, err)
	if statusCode == http.StatusTooManyRequests {
		r.internalThrottledRequestCache.SetWithExpire(defaultThrottlingKey, true, 1*time.Second)
why hardcode to 1 second?
This is so the account quota can reset in the next second, per the throttling rule.
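To make the 1-second pause concrete, here is a minimal, self-contained stand-in for that behavior (not the PR's expiring cache, which uses SetWithExpire as shown in the diff above): once tripped, callers skip the downstream call until the window passes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// throttleGate is a minimal stand-in for internalThrottledRequestCache: once tripped,
// callers skip the call to EKS Auth until the expiry window has passed.
type throttleGate struct {
	mu    sync.Mutex
	until time.Time
}

func (g *throttleGate) Trip(d time.Duration) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.until = time.Now().Add(d)
}

func (g *throttleGate) Throttled() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return time.Now().Before(g.until)
}

func main() {
	var gate throttleGate
	gate.Trip(1 * time.Second) // simulate receiving a 429 from EKS Auth
	fmt.Println("throttled now:", gate.Throttled())
	time.Sleep(1100 * time.Millisecond)
	fmt.Println("throttled after 1.1s:", gate.Throttled()) // the quota window has reset
}
```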
r.internalCache.Delete(request.ServiceAccountToken)
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
	// Wait 1s for active request to finish caching into internalCache
	time.Sleep(1 * time.Second)
Why hardcode to 1 second? Don't we want backoff/jitter if we ever retry more than once?
See #39 (comment) and #39 (comment)
For the CI failure, let's explicitly add the write permission to the job so the forked PR can also have write access.
log.Info("Identified that entry in cache contains credentials with small ttl or invalid ttl, will be deleted")
r.internalCache.Delete(request.ServiceAccountToken)
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
In general this isn't using concurrent map operations, so there is still a chance that duplicate downstream calls are made depending on the order of operations. I feel we can do better to fully eliminate that possibility, which is of course the point of this change.
We need the equivalent of load-or-store semantics (not sure if that's offered by the expiring cache or not). If this thread successfully stores the value, then it proceeds with making the downstream call. Otherwise (i.e. this thread loaded the existing value), it should wait. To wait, it would make sense to me to have the value be something like a wait group or some other synchronization primitive that the waiting thread can be notified through when the outstanding request completes.
Here is the pseudocode for what I am suggesting:
var wg sync.WaitGroup
wg.Add(1)
setWg, stored := r.internalActiveRequestCache.LoadOrStore(request.ServiceAccountToken, &wg)
if !stored {
	// Another request is outstanding; wait for it to finish
	setWg.Wait()
	val, _ := r.internalCache.Get(request.ServiceAccountToken)
	// missing: checks in case the outstanding request failed and val wasn't properly set
	return val
}
// Otherwise, we were the ones who successfully stored the value, so we are responsible
// for making the delegate call
defer func() {
	wg.Done()
	r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
}()
iamCredentials, metadata, err := r.delegate.GetIamCredentials(ctx, request)
...Something like this.
One downside of this particular approach is that there is no timeout on waiting for the outstanding request to finish. So if there is an issue there, then the thread would be waiting forever.
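For what it's worth, one common way to bound that wait is to have the in-flight request signal completion over a channel and to select against a timeout; a rough sketch, not code from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitWithTimeout waits for done to be closed by the in-flight request, but gives up
// after timeout so a stuck request cannot block followers forever.
func waitWithTimeout(done <-chan struct{}, timeout time.Duration) error {
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return errors.New("timed out waiting for the outstanding request")
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(300 * time.Millisecond) // simulated outstanding EKS Auth call
		close(done)
	}()
	fmt.Println(waitWithTimeout(done, 800*time.Millisecond)) // <nil>
}
```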
The v1/credentials endpoint of eks-pod-identity-agent is served as AWS_CONTAINER_CREDENTIALS_FULL_URI of the application Pod. As mentioned above in #39 (comment), there is a default 1 sec timeout in the container credential provider (search "a default value of `10001` (one second) is used" in the link).
Thus we wouldn't want requests to wait indefinitely for the previous one; if the previous request does not finish within 1 sec, it is fine for them to go ahead, also try to reach the Auth service, and cache the result.
In the unit test I tested with 32 concurrent requests, and use 16 in the commit (search for numRequests = 16).
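A simplified sketch of that "wait a bounded amount, then fall through and fetch directly" behavior (lookup and fetch are hypothetical stand-ins, not functions from this PR):

```go
package main

import (
	"fmt"
	"time"
)

// getCredentials polls the shared cache a few times while another request is in flight,
// then falls through and fetches directly, so the total wait stays under the ~1 s
// credential-provider timeout.
func getCredentials(lookup func() (string, bool), fetch func() (string, error),
	retries int, wait time.Duration) (string, error) {
	for i := 0; i < retries; i++ {
		if creds, ok := lookup(); ok {
			return creds, nil // the in-flight request finished and cached the result
		}
		time.Sleep(wait)
	}
	return fetch() // give up waiting and call EKS Auth ourselves
}

func main() {
	creds, err := getCredentials(
		func() (string, bool) { return "", false },           // cache miss every time
		func() (string, error) { return "fresh-creds", nil }, // direct call
		4, 200*time.Millisecond,
	)
	fmt.Println(creds, err)
}
```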
In terms of thread safety, I think you are suggesting using sync.Map? The expiring cache uses a mutex to handle concurrency; see all the operations it implements in https://github.com/aws/eks-pod-identity-agent/blob/main/internal/cache/expiring/cache.go
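For reference, sync.Map in the standard library provides exactly these load-or-store semantics (a generic illustration, not a drop-in replacement for the repo's expiring cache):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var active sync.Map // stand-in for an active-request registry keyed by service account token

	// The first caller stores the entry and becomes responsible for the downstream call.
	_, loaded := active.LoadOrStore("token-A", struct{}{})
	fmt.Println("first caller loaded an existing entry:", loaded) // false: it stored

	// A concurrent caller with the same key sees loaded == true and should wait instead.
	_, loaded = active.LoadOrStore("token-A", struct{}{})
	fmt.Println("second caller loaded an existing entry:", loaded) // true: a request is already in flight
}
```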
It looks like Modify could support load-or-store semantics.
Fixed
defaultThrottlingMsg = "account throttling"
defaultThrottledRequestCacheSize = 10
// default timeout for AWS credential provider is 1 sec:
// https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-credential-providers/
Can we look at the Go SDK for consistency?
An application Pod could have a Java runtime, Go runtime, Python, etc.
Do you have any suggestions on the values based on the Go & Java SDK default HTTP client configuration for the credential endpoint provider?
For Go SDK v2, there is no explicit read timeout in endpointcreds.
For Java, it seems to be 1 sec:
https://github.com/aws/aws-sdk-java/blob/9197e60ae6388e4b00da4cc0a444f161ad49aa6a/aws-java-sdk-core/src/main/java/com/amazonaws/internal/EC2MetadataClient.java#L95
QQ: so if the active request retries fail, then we will make a request, is that right?
If so, that makes me somewhat nervous. In general, the idea of 'constant work' is something we should strive to employ. This means the code acts as consistently as it can in error scenarios and happy scenarios. Colm has a good blog about this: https://aws.amazon.com/builders-library/reliability-and-constant-work/. In this case, if there is an issue in EKS Auth or anywhere in between and the request fails, we will start making more requests to EKS Auth. Imagine a scenario where EKS Auth is browning out and not responding on time: this code would actually potentially INCREASE the volume of requests to EKS Auth and compound the problem. Or what if we are getting throttled? That's something you generally want to avoid.
If the active request failed, should we not just fail ourselves? Or we can alternatively see about retrying, but only having one retry at a time (as it stands now, all the waiting requests would fail)?
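For context, the "single leader calls downstream, all waiters share its result or its error" pattern is what golang.org/x/sync/singleflight provides; a sketch of that idea (an external module, and not what this PR uses):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"

	"golang.org/x/sync/singleflight" // external module: golang.org/x/sync
)

func main() {
	var group singleflight.Group
	var downstreamCalls int64

	// Stand-in for the EKS Auth call; counts how many times it actually runs.
	fetch := func() (interface{}, error) {
		atomic.AddInt64(&downstreamCalls, 1)
		return "credentials", nil
	}

	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Concurrent callers with the same key share one in-flight call,
			// and all of them receive its result or its error.
			_, _, _ = group.Do("service-account-token", fetch)
		}()
	}
	wg.Wait()
	fmt.Println("downstream calls:", atomic.LoadInt64(&downstreamCalls)) // typically 1 for 16 concurrent requests
}
```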
Good suggestion. I have made another commit to return from the active request cache for other requests in case of error.
if errActiveRequest, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
	// if there is an error from the active request, return the error
	if errActiveRequest != nil {
		log.Errorf("Failed the request with error from the same active request: %v\n", errActiveRequest)
Non-blocking nit: log lines shouldn't need a newline suffix.
defer func() {
	if err == nil {
		r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
	} else {
		r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
	}
}()
Non-blocking nit: for a deferred func, you probably want to explicitly pass the error to the function.
Suggested change:

defer func() {
	if err == nil {
		r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
	} else {
		r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
	}
}()

to

defer func(err error) {
	if err == nil {
		r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
	} else {
		r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
	}
}(err)
Hi Micah, I will follow up on these two. I merged the change because this is actually covered by a unit test; we are able to verify that the error is stored in the memory cache.
Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls (aws#39)
* Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls
* defaultActiveRequestRetries and defaultActiveRequestWaitTime
* Fix CI workflow
* Return from internalActiveRequestCache for other requests in case of error
Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls
Issue #, if available: #34
Description of changes:
- Dedupe concurrent active requests: this is achieved by an LRU cache, `internalActiveRequestCache`. When a `serviceAccountToken` is present in `internalActiveRequestCache`, it means there are multiple concurrent active requests on the same `serviceAccountToken`; in other words, the other requests can wait a bit on the first one.
- Pause on account throttling: this is achieved by a cache, `internalThrottledRequestCache`. When account throttling happens, the default key is added into `internalThrottledRequestCache` with a 1 sec expiration.
- Unit tests are added to test the changes.
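Below is a minimal, self-contained sketch of the two mechanisms described above (dedupe via an active-request marker, pause via a throttle flag). The identifiers only mirror the description; this is not the actual implementation, which uses the repo's expiring caches.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// handler is a simplified model: plain maps guarded by a mutex stand in for the agent's caches.
type handler struct {
	mu        sync.Mutex
	active    map[string]bool   // internalActiveRequestCache: token -> request in flight
	creds     map[string]string // internalCache: token -> cached credentials
	throttled bool              // internalThrottledRequestCache: account throttled in the last second
}

func (h *handler) getCredentials(token string) (string, error) {
	h.mu.Lock()
	if h.throttled {
		h.mu.Unlock()
		return "", errors.New("account throttling: pausing calls to EKS Auth")
	}
	if h.active[token] {
		// Another request for the same token is in flight: wait briefly, then re-check the cache.
		h.mu.Unlock()
		time.Sleep(200 * time.Millisecond)
		h.mu.Lock()
		if c, ok := h.creds[token]; ok {
			h.mu.Unlock()
			return c, nil
		}
		// The in-flight request did not finish in time; fall through and fetch ourselves.
	}
	h.active[token] = true
	h.mu.Unlock()

	c := "credentials-for-" + token // stand-in for the call to EKS Auth

	h.mu.Lock()
	h.creds[token] = c
	delete(h.active, token)
	h.mu.Unlock()
	return c, nil
}

func main() {
	h := &handler{active: map[string]bool{}, creds: map[string]string{}}
	fmt.Println(h.getCredentials("my-service-account-token"))
}
```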
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.