Conversation

@xdu31
Contributor

@xdu31 xdu31 commented Jan 7, 2025

Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls

Issue #, if available: #34

Description of changes:

Dedupe concurrent active requests on the same service account token: later requests wait for the first one to finish and populate the cache, rather than all of them calling the EKS Auth service and writing to the in-memory cache at the same time.

This is achieved with an LRU cache, internalActiveRequestCache. When a serviceAccountToken is present in internalActiveRequestCache, it means there are multiple concurrent active requests for the same serviceAccountToken; in other words, the other requests can wait briefly on the first one.
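For illustration, a rough sketch of that wait path (the loop shape and the credentialsFromEntry helper are assumptions for readability; the cache methods and constant names are the ones used in this PR):

// Sketch only: if another request for the same token is already active,
// briefly poll the credential cache instead of calling EKS Auth again.
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
    for i := 0; i < defaultActiveRequestRetries; i++ {
        time.Sleep(defaultActiveRequestWaitTime)
        if val, ok := r.internalCache.Get(request.ServiceAccountToken); ok {
            return credentialsFromEntry(val) // hypothetical helper: reuse what the first request cached
        }
    }
}
// No active request (or the wait ran out): fall through and call EKS Auth ourselves.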

Pause calls to the EKS Auth service for 1 sec when the account is being throttled, so we don't flood the EKS Auth service with requests and put further stress on the system.

This is achieved with a cache, internalThrottledRequestCache. When account throttling happens, the default key is added to internalThrottledRequestCache with a 1 sec expiration.
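And a sketch of the throttling pause; only the cache, key, and message names are from this PR, the surrounding control flow is assumed:

// Sketch only: if the account was throttled within the last second, don't call
// EKS Auth at all; surface a throttling error instead.
if _, throttled := r.internalThrottledRequestCache.Get(defaultThrottlingKey); throttled {
    return nil, nil, fmt.Errorf("%s", defaultThrottlingMsg)
}
// ...and on a 429 from EKS Auth, the key is stored with a 1 sec expiration:
// r.internalThrottledRequestCache.SetWithExpire(defaultThrottlingKey, true, 1*time.Second)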

Unit tests are added to cover the changes.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@xdu31 xdu31 requested a review from a team as a code owner January 7, 2025 14:09
defaultThrottlingKey = "default-throttling-key"
defaultThrottlingMsg = "account throttling"
defaultThrottledRequestCacheSize = 10
defaultActiveRequestRetries = 1
Member


Why only 1 retry per request?

Contributor Author

@xdu31 xdu31 Jan 8, 2025


It is a combination of wait time and number of retries: defaultActiveRequestRetries * 1 sec = 1 sec, which is the default timeout for the AWS credential provider: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-credential-providers/ (search for "a default value of 10001 (one second)").

Actually, it would be better if it were a bit shy of 1 sec.

Contributor Author


I took a look at the metrics of the EKS Auth service: most requests complete in under 200ms. With the 1s default credential provider timeout, it makes more sense to retry 4 times with a 200ms wait each time. I made a commit for this.
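Roughly, the budget then lines up like this (constant names as used in this PR, values as described above):

// 4 retries * 200ms wait = 800ms, which stays under the ~1s default
// credential provider timeout on the application side.
defaultActiveRequestRetries  = 4
defaultActiveRequestWaitTime = 200 * time.Millisecond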

Member


Can we add jitter and backoff, rather than retrying at a constant interval?
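For context, a self-contained sketch of what exponential backoff with full jitter could look like for this wait loop (illustrative only, not code from this PR):

package main

import (
    "math/rand"
    "time"
)

// backoffWithJitter sleeps for a random duration between 0 and the current
// backoff ceiling, which doubles each attempt and is capped at max.
func backoffWithJitter(attempt int, base, max time.Duration) {
    ceiling := base << attempt
    if ceiling > max {
        ceiling = max
    }
    time.Sleep(time.Duration(rand.Int63n(int64(ceiling) + 1)))
}

func main() {
    for attempt := 0; attempt < 4; attempt++ {
        // ...check the credential cache here; break if the first request finished...
        backoffWithJitter(attempt, 200*time.Millisecond, time.Second)
    }
}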

// Check if the request is in the cache, if it is, return it
if val, ok := r.internalCache.Get(request.ServiceAccountToken); ok {
    if _, withinTtl := r.credentialsInEntryWithinValidTtl(val); withinTtl {
        log.WithField("cache-hit", 1).Tracef("Using cached credentials")
Member


Do we want to log the caller too? Do we have a pod IP we can log for differentiation?

Contributor Author


This has been handled in

if err != nil {
    _, statusCode := errors.HandleCredentialFetchingError(ctx, err)
    if statusCode == http.StatusTooManyRequests {
        r.internalThrottledRequestCache.SetWithExpire(defaultThrottlingKey, true, 1*time.Second)
Member


why hardcode to 1 second?

Contributor Author

@xdu31 xdu31 Jan 8, 2025


This is so the account quota can reset in the next second, per the throttling rule.

r.internalCache.Delete(request.ServiceAccountToken)
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
    // Wait 1s for active request to finish caching into internalCache
    time.Sleep(1 * time.Second)
Member


Why hardcode to 1 second? Don't we want some backoff/jitter if we ever retry more than once?

Contributor Author


@haoranleo
Contributor

For the CI failure, let's explicitly add the write permission to the job so the forked PR can also have write access:

jobs:
  test:
    ...
    permissions:
      contents: write
    steps:
      ...

e.g. https://github.com/kubernetes-sigs/aws-iam-authenticator/blob/master/.github/workflows/create-release.yml#L16


log.Info("Identified that entry in cache contains credentials with small ttl or invalid ttl, will be deleted")
r.internalCache.Delete(request.ServiceAccountToken)
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {


In general this isn't using concurrent map operations, so there is still a chance that duplicate downstream calls are made depending on the order of operations. I feel we can do better to fully eliminate that possibility, which is of course the point of this change.

We need the equivalent of load-or-store semantics (not sure if that's offered by the expiring cache or not). If this thread successfully stores the value, then it proceeds with making the downstream call. Otherwise (i.e. this thread loaded the existing value), it should wait. To wait, it would make sense to me to have the value be something like a wait group or some other synchronization primitive that the waiting thread can be notified through when the outstanding request completes.

Here would be the pseudocode for what I am suggesting:

wg := &sync.WaitGroup{}
wg.Add(1)
setWg, stored := r.internalActiveRequestCache.LoadOrStore(request.ServiceAccountToken, wg)
if !stored {
    // Another request is outstanding; wait for it to finish
    setWg.Wait()
    val := internalCache.Get(request.ServiceAccountToken)
    // missing: adding checks in case the outstanding request failed and the val wasn't properly set
    return val
}
// Otherwise, we were the ones who successfully stored the value, so we are responsible for making the
// delegate call
defer func() {
    wg.Done()
    internalActiveRequestCache.Delete(request.ServiceAccountToken)
}()
iamCredentials, metadata, err := r.delegate.GetIamCredentials(ctx, request)
...

Something like this.

One downside of this particular approach is that there is no timeout on waiting for the outstanding request to finish. So if there is an issue there, then the thread would be waiting forever.
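One way to bound that wait would be to store a channel instead of a wait group and select on it with a timeout; a sketch, assuming a sync.Map-backed activeRequests rather than the expiring cache:

// Sketch: the first request stores a channel and closes it once it has
// finished caching credentials; later requests wait on it with a timeout.
done := make(chan struct{})
existing, loaded := activeRequests.LoadOrStore(request.ServiceAccountToken, done)
if loaded {
    select {
    case <-existing.(chan struct{}):
        // outstanding request finished; read from the credential cache
    case <-time.After(time.Second):
        // outstanding request took too long; fall through and call EKS Auth too
    }
} else {
    defer func() {
        close(done)
        activeRequests.Delete(request.ServiceAccountToken)
    }()
    // ...make the delegate call to EKS Auth here...
}

The time.After arm keeps the waiter from blocking forever if the outstanding request hangs, at the cost of a hardcoded deadline.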

Contributor Author

@xdu31 xdu31 Jan 8, 2025


The v1/credentials endpoint of eks-pod-identity-agent is served as the AWS_CONTAINER_CREDENTIALS_FULL_URI of the application Pod. As mentioned above in #39 (comment), there is a default 1 sec timeout in the container credential provider (search for "a default value of `10001` (one second) is used" in the link).
Thus we wouldn't want requests to wait indefinitely for the previous one; if the previous request has not finished within 1 sec, it is fine for them to go ahead, reach out to the Auth service themselves, and cache the result.

In the unit test, I tested with 32 concurrent requests, and used 16 in the commit (search for numRequests = 16).

Contributor Author


In terms of thread safety, I think you are suggesting sync.Map? The expiring cache uses a mutex to handle concurrency; see all the operations it implements in https://github.com/aws/eks-pod-identity-agent/blob/main/internal/cache/expiring/cache.go
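For reference, a minimal, self-contained example of sync.Map's LoadOrStore semantics (unrelated to this repo's expiring cache):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var m sync.Map
    // First caller stores its value; loaded is false, so it would make the downstream call.
    v, loaded := m.LoadOrStore("service-account-token", "request-1")
    fmt.Println(v, loaded) // request-1 false
    // A concurrent second caller gets the existing value back; loaded is true, so it would wait.
    v, loaded = m.LoadOrStore("service-account-token", "request-2")
    fmt.Println(v, loaded) // request-1 true
}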



it looks like Modify could support Load-or-Store semantics

@xdu31
Contributor Author

xdu31 commented Jan 8, 2025

For the CI failure, let's explicitly add the write permission to the job so the forked PR can also have write access:

jobs:
  test:
    ...
    permissions:
      contents: write
    steps:
      ...

e.g. https://github.com/kubernetes-sigs/aws-iam-authenticator/blob/master/.github/workflows/create-release.yml#L16

Fixed

@xdu31 xdu31 requested review from micahhausler and sanjams2 January 8, 2025 12:42
defaultThrottlingMsg = "account throttling"
defaultThrottledRequestCacheSize = 10
// default timeout for AWS credential provider is 1 sec:
// https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-credential-providers/
Member


Can we look at the Go SDK for consistency?

Contributor Author

@xdu31 xdu31 Jan 9, 2025

@xdu31 xdu31 requested a review from micahhausler January 9, 2025 12:55

@sanjams2 sanjams2 left a comment


QQ: so if the active request retries fail, then we will make a request, is that right?

If so, that makes me somewhat nervous. In general, the idea of 'constant work' is something that we should strive to employ. This means that code acts as consistently as it can in error scenarios or happy scenarios. Colm has a good blog about this here: https://aws.amazon.com/builders-library/reliability-and-constant-work/. In this case, if there is an issue in EKS Auth or anywhere in between and the request fails, we will start making more requests to EKS Auth. Imagine a scenario where EKS Auth is browning out and not responding on time - this code would actually potentially INCREASE the volume of requests to EKS Auth and compound the problem. Or what if we are getting throttled? That's something in general you want to avoid.

If the active request failed, should we not just fail ourselves? Or we can alternatively see about retrying but only having one retry at a time (as it stands now, all the waiting requests would fail)?

@xdu31
Contributor Author

xdu31 commented Jan 10, 2025

QQ: so if the active request retries fail, then we will make a request, is that right?

If so, that makes me somewhat nervous. In general, the idea of 'constant work' is something that we should strive to employ. This means that code acts as consistently as it can in error scenarios or happy scenarios. Colm has a good blog about this here: https://aws.amazon.com/builders-library/reliability-and-constant-work/. In this case, if there is an issue in EKS Auth or anywhere in between and the request fails, we will start making more requests to EKS Auth. Imagine a scenario where EKS Auth is browning out and not responding on time - this code would actually potentially INCREASE the volume of requests to EKS Auth and compound the problem. Or what if we are getting throttled? That's something in general you want to avoid.

If the active request failed, should we not just fail ourselves? Or we can alternatively see about retrying but only having one retry at a time (as it stands now, all the waiting requests would fail)?

Good suggestion. I have made another commit so that, in case of error, the other requests return the error stored in the active request cache.

if errActiveRequest, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
    // if there is an error from the active request, return the error
    if errActiveRequest != nil {
        log.Errorf("Failed the request with error from the same active request: %v\n", errActiveRequest)
Member


non-blocking nit: log lines shouldn't need a newline suffix

Comment on lines +204 to +210
defer func() {
    if err == nil {
        r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
    } else {
        r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
    }
}()
Member


non-blocking nit: for a deferred func, you probably want to explicitly pass the error to the function

Suggested change
defer func() {
    if err == nil {
        r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
    } else {
        r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
    }
}()
defer func(err error) {
    if err == nil {
        r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
    } else {
        r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
    }
}(err)

Contributor Author


Hi Micah, I will follow up on these two. I merged the change because this is actually covered by a unit test; we are able to verify that the error is stored in the memory cache.

@xdu31 xdu31 merged commit 984147e into aws:main Jan 10, 2025
1 check passed
taraspos pushed a commit to gravitational/eks-pod-identity-agent that referenced this pull request Mar 21, 2025
…nimize EKS Auth service calls (aws#39)

* Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls

* defaultActiveRequestRetries and defaultActiveRequestWaitTime

* Fix CI workflow

* Return from internalActiveRequestCache for other requests in case of error