Conversation

@xdu31
Contributor

@xdu31 xdu31 commented Jan 7, 2025

Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls

Issue #, if available: #34

Description of changes:

Dedupe concurrent active requests on the same service account token: later requests wait for the first one to finish and populate the cache, rather than all of them calling the EKS Auth service and writing to the in-memory cache at the same time.

This is achieved with an LRU cache, internalActiveRequestCache. When a serviceAccountToken is present in internalActiveRequestCache, it means there are multiple concurrent active requests for the same serviceAccountToken; in other words, the other requests can wait briefly on the first one.
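For illustration, a rough sketch of that wait path (the loop shape and the credentialsFromEntry helper are assumptions for readability; the cache methods and constant names are the ones used in this PR):

// Sketch only: if another request for the same token is already active,
// briefly poll the credential cache instead of calling EKS Auth again.
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
    for i := 0; i < defaultActiveRequestRetries; i++ {
        time.Sleep(defaultActiveRequestWaitTime)
        if val, ok := r.internalCache.Get(request.ServiceAccountToken); ok {
            return credentialsFromEntry(val) // hypothetical helper: reuse what the first request cached
        }
    }
}
// No active request (or the wait ran out): fall through and call EKS Auth ourselves.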

Pause calls to the EKS Auth service for 1 sec when the account is being throttled, so we don't flood the EKS Auth service with requests and put further stress on the system.

This is achieved with a cache, internalThrottledRequestCache. When account throttling happens, the default key is added to internalThrottledRequestCache with a 1 sec expiration.
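And a sketch of the throttling pause; only the cache, key, and message names are from this PR, the surrounding control flow is assumed:

// Sketch only: if the account was throttled within the last second, don't call
// EKS Auth at all; surface a throttling error instead.
if _, throttled := r.internalThrottledRequestCache.Get(defaultThrottlingKey); throttled {
    return nil, nil, fmt.Errorf("%s", defaultThrottlingMsg)
}
// ...and on a 429 from EKS Auth, the key is stored with a 1 sec expiration:
// r.internalThrottledRequestCache.SetWithExpire(defaultThrottlingKey, true, 1*time.Second)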

Unit tests are added to cover the changes.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@xdu31 xdu31 requested a review from a team as a code owner January 7, 2025 14:09
defaultThrottlingKey = "default-throttling-key"
defaultThrottlingMsg = "account throttling"
defaultThrottledRequestCacheSize = 10
defaultActiveRequestRetries = 1
Member


Why only 1 retry per request?

Contributor Author

@xdu31 xdu31 Jan 8, 2025


It is a combination of wait time and number of retries: defaultActiveRequestRetries * 1 sec = 1 sec, which is the default timeout for the AWS credential provider: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-credential-providers/ (search for "a default value of 10001 (one second)").

Actually, it would be better if it were a bit shy of 1 sec.

Contributor Author


I took a look at the metrics of the EKS Auth service: most requests complete in under 200ms. With the 1s default credential provider timeout, it makes more sense to retry 4 times with a 200ms wait each time. I made a commit for this.
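Roughly, the budget then lines up like this (constant names as used in this PR, values as described above):

// 4 retries * 200ms wait = 800ms, which stays under the ~1s default
// credential provider timeout on the application side.
defaultActiveRequestRetries  = 4
defaultActiveRequestWaitTime = 200 * time.Millisecond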

Member


Can we add jitter and backoff, rather than retrying at a constant interval?
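For context, a self-contained sketch of what exponential backoff with full jitter could look like for this wait loop (illustrative only, not code from this PR):

package main

import (
    "math/rand"
    "time"
)

// backoffWithJitter sleeps for a random duration between 0 and the current
// backoff ceiling, which doubles each attempt and is capped at max.
func backoffWithJitter(attempt int, base, max time.Duration) {
    ceiling := base << attempt
    if ceiling > max {
        ceiling = max
    }
    time.Sleep(time.Duration(rand.Int63n(int64(ceiling) + 1)))
}

func main() {
    for attempt := 0; attempt < 4; attempt++ {
        // ...check the credential cache here; break if the first request finished...
        backoffWithJitter(attempt, 200*time.Millisecond, time.Second)
    }
}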

// Check if the request is in the cache, if it is, return it
if val, ok := r.internalCache.Get(request.ServiceAccountToken); ok {
    if _, withinTtl := r.credentialsInEntryWithinValidTtl(val); withinTtl {
        log.WithField("cache-hit", 1).Tracef("Using cached credentials")
Member


Do we want to log the caller too? Do we have a pod IP we can log for differentiation?

Contributor Author


This has been handled in

if err != nil {
    _, statusCode := errors.HandleCredentialFetchingError(ctx, err)
    if statusCode == http.StatusTooManyRequests {
        r.internalThrottledRequestCache.SetWithExpire(defaultThrottlingKey, true, 1*time.Second)
Member


why hardcode to 1 second?

Contributor Author

@xdu31 xdu31 Jan 8, 2025


This is so the account quota can reset in the next second, per the throttling rule.

r.internalCache.Delete(request.ServiceAccountToken)
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
    // Wait 1s for active request to finish caching into internalCache
    time.Sleep(1 * time.Second)
Member


Why hardcode to 1 second? Don't we want some backoff/jitter if we ever retry more than once?

Contributor Author


@haoranleo
Contributor

For the CI failure, let's explicitly add the write permission to the job so the forked PR can also have write access:

jobs:
  test:
    ...
    permissions:
      contents: write
    steps:
      ...

e.g. https://github.com/kubernetes-sigs/aws-iam-authenticator/blob/master/.github/workflows/create-release.yml#L16


log.Info("Identified that entry in cache contains credentials with small ttl or invalid ttl, will be deleted")
r.internalCache.Delete(request.ServiceAccountToken)
if _, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {


In general this isn't using concurrent map operations, so there is still a chance that duplicate downstream calls are made depending on the order of operations. I feel we can do better to fully eliminate that possibility, which is of course the point of this change.

We need the equivalent of load-or-store semantics (not sure if that's offered by the expiring cache or not). If this thread successfully stores the value, then it proceeds with making the downstream call. Otherwise (i.e. this thread loaded the existing value), it should wait. To wait, it would make sense to me to have the value be something like a wait group or some other synchronization primitive that the waiting thread can be notified through when the outstanding request completes.

Here would be the pseudocode for what I am suggesting:

wg := &sync.WaitGroup{}
wg.Add(1)
setWg, stored := r.internalActiveRequestCache.LoadOrStore(request.ServiceAccountToken, wg)
if !stored {
    // Another request is outstanding; wait for it to finish
    setWg.Wait()
    val := internalCache.Get(request.ServiceAccountToken)
    // missing: adding checks in case the outstanding request failed and the val wasn't properly set
    return val
}
// Otherwise, we were the ones who successfully stored the value, so we are responsible for making the
// delegate call
defer func() {
    wg.Done()
    internalActiveRequestCache.Delete(request.ServiceAccountToken)
}()
iamCredentials, metadata, err := r.delegate.GetIamCredentials(ctx, request)
...

Something like this.

One downside of this particular approach is that there is no timeout on waiting for the outstanding request to finish. So if there is an issue there, then the thread would be waiting forever.
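One way to bound that wait would be to store a channel instead of a wait group and select on it with a timeout; a sketch, assuming a sync.Map-backed activeRequests rather than the expiring cache:

// Sketch: the first request stores a channel and closes it once it has
// finished caching credentials; later requests wait on it with a timeout.
done := make(chan struct{})
existing, loaded := activeRequests.LoadOrStore(request.ServiceAccountToken, done)
if loaded {
    select {
    case <-existing.(chan struct{}):
        // outstanding request finished; read from the credential cache
    case <-time.After(time.Second):
        // outstanding request took too long; fall through and call EKS Auth too
    }
} else {
    defer func() {
        close(done)
        activeRequests.Delete(request.ServiceAccountToken)
    }()
    // ...make the delegate call to EKS Auth here...
}

The time.After arm keeps the waiter from blocking forever if the outstanding request hangs, at the cost of a hardcoded deadline.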

Contributor Author

@xdu31 xdu31 Jan 8, 2025


The v1/credentials endpoint of eks-pod-identity-agent is served as the AWS_CONTAINER_CREDENTIALS_FULL_URI of the application Pod. As mentioned above in #39 (comment), there is a default 1 sec timeout in the container credential provider (search for "a default value of `10001` (one second) is used" in the link).
Thus we wouldn't want requests to wait indefinitely for the previous one; if the previous request has not finished within 1 sec, it is fine for them to go ahead, reach out to the Auth service themselves, and cache the result.

In the unit test, I tested with 32 concurrent requests, and used 16 in the commit (search for numRequests = 16).

Contributor Author


In terms of thread safety, I think you are suggesting sync.Map? The expiring cache uses a mutex to handle concurrency; see all the operations it implements in https://github.com/aws/eks-pod-identity-agent/blob/main/internal/cache/expiring/cache.go
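For reference, a minimal, self-contained example of sync.Map's LoadOrStore semantics (unrelated to this repo's expiring cache):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var m sync.Map
    // First caller stores its value; loaded is false, so it would make the downstream call.
    v, loaded := m.LoadOrStore("service-account-token", "request-1")
    fmt.Println(v, loaded) // request-1 false
    // A concurrent second caller gets the existing value back; loaded is true, so it would wait.
    v, loaded = m.LoadOrStore("service-account-token", "request-2")
    fmt.Println(v, loaded) // request-1 true
}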



it looks like Modify could support Load-or-Store semantics

@xdu31
Contributor Author

xdu31 commented Jan 8, 2025

For the CI failure, let's explicitly add the write permission to the job so the forked PR can also have write access:

jobs:
  test:
    ...
    permissions:
      contents: write
    steps:
      ...

e.g. https://github.com/kubernetes-sigs/aws-iam-authenticator/blob/master/.github/workflows/create-release.yml#L16

Fixed

@xdu31 xdu31 requested review from micahhausler and sanjams2 January 8, 2025 12:42
defaultThrottlingMsg = "account throttling"
defaultThrottledRequestCacheSize = 10
// default timeout for AWS credential provider is 1 sec:
// https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-credential-providers/
Member


Can we look at the Go SDK for consistency?

Contributor Author

@xdu31 xdu31 Jan 9, 2025

@xdu31 xdu31 requested a review from micahhausler January 9, 2025 12:55

@sanjams2 sanjams2 left a comment


QQ: so if the active request retries fail, then we will make a request, is that right?

If so, that makes me somewhat nervous. In general, the idea of 'constant work' is something that we should strive to employ. This means that code acts as consistently as it can in error scenarios or happy scenarios. Colm has a good blog about this here: https://aws.amazon.com/builders-library/reliability-and-constant-work/. In this case, if there is an issue in EKS Auth or anywhere in between and the request fails, we will start making more requests to EKS Auth. Imagine a scenario where EKS Auth is browning out and not responding on time - this code would actually potentially INCREASE the volume of requests to EKS Auth and compound the problem. Or what if we are getting throttled? That's something in general you want to avoid.

If the active request failed, should we not just fail ourselves? Or we can alternatively see about retrying but only having one retry at a time (as it stands now, all the waiting requests would fail)?

@xdu31
Contributor Author

xdu31 commented Jan 10, 2025

QQ: so if the active request retries fail, then we will make a request, is that right?

If so, that makes me somewhat nervous. In general, the idea of 'constant work' is something that we should strive to employ. This means that code acts as consistently as it can in error scenarios or happy scenarios. Colm has a good blog about this here: https://aws.amazon.com/builders-library/reliability-and-constant-work/. In this case, if there is an issue in EKS Auth or anywhere in between and the request fails, we will start making more requests to EKS Auth. Imagine a scenario where EKS Auth is browning out and not responding on time - this code would actually potentially INCREASE the volume of requests to EKS Auth and compound the problem. Or what if we are getting throttled? That's something in general you want to avoid.

If the active request failed, should we not just fail ourselves? Or we can alternatively see about retrying but only having one retry at a time (as it stands now, all the waiting requests would fail)?

Good suggestion. I have made another commit so that, in case of error, the other requests return the error stored in the active request cache.

if errActiveRequest, ok := r.internalActiveRequestCache.Get(request.ServiceAccountToken); ok {
    // if there is an error from the active request, return the error
    if errActiveRequest != nil {
        log.Errorf("Failed the request with error from the same active request: %v\n", errActiveRequest)
Member


non-blocking nit: log lines shouldn't need a newline suffix

Comment on lines +204 to +210
defer func() {
    if err == nil {
        r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
    } else {
        r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
    }
}()
Member


non-blocking nit: for a deferred func, you probably want to explicitly pass the error to the function

Suggested change
defer func() {
    if err == nil {
        r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
    } else {
        r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
    }
}()
defer func(err error) {
    if err == nil {
        r.internalActiveRequestCache.Delete(request.ServiceAccountToken)
    } else {
        r.internalActiveRequestCache.ReplaceWithExpire(request.ServiceAccountToken, err, defaultActiveRequestInterval)
    }
}(err)

Contributor Author


Hi Micah, I will follow up on these two. I merged the change because this is actually covered by a unit test; we are able to verify that the error is stored in the memory cache.

@xdu31 xdu31 merged commit 984147e into aws:main Jan 10, 2025
1 check passed
taraspos pushed a commit to gravitational/eks-pod-identity-agent that referenced this pull request Mar 21, 2025
…nimize EKS Auth service calls (aws#39)

* Dedupe concurrent active requests & pause on account throttling to minimize EKS Auth service calls

* defaultActiveRequestRetries and defaultActiveRequestWaitTime

* Fix CI workflow

* Return from internalActiveRequestCache for other requests in case of error