Pod Identity Association also subject to cache race condition? #264

@joshfrench

What happened:
Similar to #174, but specific to pod identity associations: the expected AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE env var is absent when a pod identity association, service account, and pod are created within a short window. Typically we see something like this:

  • Programmatically create a pod identity association, a service account annotated with the IAM role, and a pod in quick succession
  • The pod comes up, but AWS operations fail with: An error occurred (InvalidIdentityToken) when calling the AssumeRoleWithWebIdentity operation: No OpenIDConnect provider found in your account for https://oidc.eks...
  • Examining the pod env, note that AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE is missing but AWS_WEB_IDENTITY_TOKEN_FILE is set
  • Restart pod
  • Note that AWS_WEB_IDENTITY_TOKEN_FILE is now replaced with AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE and pod operates as expected.
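
To make the env check in the steps above repeatable, here is a minimal sketch of a helper that reports which credential mutation a pod received. The function name classify_mutation and its output labels are illustrative only and not part of the webhook or any AWS tooling; it simply reads env var names from stdin, one per line.

```shell
# Classify which credential mutation a pod received, based on the env var
# names read from stdin (one per line). The labels printed here are
# illustrative, not anything the webhook itself emits.
classify_mutation() {
  names="$(cat)"
  if printf '%s\n' "$names" | grep -qx 'AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE'; then
    echo "pod-identity"
  elif printf '%s\n' "$names" | grep -qx 'AWS_WEB_IDENTITY_TOKEN_FILE'; then
    echo "irsa"
  else
    echo "none"
  fi
}

# Example usage against a pod named test-pod (requires cluster access):
# kubectl get pod test-pod -n default \
#   -o jsonpath='{range .spec.containers[*].env[*]}{.name}{"\n"}{end}' \
#   | classify_mutation
```

In the failure case described above, this would be expected to print irsa on the first pod and pod-identity after the restart.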

What you expected to happen:
If all the prerequisites are satisfied, pods should get the correct pod identity association mutation regardless of timing.

How to reproduce it (as minimally and precisely as possible):

  1. Create an IAM role with the correct pod identity association trust policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:TagSession",
                "sts:AssumeRole"
            ]
        }
    ]
}
  2. Create an EKS cluster, enabling the EKS Pod Identity Agent add-on.
  3. Run aws eks update-kubeconfig --name my-cluster
  4. Run:
$ aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account test-sa \
  --role-arn arn:aws:iam::111111111111:role/test-role && \
sleep 0.75s && \
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: default
  name: test-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/test-role
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: default
spec:
  serviceAccountName: test-sa
  containers:
    - name: test
      image: amazon/aws-cli:latest
      imagePullPolicy: IfNotPresent
      command:
        - aws
        - sts
        - get-caller-identity
EOF

When waiting ~750ms or less between creating the association and submitting the SA, I consistently get the incorrect AWS_WEB_IDENTITY_TOKEN_FILE mutation. Waiting ~1s or more seems to be reliably sufficient to get the correct mutation.
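
Until the webhook handles this timing itself, a client-side workaround might be to retry rather than rely on a fixed sleep. The following is only a sketch of a generic retry helper; retry_until and the recreate_and_check placeholder in the comments are hypothetical names, and this is not a fix for the underlying cache behavior.

```shell
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 otherwise.
retry_until() {
  max="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt remains.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical usage: recreate_and_check is a placeholder for a function that
# deletes and re-applies the pod, then returns 0 only if the pod spec
# contains AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE.
# retry_until 5 2 recreate_and_check
```

The delay and attempt count are arbitrary; based on the timings observed above, a delay of a second or more between attempts would likely be enough.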

Anything else we need to know?:
I'm wondering if something like #236 and/or #252 should be applied to the FileConfig, to allow the cache some time to catch up or to provide a fallback in case of cache miss. The scenario in which a serviceaccount and a pod are created in a short timeframe is common with CI/CD and infrastructure-as-code.

Environment:

  • AWS Region: us-east-2
  • EKS Platform version: eks.6
  • Kubernetes version: 1.32
  • Webhook Version: ¯\_(ツ)_/¯ whatever EKS is running under the hood