iam: application of role overflow policy can interrupt service availability #35611

@rix0rrr

Describe the bug

Roles generate overflow policies when their policy document exceeds 10k bytes.

This means that some statements get split off into managed policies, which are then attached to the role.

The ordering is not necessarily deterministic, which means that a statement that gets moved between policies may be absent from the role for a number of seconds during a deployment, after which the role recovers.

This means there is a potential for a service interruption whenever such a change deploys.
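
The shifting described above can be illustrated with a simplified model. This is only a sketch: the `Statement` shape, the byte sizes, the limit of 10, and the greedy in-order split are all assumptions made for illustration, not the splitting algorithm CDK actually uses. It shows how adding a single statement can change which policy several unrelated statements land in.

```typescript
// Hypothetical model: a statement is an id plus its serialized size.
interface Statement {
  id: string;
  size: number;
}

// Greedy in-order split: start a new policy whenever the next
// statement would push the current one over the limit.
function splitIntoPolicies(statements: Statement[], limit: number): string[][] {
  const policies: string[][] = [[]];
  let used = 0;
  for (const s of statements) {
    const current = policies[policies.length - 1];
    if (used + s.size > limit && current.length > 0) {
      policies.push([]);
      used = 0;
    }
    policies[policies.length - 1].push(s.id);
    used += s.size;
  }
  return policies;
}

// Map each statement id to the index of the policy that holds it.
function assignment(policies: string[][]): Record<string, number> {
  const result: Record<string, number> = {};
  policies.forEach((ids, index) => ids.forEach((id) => { result[id] = index; }));
  return result;
}

// Statements present in both layouts whose policy index changed.
// These are the ones at risk of being briefly absent during an update.
function movedStatements(before: string[][], after: string[][]): string[] {
  const a = assignment(before);
  const b = assignment(after);
  return Object.keys(a).filter((id) => id in b && a[id] !== b[id]);
}

const mk = (id: string): Statement => ({ id, size: 4 });

const layoutBefore = splitIntoPolicies([mk('A'), mk('B'), mk('C'), mk('D')], 10);
const layoutAfter = splitIntoPolicies([mk('N'), mk('A'), mk('B'), mk('C'), mk('D')], 10);

console.log(layoutBefore);                               // [["A","B"],["C","D"]]
console.log(layoutAfter);                                // [["N","A"],["B","C"],["D"]]
console.log(movedStatements(layoutBefore, layoutAfter)); // ["B","D"]
```

In this model, inserting `N` moves `B` and `D` into different policies even though neither statement changed. If the old policy is updated before the new one, those statements are missing from the role in between.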

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Library Version

No response

Expected Behavior

No service interruption.

Current Behavior

Potential for service interruption, depending on how statements happen to be reassigned between policies in a given deployment.

Reproduction Steps

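Have a role whose policy is large enough to overflow into managed policies, then deploy a change that causes statements to move between those policies.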

Possible Solution

  • The best solution would be to add new policies before deleting old ones. This means we need to model every statement as its own policy, so that CloudFormation's resource replacement strategy (create the new resource first, delete the old one afterwards) can take care of the ordering.
    • An additional benefit is that any type of policy update, for example due to Lambda versioning and narrow permissions, no longer suffers from interruptions.
    • A downside is that in the limit, if the entire policy needs to be replaced, this effectively halves the amount of policy you can attach to a single role (because both parts of the effective policy need to be present on the role at the same time, and since the role can only have 40k policy in total, this means effective policy can be at most 20k -- and that doesn't even account for cutting losses).
  • An alternative solution would be to disable policy overflow altogether, to not get into the situation that causes problems like this.
  • We can track allocations of statements -> policies in some additional JSON files that users have to keep in their code repositories, and use that to keep a stable assignment that doesn't suffer from shifting.
  • A custom resource could do statement shifting cleverly without taking up too much policy space... but custom resources are annoying to build and maintain, customers don't like them, and this would be a high risk/high scrutiny custom resource.
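
The tracked-allocation option in the list above could look roughly like the following. This is a minimal sketch under stated assumptions: the `Statement` and `Allocation` types, the `allocateStable` function, and the sizes and limit are all hypothetical, and reading and writing the JSON file is left out. The idea is that a statement that already has a recorded policy stays there, and only new statements are placed into whatever room is left.

```typescript
// Hypothetical model: a statement is an id plus its serialized size.
interface Statement {
  id: string;
  size: number;
}

// Statement id -> policy index. In this sketch, this object is what
// would be serialized to the JSON file kept in the user's repository.
type Allocation = Record<string, number>;

function allocateStable(
  statements: Statement[],
  previous: Allocation,
  limit: number,
): Allocation {
  const next: Allocation = {};
  const used: number[] = []; // bytes used per policy index
  const pending: Statement[] = [];

  // First pass: keep previously allocated statements where they were,
  // as long as they still fit.
  for (const s of statements) {
    if (s.size > limit) {
      throw new Error(`Statement ${s.id} does not fit in any policy`);
    }
    const slot = previous[s.id];
    if (slot !== undefined && (used[slot] ?? 0) + s.size <= limit) {
      used[slot] = (used[slot] ?? 0) + s.size;
      next[s.id] = slot;
    } else {
      pending.push(s);
    }
  }

  // Second pass: place new (or no longer fitting) statements into the
  // first policy that has room, creating a new one if needed.
  for (const s of pending) {
    let slot = 0;
    while ((used[slot] ?? 0) + s.size > limit) {
      slot++;
    }
    used[slot] = (used[slot] ?? 0) + s.size;
    next[s.id] = slot;
  }
  return next;
}

const recorded: Allocation = { A: 0, B: 0, C: 1, D: 1 };
const current: Statement[] = ['N', 'A', 'B', 'C', 'D'].map((id) => ({ id, size: 4 }));

const updated = allocateStable(current, recorded, 10);
console.log(JSON.stringify(updated)); // {"A":0,"B":0,"C":1,"D":1,"N":2}
```

Here the new statement `N` goes into a fresh policy and none of the existing statements move. The trade-off is looser packing over time, since space freed by removed statements is only reused when a later statement happens to fit.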

Additional Information/Context

No response

AWS CDK Library version (aws-cdk-lib)

AWS CDK CLI version

Node.js Version

OS

Language

TypeScript

Language Version

No response

Other information

No response

Metadata

    Labels

    @aws-cdk/aws-iam (Related to AWS Identity and Access Management), effort/medium (Medium work item – several days of effort), p1
