[Bug] security_group_name config fails in 0.12.1 — ec2:CreateTags error not caught during default SG creation #9466

@matt-harding

System Info

  • SkyPilot version: 0.12.1
  • Cloud: AWS
  • API server: Helm-deployed on EKS
  • Upgraded from: 0.10.3

Description

When using aws.security_group_name in ~/.sky/config.yaml to reuse a pre-existing security group, sky launch fails because SkyPilot's attempt to create the default security group (a deletion optimization) hits an ec2:CreateTags IAM error that isn't caught by the error handler.

This is a regression from 0.10.x. The documented minimal IAM policy is also affected — it grants ec2:CreateTags on instance/* only, not security-group/*.

Config

# ~/.sky/config.yaml
aws:
  security_group_name:
    - "crawl-*": my-existing-security-group

The security group my-existing-security-group exists in the target VPC in eu-west-2.

Task YAML

resources:
  cpus: 2
  memory: 4
  cloud: aws
  region: eu-west-2

Error

W config.py:775] Failed to create security group. Error:
  botocore.exceptions.ClientError: An error occurred (UnauthorizedOperation)
  when calling the CreateSecurityGroup operation: You are not authorized to
  perform this operation. User: arn:aws:iam::XXXX:user/system/skypilot-user
  is not authorized to perform: ec2:CreateTags on resource:
  arn:aws:ec2:eu-west-2:XXXX:security-group/*

sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones

Root Cause

In sky/provision/aws/config.py around line 114, when a custom security_group_name is configured:

  1. The custom SG (my-existing-security-group) is found successfully; the provisioning log confirms "GroupName": "my-existing-security-group" in the provider config
  2. Because expected_sg_name != DEFAULT_SECURITY_GROUP_NAME, SkyPilot also tries to create/find a default SG (sky-sg-{user}-{hash}) as a deletion optimization
  3. The default SG doesn't exist, so _get_or_create_vpc_security_group tries to create it
  4. CreateSecurityGroup now includes TagSpecifications (added in the 0.12.x VPC refactor), so AWS evaluates ec2:CreateTags first and denies it
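Step 4 can be illustrated without calling AWS. The sketch below uses a hypothetical helper, required_iam_actions (not part of SkyPilot or boto3), that lists the IAM actions a CreateSecurityGroup request needs. The request keys mirror the keyword arguments of boto3's create_security_group; the group name, description, and VPC ID are made-up values.

```python
from typing import Any, Dict, List


def required_iam_actions(request: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: IAM actions needed by a CreateSecurityGroup request."""
    actions = ['ec2:CreateSecurityGroup']
    # Tag-on-create is authorized as a separate ec2:CreateTags action.
    if request.get('TagSpecifications'):
        actions.append('ec2:CreateTags')
    return actions


# Request without tags (illustrative values only).
untagged = {
    'GroupName': 'sky-sg-example',
    'Description': 'default SG',
    'VpcId': 'vpc-123',
}

# Same request with TagSpecifications attached.
tagged = dict(untagged, TagSpecifications=[{
    'ResourceType': 'security-group',
    'Tags': [{'Key': 'Name', 'Value': 'sky-sg-example'}],
}])

print(required_iam_actions(untagged))
print(required_iam_actions(tagged))
```

A policy that only allows ec2:CreateSecurityGroup satisfies the first request but not the second, which matches the denial shown in the Error section.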

The try/except at line 126 was intended to handle this gracefully:

except exceptions.NoClusterLaunchedError as e:
    if 'not authorized to perform: ec2:CreateSecurityGroup' in str(e):
        pass  # intended to silently ignore
    else:
        raise e  # ← BUG: falls through here

The string check looks for ec2:CreateSecurityGroup but the actual error is ec2:CreateTags. The exception falls through to raise e.
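The mismatch can be reproduced with plain string checks. This is a minimal sketch that uses an abridged copy of the message from the Error section; it does not import SkyPilot, and the variable names are illustrative.

```python
# Abridged copy of the error message reported above.
error_message = (
    'An error occurred (UnauthorizedOperation) when calling the '
    'CreateSecurityGroup operation: You are not authorized to perform this '
    'operation. User: arn:aws:iam::XXXX:user/system/skypilot-user is not '
    'authorized to perform: ec2:CreateTags on resource: '
    'arn:aws:ec2:eu-west-2:XXXX:security-group/*')

# The substring the existing handler looks for.
current_check = (
    'not authorized to perform: ec2:CreateSecurityGroup' in error_message)

# The error code that is present regardless of which action was denied.
error_code_check = 'UnauthorizedOperation' in error_message

print(current_check)     # False
print(error_code_check)  # True
```

The operation name (CreateSecurityGroup) appears in the message, but the denied action is ec2:CreateTags, so the existing substring never matches.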

Two Issues

1. Error handler string mismatch (code bug)

The catch block at config.py:127 checks for ec2:CreateSecurityGroup but AWS returns ec2:CreateTags because CreateSecurityGroup now includes TagSpecifications.

2. Minimal IAM policy is incompatible with 0.12.1 (docs bug)

The documented minimal IAM policy grants ec2:CreateTags only on instance/*:

{
    "Action": ["ec2:CreateTags", "ec2:DeleteTags", ...],
    "Resource": "arn:aws:ec2:*:<account-ID>:instance/*"
}

But CreateSecurityGroup with TagSpecifications requires ec2:CreateTags on security-group/*. This means any user following the minimal IAM docs will hit this error on first launch (not just users of security_group_name).
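As a rough sketch of the missing permission, the snippet below builds one extra policy statement and prints it as JSON. The statement is an assumption about what the docs could add, not the project's published policy; it covers only ec2:CreateTags because that is the action denied in the error above, and <account-ID> is the same placeholder the documented policy uses.

```python
import json

# Placeholder kept from the documented policy; substitute a real account ID.
ACCOUNT_ID = '<account-ID>'

# Hypothetical additional statement allowing tags on security groups.
security_group_tag_statement = {
    'Effect': 'Allow',
    'Action': ['ec2:CreateTags'],
    'Resource': f'arn:aws:ec2:*:{ACCOUNT_ID}:security-group/*',
}

print(json.dumps(security_group_tag_statement, indent=4))
```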

Suggested Fix

Code fix — widen the error match in sky/provision/aws/config.py, since this is explicitly a best-effort operation (the comment says "If the default security group is not created, we will need to block on instance termination"):

except exceptions.NoClusterLaunchedError as e:
    if 'UnauthorizedOperation' in str(e):
        logger.debug('User does not have permission to create '
                     f'the default security group. {e}')
    else:
        raise e
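The behavior of the widened handler can be checked in isolation. In this sketch, NoClusterLaunchedError is a local stand-in for sky.exceptions.NoClusterLaunchedError, and create_default_sg_best_effort, denied_create, and unrelated_failure are hypothetical names introduced only for the demonstration.

```python
import logging

logger = logging.getLogger(__name__)


class NoClusterLaunchedError(Exception):
    """Local stand-in for sky.exceptions.NoClusterLaunchedError."""


def create_default_sg_best_effort(create_fn) -> bool:
    """Hypothetical wrapper: True if the default SG was created."""
    try:
        create_fn()
        return True
    except NoClusterLaunchedError as e:
        if 'UnauthorizedOperation' in str(e):
            # Best-effort step: log the denial and carry on.
            logger.debug('User does not have permission to create '
                         f'the default security group. {e}')
            return False
        raise


def denied_create():
    # Abridged version of the IAM denial from the Error section.
    raise NoClusterLaunchedError(
        'An error occurred (UnauthorizedOperation) when calling the '
        'CreateSecurityGroup operation: not authorized to perform: '
        'ec2:CreateTags on resource: '
        'arn:aws:ec2:eu-west-2:XXXX:security-group/*')


def unrelated_failure():
    # Made-up non-IAM failure that should still propagate.
    raise NoClusterLaunchedError('VPC not found')


print(create_default_sg_best_effort(denied_create))
```

One trade-off to weigh: matching on UnauthorizedOperation ignores every authorization failure in this step, not only the two actions discussed here, which seems acceptable only because the step is best-effort.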

Docs fix — update the minimal IAM policy to include ec2:CreateTags on security-group/*, or separate TagSpecifications from the CreateSecurityGroup call.

Use Case

We're using SkyPilot to orchestrate many parallel short-lived EC2 instances (web crawlers) via Dagster. We configure security_group_name to reuse a single pre-existing SG rather than creating one per cluster. This worked in 0.10.3 and broke after upgrading to 0.12.1.
