[Bug] security_group_name config fails in 0.12.1 — ec2:CreateTags error not caught during default SG creation
System Info
- SkyPilot version: 0.12.1
- Cloud: AWS
- API server: Helm-deployed on EKS
- Upgraded from: 0.10.3
Description
When using aws.security_group_name in ~/.sky/config.yaml to reuse a pre-existing security group, sky launch fails because SkyPilot's attempt to create the default security group (a deletion optimization) hits an ec2:CreateTags IAM error that isn't caught by the error handler.
This is a regression from 0.10.x. The documented minimal IAM policy is also affected — it grants ec2:CreateTags on instance/* only, not security-group/*.
Config
# ~/.sky/config.yaml
aws:
  security_group_name:
    - "crawl-*": my-existing-security-group
The security group my-existing-security-group exists in the target VPC in eu-west-2.
Task YAML
resources:
  cpus: 2
  memory: 4
  cloud: aws
  region: eu-west-2
Error
W config.py:775] Failed to create security group. Error:
botocore.exceptions.ClientError: An error occurred (UnauthorizedOperation)
when calling the CreateSecurityGroup operation: You are not authorized to
perform this operation. User: arn:aws:iam::XXXX:user/system/skypilot-user
is not authorized to perform: ec2:CreateTags on resource:
arn:aws:ec2:eu-west-2:XXXX:security-group/*
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones
Root Cause
In sky/provision/aws/config.py around line 114, when a custom security_group_name is configured:
- The custom SG (my-existing-security-group) is found successfully; the provisioning log confirms "GroupName": "my-existing-security-group" in the provider config
- Because expected_sg_name != DEFAULT_SECURITY_GROUP_NAME, SkyPilot also tries to create/find a default SG (sky-sg-{user}-{hash}) as a deletion optimization
- The default SG doesn't exist, so _get_or_create_vpc_security_group tries to create it
- CreateSecurityGroup now includes TagSpecifications (added in the 0.12.x VPC refactor), so AWS evaluates ec2:CreateTags first and denies it
The try/except at line 126 was intended to handle this gracefully:
except exceptions.NoClusterLaunchedError as e:
    if 'not authorized to perform: ec2:CreateSecurityGroup' in str(e):
        pass  # intended to silently ignore
    else:
        raise e  # ← BUG: falls through here
The string check looks for ec2:CreateSecurityGroup but the actual error is ec2:CreateTags. The exception falls through to raise e.
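The mismatch can be checked without AWS access. This is a minimal sketch, assuming the exception text matches the log in the Error section; `denied_action` is a hypothetical helper written for this report, not a SkyPilot function.

```python
import re

# Message text copied from the Error section (account ID redacted as XXXX).
ERROR_TEXT = (
    'An error occurred (UnauthorizedOperation) when calling the '
    'CreateSecurityGroup operation: You are not authorized to perform this '
    'operation. User: arn:aws:iam::XXXX:user/system/skypilot-user is not '
    'authorized to perform: ec2:CreateTags on resource: '
    'arn:aws:ec2:eu-west-2:XXXX:security-group/*'
)

# Substring the current handler looks for.
OLD_NEEDLE = 'not authorized to perform: ec2:CreateSecurityGroup'


def denied_action(message):
    """Return the IAM action named after 'not authorized to perform:'."""
    match = re.search(r'not authorized to perform: (\S+)', message)
    return match.group(1) if match else None


# The old check does not match, so the handler re-raises.
print(OLD_NEEDLE in ERROR_TEXT)
# The action AWS actually reports as denied.
print(denied_action(ERROR_TEXT))
```

The operation name (CreateSecurityGroup) does appear in the message, but only in the "when calling the ... operation" part, not after "not authorized to perform:".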
Two Issues
1. Error handler string mismatch (code bug)
The catch block at config.py:127 checks for ec2:CreateSecurityGroup but AWS returns ec2:CreateTags because CreateSecurityGroup now includes TagSpecifications.
2. Minimal IAM policy is incompatible with 0.12.1 (docs bug)
The documented minimal IAM policy grants ec2:CreateTags only on instance/*:
{
  "Action": ["ec2:CreateTags", "ec2:DeleteTags", ...],
  "Resource": "arn:aws:ec2:*:<account-ID>:instance/*"
}
But CreateSecurityGroup with TagSpecifications requires ec2:CreateTags on security-group/*. This means any user following the minimal IAM docs will hit this error on first launch (not just users of security_group_name).
Suggested Fix
Code fix — widen the error match in sky/provision/aws/config.py, since this is explicitly a best-effort operation (the comment says "If the default security group is not created, we will need to block on instance termination"):
except exceptions.NoClusterLaunchedError as e:
    if 'UnauthorizedOperation' in str(e):
        logger.debug('User does not have permission to create '
                     f'the default security group. {e}')
    else:
        raise e
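To illustrate the intended behavior of the widened handler, here is a self-contained sketch. It assumes the AWS error text ends up in the exception message, as the existing str(e) check implies. `NoClusterLaunchedError` below is a local stand-in for the SkyPilot exception, and `create_default_sg_best_effort`, `denied_create` and `other_failure` are hypothetical names used only for this example.

```python
import logging

logger = logging.getLogger(__name__)


class NoClusterLaunchedError(Exception):
    """Local stand-in for sky.exceptions.NoClusterLaunchedError."""


def create_default_sg_best_effort(create_fn) -> bool:
    """Return True if the group was created, False if creation was skipped."""
    try:
        create_fn()
        return True
    except NoClusterLaunchedError as e:
        # Any authorization failure is tolerated, whichever action was denied.
        if 'UnauthorizedOperation' in str(e):
            logger.debug('User does not have permission to create '
                         f'the default security group. {e}')
            return False
        # Everything else still propagates to the caller.
        raise


def denied_create():
    # Simulates the failure from this report (message abbreviated).
    raise NoClusterLaunchedError(
        'An error occurred (UnauthorizedOperation) when calling the '
        'CreateSecurityGroup operation: not authorized to perform: '
        'ec2:CreateTags')


def other_failure():
    # Simulates an unrelated failure that should not be swallowed.
    raise NoClusterLaunchedError('some unrelated provisioning failure')


print(create_default_sg_best_effort(denied_create))
```

With this shape, a denied ec2:CreateTags or ec2:CreateSecurityGroup is logged and skipped, while unrelated failures still surface.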
Docs fix — update the minimal IAM policy to include ec2:CreateTags on security-group/*, or separate TagSpecifications from the CreateSecurityGroup call.
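For the docs fix, one possible shape for the extra policy statement is sketched below, built as a Python dict so it can be printed as JSON. This is an assumption on our side, not the wording of the official docs: the `<account-ID>` placeholder is kept from the documented policy, and the `ec2:CreateAction` condition (an AWS condition key for tag-on-create) is optional and would need to be dropped if SkyPilot also tags security groups after creating them.

```python
import json

# Placeholder kept from the documented policy; substitute a real account ID.
ACCOUNT_ID = '<account-ID>'

# Assumption: a separate statement scoped to security groups, limited to
# tagging that happens as part of the CreateSecurityGroup call.
sg_tagging_statement = {
    'Effect': 'Allow',
    'Action': ['ec2:CreateTags'],
    'Resource': f'arn:aws:ec2:*:{ACCOUNT_ID}:security-group/*',
    'Condition': {
        'StringEquals': {'ec2:CreateAction': 'CreateSecurityGroup'},
    },
}

print(json.dumps(sg_tagging_statement, indent=2))
```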
Use Case
We're using SkyPilot to orchestrate many parallel short-lived EC2 instances (web crawlers) via Dagster. We configure security_group_name to reuse a single pre-existing SG rather than creating one per cluster. This worked in 0.10.3 and broke after upgrading to 0.12.1.