Pre-clean orphaned ENIs and SGs before VPC stack delete#85
Conversation
The e2e VPC CloudFormation stack delete consistently fails on the first attempt because ENIs and security groups left behind by hosted cluster teardown (LB controllers, CNI) haven't been fully released yet. The test retries up to 3 times with a 5-minute backoff, wasting ~19.5 minutes per CI run. Add a pre-cleanup step to DeleteVPC that actively removes orphaned ENIs and non-default security groups from the VPC before calling DeleteStack, so the first attempt succeeds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jmelis The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughAdds VPC Pre-Deletion Cleanup
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (9 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
internal/aws/ec2/cleanup.go (1)
79-80: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winHonor
ctx.Done()while waiting.
time.Sleepignores cancellation, so a canceled CLI/Lambda invocation can still sit here for another sleep interval before the next AWS call notices the context. Useselectwithtime.Afterandctx.Done()instead. As per path instructions,**/*.go:context.Context for cancellation and timeouts.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/aws/ec2/cleanup.go` around lines 79 - 80, The retry wait inside the EC2 cleanup loop currently uses time.Sleep, which ignores cancellation; update the waiting logic in the cleanup flow to honor ctx.Done() by using a select between time.After and ctx.Done() before the next AWS call. Keep the change localized to the loop in the cleanup function in internal/aws/ec2/cleanup.go and preserve the existing retry behavior while allowing the operation to exit promptly when the context is canceled.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/aws/ec2/cleanup.go`:
- Around line 32-75: Surface partial cleanup failures from deleteOrphanedENIs so
CleanVPCForDeletion can fail when detach/delete operations do not succeed,
instead of swallowing the errors and returning nil after only logging warnings.
Track any detach or delete failure in deleteOrphanedENIs and return a wrapped
error at the end while preserving the existing per-ENI logging and
continue-on-error behavior; apply the same pattern in
deleteNonDefaultSecurityGroups so both cleanup helpers report blocking failures
back to DeleteVPC.
- Around line 33-37: The cleanup flow in describeVPCResources/deleteVPC only
fetches the first page of EC2 results, so leaked ENIs or security groups on
later pages can be missed. Update the DescribeNetworkInterfaces and
DescribeSecurityGroups paths to use the EC2 paginator APIs in the cleanup logic,
and ensure the full result set is collected before DeleteVPC proceeds.
---
Nitpick comments:
In `@internal/aws/ec2/cleanup.go`:
- Around line 79-80: The retry wait inside the EC2 cleanup loop currently uses
time.Sleep, which ignores cancellation; update the waiting logic in the cleanup
flow to honor ctx.Done() by using a select between time.After and ctx.Done()
before the next AWS call. Keep the change localized to the loop in the cleanup
function in internal/aws/ec2/cleanup.go and preserve the existing retry behavior
while allowing the operation to exit promptly when the context is canceled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift-online/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e141b003-5524-40d1-9616-0e8efa84c5ea
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (3)
go.modinternal/aws/ec2/cleanup.gointernal/services/clustervpc/service.go
| func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error { | ||
| out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ | ||
| Filters: []types.Filter{ | ||
| {Name: aws.String("vpc-id"), Values: []string{vpcID}}, | ||
| }, | ||
| }) | ||
| if err != nil { | ||
| return fmt.Errorf("describing ENIs: %w", err) | ||
| } | ||
|
|
||
| for _, eni := range out.NetworkInterfaces { | ||
| eniID := aws.ToString(eni.NetworkInterfaceId) | ||
|
|
||
| if eni.Attachment != nil && eni.Attachment.AttachmentId != nil { | ||
| // Only attempt detach for ENIs we can force-detach. | ||
| // ELB/Lambda-managed attachments will be cleaned up by their | ||
| // owning service; skip them to avoid API errors. | ||
| if eni.Attachment.DeleteOnTermination != nil && *eni.Attachment.DeleteOnTermination { | ||
| log.Printf(" skipping ENI %s (managed attachment, will be cleaned by owner)", eniID) | ||
| continue | ||
| } | ||
|
|
||
| log.Printf(" detaching ENI %s", eniID) | ||
| _, err := client.DetachNetworkInterface(ctx, &ec2.DetachNetworkInterfaceInput{ | ||
| AttachmentId: eni.Attachment.AttachmentId, | ||
| Force: aws.Bool(true), | ||
| }) | ||
| if err != nil { | ||
| log.Printf(" warning: failed to detach ENI %s: %v (will still try delete)", eniID, err) | ||
| } else { | ||
| waitForENIAvailable(ctx, client, eniID) | ||
| } | ||
| } | ||
|
|
||
| log.Printf(" deleting ENI %s", eniID) | ||
| _, err := client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{ | ||
| NetworkInterfaceId: aws.String(eniID), | ||
| }) | ||
| if err != nil { | ||
| log.Printf(" warning: failed to delete ENI %s: %v", eniID, err) | ||
| } | ||
| } | ||
|
|
||
| return nil |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Return partial cleanup failures to the caller.
These helpers only return an error when the initial Describe* call fails. If the one blocking ENI or security group fails to detach/delete, CleanVPCForDeletion still returns nil, so DeleteVPC proceeds as if pre-cleanup succeeded and the operator loses the real reason the first stack delete can still time out.
Possible fix
+import "errors"
+
func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error {
+ var cleanupErr error
+
out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
Filters: []types.Filter{
{Name: aws.String("vpc-id"), Values: []string{vpcID}},
@@
_, err := client.DetachNetworkInterface(ctx, &ec2.DetachNetworkInterfaceInput{
AttachmentId: eni.Attachment.AttachmentId,
Force: aws.Bool(true),
})
if err != nil {
log.Printf(" warning: failed to detach ENI %s: %v (will still try delete)", eniID, err)
+ cleanupErr = errors.Join(cleanupErr, fmt.Errorf("detach ENI %s: %w", eniID, err))
} else {
waitForENIAvailable(ctx, client, eniID)
}
}
@@
_, err := client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{
NetworkInterfaceId: aws.String(eniID),
})
if err != nil {
log.Printf(" warning: failed to delete ENI %s: %v", eniID, err)
+ cleanupErr = errors.Join(cleanupErr, fmt.Errorf("delete ENI %s: %w", eniID, err))
}
}
- return nil
+ return cleanupErr
}Apply the same pattern in deleteNonDefaultSecurityGroups.
Also applies to: 93-145
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/aws/ec2/cleanup.go` around lines 32 - 75, Surface partial cleanup
failures from deleteOrphanedENIs so CleanVPCForDeletion can fail when
detach/delete operations do not succeed, instead of swallowing the errors and
returning nil after only logging warnings. Track any detach or delete failure in
deleteOrphanedENIs and return a wrapped error at the end while preserving the
existing per-ENI logging and continue-on-error behavior; apply the same pattern
in deleteNonDefaultSecurityGroups so both cleanup helpers report blocking
failures back to DeleteVPC.
| out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ | ||
| Filters: []types.Filter{ | ||
| {Name: aws.String("vpc-id"), Values: []string{vpcID}}, | ||
| }, | ||
| }) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the target file and inspect the relevant sections.
ast-grep outline internal/aws/ec2/cleanup.go --view expanded
printf '\n--- cleanup.go (selected lines) ---\n'
sed -n '1,220p' internal/aws/ec2/cleanup.go
printf '\n--- search for paginator usage in internal/aws/ec2 ---\n'
rg -n "New.*Paginator|Paginator\(" internal/aws/ec2 internal -g '*.go'
printf '\n--- inspect callers / surrounding cleanup flow ---\n'
rg -n "CleanVPCForDeletion|deleteOrphanedENIs|deleteNonDefaultSecurityGroups" internal -g '*.go'Repository: openshift-online/rosa-hyperfleet-cli
Length of output: 6172
🌐 Web query:
AWS EC2 DescribeNetworkInterfaces DescribeSecurityGroups pagination NextToken paginator Go SDK v2
💡 Result:
In the AWS SDK for Go v2, you handle pagination for operations like DescribeNetworkInterfaces and DescribeSecurityGroups using the service-specific paginator helpers rather than manually managing the NextToken [1][2]. The SDK provides dedicated constructors for these paginators, which simplify the process by abstracting the request/response cycle [1][2]. To implement pagination, follow this pattern: 1. Use the appropriate constructor: ec2.NewDescribeNetworkInterfacesPaginator or ec2.NewDescribeSecurityGroupsPaginator [3][4]. 2. Create the paginator by passing your client and the input parameters [1][2]. You can optionally provide a functional argument to configure settings like the page limit [3][1]. 3. Iterate using the paginator's HasMorePages and NextPage methods [3][4][1]. Example for DescribeNetworkInterfaces: paginator:= ec2.NewDescribeNetworkInterfacesPaginator(ec2Client, &ec2.DescribeNetworkInterfacesInput{ MaxResults: aws.Int32(100), }) for paginator.HasMorePages { output, err:= paginator.NextPage(ctx) if err!= nil { log.Fatalf("failed to get page: %v", err) } for _, eni:= range output.NetworkInterfaces { fmt.Println(*eni.NetworkInterfaceId) } } The paginator automatically handles the NextToken internally, storing it and applying it to subsequent calls until no more pages remain [5][1]. This is the recommended approach over manual token handling, as it reduces error-prone boilerplate and ensures standard request lifecycle management [6][1][2].
Citations:
- 1: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/using.html
- 2: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/migrate-gosdk.html
- 3: https://stackoverflow.com/questions/78258095/parsing-comparing-values-from-paginator-in-aws-golang-sdk
- 4: https://github.com/cilium/cilium/blob/8fade44c/pkg/aws/ec2/ec2.go
- 5: https://awscli.amazonaws.com/v2/documentation/api/2.13.38/reference/ec2/describe-security-groups.html
- 6: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeNetworkInterfaces.html
Use EC2 paginators for both listings. DescribeNetworkInterfaces and DescribeSecurityGroups are paginated, so a VPC with enough leaked ENIs or security groups can leave later pages untouched and still block DeleteVPC.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/aws/ec2/cleanup.go` around lines 33 - 37, The cleanup flow in
describeVPCResources/deleteVPC only fetches the first page of EC2 results, so
leaked ENIs or security groups on later pages can be missed. Update the
DescribeNetworkInterfaces and DescribeSecurityGroups paths to use the EC2
paginator APIs in the cleanup logic, and ensure the full result set is collected
before DeleteVPC proceeds.
| func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error { | ||
| out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ | ||
| Filters: []types.Filter{ | ||
| {Name: aws.String("vpc-id"), Values: []string{vpcID}}, |
There was a problem hiding this comment.
Should we add additional filter for status as well to ensure that the ENI being filtered is "available" only and not "in-use". That might simplify some of the additional checks being added.
status - The status of the network interface. If the network interface is not attached to an instance, the status is available; if a network interface is attached to an instance the status is in-use.
| // Only attempt detach for ENIs we can force-detach. | ||
| // ELB/Lambda-managed attachments will be cleaned up by their | ||
| // owning service; skip them to avoid API errors. | ||
| if eni.Attachment.DeleteOnTermination != nil && *eni.Attachment.DeleteOnTermination { | ||
| log.Printf(" skipping ENI %s (managed attachment, will be cleaned by owner)", eniID) | ||
| continue | ||
| } |
There was a problem hiding this comment.
Trying to understand this part better. So for ELB/Lamba ENI we will as such not have DeleteOnTermination since it's a EC2 property so we will not be able to filter out ELB and other ENI through this check right?
I came across requester-managed: true (link) which would cover the list of ENI/resources that will be automatically handled.
So in the previous comment , we could have the following filters instead?
- "Name=status,Values=available"
- "Name=requester-managed,Values=false"
This would filter out the ENI that can be cleaned up and are not eligible for automatic deletion. Thoughts?
Summary
CleanVPCForDeletion()ininternal/aws/ec2/cleanup.gothat detaches/deletes orphaned ENIs and revokes/deletes non-default security groups in the VPC before CloudFormation stack deletionDeleteVPC()to call the pre-cleanup step, so the stack delete succeeds on the first attempt instead of timing outThe e2e VPC CloudFormation stack delete consistently fails on the first attempt because ENIs and security groups left behind by hosted cluster teardown (LB controllers, CNI plugin) haven't been fully released yet. The test retries up to 3 times with a 5-minute backoff, wasting ~19.5 minutes per CI run.
Test plan
go build ./...andgo vet ./...passrosactl cluster-vpc delete-- should succeed on first attempt🤖 Generated with Claude Code
Summary by CodeRabbit