Skip to content

Pre-clean orphaned ENIs and SGs before VPC stack delete#85

Open
jmelis wants to merge 1 commit into
mainfrom
jmelis/vpc-preclean-before-delete
Open

Pre-clean orphaned ENIs and SGs before VPC stack delete#85
jmelis wants to merge 1 commit into
mainfrom
jmelis/vpc-preclean-before-delete

Conversation

@jmelis

@jmelis jmelis commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add CleanVPCForDeletion() in internal/aws/ec2/cleanup.go that detaches/deletes orphaned ENIs and revokes/deletes non-default security groups in the VPC before CloudFormation stack deletion
  • Update DeleteVPC() to call the pre-cleanup step, so the stack delete succeeds on the first attempt instead of timing out

The e2e VPC CloudFormation stack delete consistently fails on the first attempt because ENIs and security groups left behind by hosted cluster teardown (LB controllers, CNI plugin) haven't been fully released yet. The test retries up to 3 times with a 5-minute backoff, wasting ~19.5 minutes per CI run.

Test plan

  • Verify go build ./... and go vet ./... pass
  • Manual test: create a VPC, launch an HCP against it, tear down the HCP, then rosactl cluster-vpc delete -- should succeed on first attempt
  • CI: vpc-delete spec should pass on attempt 1 (no more 5-minute backoff sleeps in logs)
  • Follow-up PR in rosa-regional-frontend-api to remove the retry loop from e2e tests

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved VPC deletion reliability by automatically cleaning up leftover network interfaces and security groups before removing a cluster stack.
    • Reduced the chance of stack deletion failures caused by resources still attached to the VPC.
    • If cleanup can’t be completed, deletion still proceeds with warnings instead of stopping entirely.

The e2e VPC CloudFormation stack delete consistently fails on the first
attempt because ENIs and security groups left behind by hosted cluster
teardown (LB controllers, CNI) haven't been fully released yet. The test
retries up to 3 times with a 5-minute backoff, wasting ~19.5 minutes per
CI run.

Add a pre-cleanup step to DeleteVPC that actively removes orphaned ENIs
and non-default security groups from the VPC before calling DeleteStack,
so the first attempt succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jun 27, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmelis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2026
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Walkthrough

Adds internal/aws/ec2/cleanup.go with CleanVPCForDeletion, which removes orphaned ENIs and non-default security groups from a VPC before CloudFormation stack deletion. DeleteVPC in clustervpc/service.go is updated to call this cleanup using the VPC ID from stack outputs. Dependencies are updated to include aws-sdk-go-v2/service/ec2.

VPC Pre-Deletion Cleanup

Layer / File(s) Summary
EC2 VPC cleanup implementation
internal/aws/ec2/cleanup.go
CleanVPCForDeletion orchestrates two phases: deleteOrphanedENIs (force-detach and delete non-managed ENIs) and deleteNonDefaultSecurityGroups (revoke all rules then delete non-default groups), with polling via waitForENIAvailable.
DeleteVPC pre-cleanup integration
internal/services/clustervpc/service.go
DeleteVPC now reads stack outputs to extract VpcId, calls ec2.CleanVPCForDeletion (non-fatal on error), then proceeds with DeleteStack. Imports updated to add log and internal ec2 package.
Dependency updates
go.mod
Adds aws-sdk-go-v2/service/ec2 v1.308.0, bumps aws-sdk-go-v2 and smithy-go, and updates indirect AWS SDK internal modules.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Ai-Attribution ⚠️ Warning AI use is disclosed, but the commit only has Co-Authored-By: Claude Opus 4.6; no Assisted-by/Generated-by Red Hat trailer is present. Replace AI Co-Authored-By with an Assisted-by or Generated-by trailer (e.g. Generated-by: Claude Code) and avoid Co-Authored-By for AI tools.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: pre-cleaning ENIs and security groups before VPC stack deletion.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Weak-Crypto ✅ Passed Changed files and go.mod contain no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret comparisons.
Container-Privileges ✅ Passed The PR only changes go.mod, go.sum, and two Go files; no container/K8s manifest edits add privileged settings.
No-Sensitive-Data-In-Logs ✅ Passed Logs only mention VPC/ENI/SG IDs, stack names, and error status; no secrets, PII, or customer data are emitted.
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets, embedded credentials, private keys, or suspicious secret-like literals found in the changed files.
No-Injection-Vectors ✅ Passed PASS: Touched files only use AWS SDK calls and logging; no SQL/shell/eval/pickle/yaml/os.system/dangerouslySetInnerHTML usage found.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jmelis/vpc-preclean-before-delete

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
internal/aws/ec2/cleanup.go (1)

79-80: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Honor ctx.Done() while waiting.

time.Sleep ignores cancellation, so a canceled CLI/Lambda invocation can still sit here for another sleep interval before the next AWS call notices the context. Use select with time.After and ctx.Done() instead. As per path instructions, **/*.go: context.Context for cancellation and timeouts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/aws/ec2/cleanup.go` around lines 79 - 80, The retry wait inside the
EC2 cleanup loop currently uses time.Sleep, which ignores cancellation; update
the waiting logic in the cleanup flow to honor ctx.Done() by using a select
between time.After and ctx.Done() before the next AWS call. Keep the change
localized to the loop in the cleanup function in internal/aws/ec2/cleanup.go and
preserve the existing retry behavior while allowing the operation to exit
promptly when the context is canceled.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/aws/ec2/cleanup.go`:
- Around line 32-75: Surface partial cleanup failures from deleteOrphanedENIs so
CleanVPCForDeletion can fail when detach/delete operations do not succeed,
instead of swallowing the errors and returning nil after only logging warnings.
Track any detach or delete failure in deleteOrphanedENIs and return a wrapped
error at the end while preserving the existing per-ENI logging and
continue-on-error behavior; apply the same pattern in
deleteNonDefaultSecurityGroups so both cleanup helpers report blocking failures
back to DeleteVPC.
- Around line 33-37: The cleanup flow in describeVPCResources/deleteVPC only
fetches the first page of EC2 results, so leaked ENIs or security groups on
later pages can be missed. Update the DescribeNetworkInterfaces and
DescribeSecurityGroups paths to use the EC2 paginator APIs in the cleanup logic,
and ensure the full result set is collected before DeleteVPC proceeds.

---

Nitpick comments:
In `@internal/aws/ec2/cleanup.go`:
- Around line 79-80: The retry wait inside the EC2 cleanup loop currently uses
time.Sleep, which ignores cancellation; update the waiting logic in the cleanup
flow to honor ctx.Done() by using a select between time.After and ctx.Done()
before the next AWS call. Keep the change localized to the loop in the cleanup
function in internal/aws/ec2/cleanup.go and preserve the existing retry behavior
while allowing the operation to exit promptly when the context is canceled.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift-online/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e141b003-5524-40d1-9616-0e8efa84c5ea

📥 Commits

Reviewing files that changed from the base of the PR and between 00c3e2a and 6a5e724.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (3)
  • go.mod
  • internal/aws/ec2/cleanup.go
  • internal/services/clustervpc/service.go

Comment on lines +32 to +75
func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error {
out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
Filters: []types.Filter{
{Name: aws.String("vpc-id"), Values: []string{vpcID}},
},
})
if err != nil {
return fmt.Errorf("describing ENIs: %w", err)
}

for _, eni := range out.NetworkInterfaces {
eniID := aws.ToString(eni.NetworkInterfaceId)

if eni.Attachment != nil && eni.Attachment.AttachmentId != nil {
// Only attempt detach for ENIs we can force-detach.
// ELB/Lambda-managed attachments will be cleaned up by their
// owning service; skip them to avoid API errors.
if eni.Attachment.DeleteOnTermination != nil && *eni.Attachment.DeleteOnTermination {
log.Printf(" skipping ENI %s (managed attachment, will be cleaned by owner)", eniID)
continue
}

log.Printf(" detaching ENI %s", eniID)
_, err := client.DetachNetworkInterface(ctx, &ec2.DetachNetworkInterfaceInput{
AttachmentId: eni.Attachment.AttachmentId,
Force: aws.Bool(true),
})
if err != nil {
log.Printf(" warning: failed to detach ENI %s: %v (will still try delete)", eniID, err)
} else {
waitForENIAvailable(ctx, client, eniID)
}
}

log.Printf(" deleting ENI %s", eniID)
_, err := client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{
NetworkInterfaceId: aws.String(eniID),
})
if err != nil {
log.Printf(" warning: failed to delete ENI %s: %v", eniID, err)
}
}

return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Return partial cleanup failures to the caller.

These helpers only return an error when the initial Describe* call fails. If the one blocking ENI or security group fails to detach/delete, CleanVPCForDeletion still returns nil, so DeleteVPC proceeds as if pre-cleanup succeeded and the operator loses the real reason the first stack delete can still time out.

Possible fix
+import "errors"
+
 func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error {
+	var cleanupErr error
+
 	out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
 		Filters: []types.Filter{
 			{Name: aws.String("vpc-id"), Values: []string{vpcID}},
@@
 			_, err := client.DetachNetworkInterface(ctx, &ec2.DetachNetworkInterfaceInput{
 				AttachmentId: eni.Attachment.AttachmentId,
 				Force:        aws.Bool(true),
 			})
 			if err != nil {
 				log.Printf("  warning: failed to detach ENI %s: %v (will still try delete)", eniID, err)
+				cleanupErr = errors.Join(cleanupErr, fmt.Errorf("detach ENI %s: %w", eniID, err))
 			} else {
 				waitForENIAvailable(ctx, client, eniID)
 			}
 		}
@@
 		_, err := client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{
 			NetworkInterfaceId: aws.String(eniID),
 		})
 		if err != nil {
 			log.Printf("  warning: failed to delete ENI %s: %v", eniID, err)
+			cleanupErr = errors.Join(cleanupErr, fmt.Errorf("delete ENI %s: %w", eniID, err))
 		}
 	}
 
-	return nil
+	return cleanupErr
 }

Apply the same pattern in deleteNonDefaultSecurityGroups.

Also applies to: 93-145

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/aws/ec2/cleanup.go` around lines 32 - 75, Surface partial cleanup
failures from deleteOrphanedENIs so CleanVPCForDeletion can fail when
detach/delete operations do not succeed, instead of swallowing the errors and
returning nil after only logging warnings. Track any detach or delete failure in
deleteOrphanedENIs and return a wrapped error at the end while preserving the
existing per-ENI logging and continue-on-error behavior; apply the same pattern
in deleteNonDefaultSecurityGroups so both cleanup helpers report blocking
failures back to DeleteVPC.

Comment on lines +33 to +37
out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
Filters: []types.Filter{
{Name: aws.String("vpc-id"), Values: []string{vpcID}},
},
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the target file and inspect the relevant sections.
ast-grep outline internal/aws/ec2/cleanup.go --view expanded

printf '\n--- cleanup.go (selected lines) ---\n'
sed -n '1,220p' internal/aws/ec2/cleanup.go

printf '\n--- search for paginator usage in internal/aws/ec2 ---\n'
rg -n "New.*Paginator|Paginator\(" internal/aws/ec2 internal -g '*.go'

printf '\n--- inspect callers / surrounding cleanup flow ---\n'
rg -n "CleanVPCForDeletion|deleteOrphanedENIs|deleteNonDefaultSecurityGroups" internal -g '*.go'

Repository: openshift-online/rosa-hyperfleet-cli

Length of output: 6172


🌐 Web query:

AWS EC2 DescribeNetworkInterfaces DescribeSecurityGroups pagination NextToken paginator Go SDK v2

💡 Result:

In the AWS SDK for Go v2, you handle pagination for operations like DescribeNetworkInterfaces and DescribeSecurityGroups using the service-specific paginator helpers rather than manually managing the NextToken [1][2]. The SDK provides dedicated constructors for these paginators, which simplify the process by abstracting the request/response cycle [1][2]. To implement pagination, follow this pattern: 1. Use the appropriate constructor: ec2.NewDescribeNetworkInterfacesPaginator or ec2.NewDescribeSecurityGroupsPaginator [3][4]. 2. Create the paginator by passing your client and the input parameters [1][2]. You can optionally provide a functional argument to configure settings like the page limit [3][1]. 3. Iterate using the paginator's HasMorePages and NextPage methods [3][4][1]. Example for DescribeNetworkInterfaces: paginator:= ec2.NewDescribeNetworkInterfacesPaginator(ec2Client, &ec2.DescribeNetworkInterfacesInput{ MaxResults: aws.Int32(100), }) for paginator.HasMorePages { output, err:= paginator.NextPage(ctx) if err!= nil { log.Fatalf("failed to get page: %v", err) } for _, eni:= range output.NetworkInterfaces { fmt.Println(*eni.NetworkInterfaceId) } } The paginator automatically handles the NextToken internally, storing it and applying it to subsequent calls until no more pages remain [5][1]. This is the recommended approach over manual token handling, as it reduces error-prone boilerplate and ensures standard request lifecycle management [6][1][2].

Citations:


Use EC2 paginators for both listings. DescribeNetworkInterfaces and DescribeSecurityGroups are paginated, so a VPC with enough leaked ENIs or security groups can leave later pages untouched and still block DeleteVPC.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/aws/ec2/cleanup.go` around lines 33 - 37, The cleanup flow in
describeVPCResources/deleteVPC only fetches the first page of EC2 results, so
leaked ENIs or security groups on later pages can be missed. Update the
DescribeNetworkInterfaces and DescribeSecurityGroups paths to use the EC2
paginator APIs in the cleanup logic, and ensure the full result set is collected
before DeleteVPC proceeds.

func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error {
out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
Filters: []types.Filter{
{Name: aws.String("vpc-id"), Values: []string{vpcID}},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add additional filter for status as well to ensure that the ENI being filtered is "available" only and not "in-use". That might simplify some of the additional checks being added.

Reference - https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeNetworkInterfaces.html#API_DescribeNetworkInterfaces_RequestParameters

status - The status of the network interface. If the network interface is not attached to an instance, the status is available; if a network interface is attached to an instance the status is in-use.

Comment on lines +46 to +52
// Only attempt detach for ENIs we can force-detach.
// ELB/Lambda-managed attachments will be cleaned up by their
// owning service; skip them to avoid API errors.
if eni.Attachment.DeleteOnTermination != nil && *eni.Attachment.DeleteOnTermination {
log.Printf(" skipping ENI %s (managed attachment, will be cleaned by owner)", eniID)
continue
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand this part better. So for ELB/Lamba ENI we will as such not have DeleteOnTermination since it's a EC2 property so we will not be able to filter out ELB and other ENI through this check right?

I came across requester-managed: true (link) which would cover the list of ENI/resources that will be automatically handled.

So in the previous comment , we could have the following filters instead?

  • "Name=status,Values=available"
  • "Name=requester-managed,Values=false"

This would filter out the ENI that can be cleaned up and are not eligible for automatic deletion. Thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants