Pre-clean orphaned ENIs and SGs before VPC stack delete by jmelis · Pull Request #85 · openshift-online/rosa-hyperfleet-cli

jmelis · 2026-06-27T20:12:58Z

Summary

Add CleanVPCForDeletion() in internal/aws/ec2/cleanup.go that detaches/deletes orphaned ENIs and revokes/deletes non-default security groups in the VPC before CloudFormation stack deletion
Update DeleteVPC() to call the pre-cleanup step, so the stack delete succeeds on the first attempt instead of timing out

The e2e VPC CloudFormation stack delete consistently fails on the first attempt because ENIs and security groups left behind by hosted cluster teardown (LB controllers, CNI plugin) haven't been fully released yet. The test retries up to 3 times with a 5-minute backoff, wasting ~19.5 minutes per CI run.

Test plan

Verify go build ./... and go vet ./... pass
Manual test: create a VPC, launch an HCP against it, tear down the HCP, then rosactl cluster-vpc delete -- should succeed on first attempt
CI: vpc-delete spec should pass on attempt 1 (no more 5-minute backoff sleeps in logs)
Follow-up PR in rosa-regional-frontend-api to remove the retry loop from e2e tests

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved VPC deletion reliability by automatically cleaning up leftover network interfaces and security groups before removing a cluster stack.
- Reduced the chance of stack deletion failures caused by resources still attached to the VPC.
- If cleanup can’t be completed, deletion still proceeds with warnings instead of stopping entirely.

The e2e VPC CloudFormation stack delete consistently fails on the first attempt because ENIs and security groups left behind by hosted cluster teardown (LB controllers, CNI) haven't been fully released yet. The test retries up to 3 times with a 5-minute backoff, wasting ~19.5 minutes per CI run. Add a pre-cleanup step to DeleteVPC that actively removes orphaned ENIs and non-default security groups from the VPC before calling DeleteStack, so the first attempt succeeds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-06-27T20:13:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmelis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jmelis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-27T20:13:10Z

Walkthrough

Adds internal/aws/ec2/cleanup.go with CleanVPCForDeletion, which removes orphaned ENIs and non-default security groups from a VPC before CloudFormation stack deletion. DeleteVPC in clustervpc/service.go is updated to call this cleanup using the VPC ID from stack outputs. Dependencies are updated to include aws-sdk-go-v2/service/ec2.

VPC Pre-Deletion Cleanup

Layer / File(s)	Summary
EC2 VPC cleanup implementation `internal/aws/ec2/cleanup.go`	`CleanVPCForDeletion` orchestrates two phases: `deleteOrphanedENIs` (force-detach and delete non-managed ENIs) and `deleteNonDefaultSecurityGroups` (revoke all rules then delete non-default groups), with polling via `waitForENIAvailable`.
DeleteVPC pre-cleanup integration `internal/services/clustervpc/service.go`	`DeleteVPC` now reads stack outputs to extract `VpcId`, calls `ec2.CleanVPCForDeletion` (non-fatal on error), then proceeds with `DeleteStack`. Imports updated to add `log` and internal `ec2` package.
Dependency updates `go.mod`	Adds `aws-sdk-go-v2/service/ec2 v1.308.0`, bumps `aws-sdk-go-v2` and `smithy-go`, and updates indirect AWS SDK internal modules.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Ai-Attribution	⚠️ Warning	AI use is disclosed, but the commit only has Co-Authored-By: Claude Opus 4.6; no Assisted-by/Generated-by Red Hat trailer is present.	Replace AI Co-Authored-By with an Assisted-by or Generated-by trailer (e.g. Generated-by: Claude Code) and avoid Co-Authored-By for AI tools.

✅ Passed checks (9 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: pre-cleaning ENIs and security groups before VPC stack deletion.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
No-Weak-Crypto	✅ Passed	Changed files and go.mod contain no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret comparisons.
Container-Privileges	✅ Passed	The PR only changes go.mod, go.sum, and two Go files; no container/K8s manifest edits add privileged settings.
No-Sensitive-Data-In-Logs	✅ Passed	Logs only mention VPC/ENI/SG IDs, stack names, and error status; no secrets, PII, or customer data are emitted.
No-Hardcoded-Secrets	✅ Passed	No hardcoded secrets, embedded credentials, private keys, or suspicious secret-like literals found in the changed files.
No-Injection-Vectors	✅ Passed	PASS: Touched files only use AWS SDK calls and logging; no SQL/shell/eval/pickle/yaml/os.system/dangerouslySetInnerHTML usage found.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jmelis/vpc-preclean-before-delete

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

internal/aws/ec2/cleanup.go (1)
79-80: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Honor ctx.Done() while waiting.

time.Sleep ignores cancellation, so a canceled CLI/Lambda invocation can still sit here for another sleep interval before the next AWS call notices the context. Use select with time.After and ctx.Done() instead. As per path instructions, **/*.go: context.Context for cancellation and timeouts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/aws/ec2/cleanup.go` around lines 79 - 80, The retry wait inside the
EC2 cleanup loop currently uses time.Sleep, which ignores cancellation; update
the waiting logic in the cleanup flow to honor ctx.Done() by using a select
between time.After and ctx.Done() before the next AWS call. Keep the change
localized to the loop in the cleanup function in internal/aws/ec2/cleanup.go and
preserve the existing retry behavior while allowing the operation to exit
promptly when the context is canceled.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/aws/ec2/cleanup.go`:
- Around line 32-75: Surface partial cleanup failures from deleteOrphanedENIs so
CleanVPCForDeletion can fail when detach/delete operations do not succeed,
instead of swallowing the errors and returning nil after only logging warnings.
Track any detach or delete failure in deleteOrphanedENIs and return a wrapped
error at the end while preserving the existing per-ENI logging and
continue-on-error behavior; apply the same pattern in
deleteNonDefaultSecurityGroups so both cleanup helpers report blocking failures
back to DeleteVPC.
- Around line 33-37: The cleanup flow in describeVPCResources/deleteVPC only
fetches the first page of EC2 results, so leaked ENIs or security groups on
later pages can be missed. Update the DescribeNetworkInterfaces and
DescribeSecurityGroups paths to use the EC2 paginator APIs in the cleanup logic,
and ensure the full result set is collected before DeleteVPC proceeds.

---

Nitpick comments:
In `@internal/aws/ec2/cleanup.go`:
- Around line 79-80: The retry wait inside the EC2 cleanup loop currently uses
time.Sleep, which ignores cancellation; update the waiting logic in the cleanup
flow to honor ctx.Done() by using a select between time.After and ctx.Done()
before the next AWS call. Keep the change localized to the loop in the cleanup
function in internal/aws/ec2/cleanup.go and preserve the existing retry behavior
while allowing the operation to exit promptly when the context is canceled.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift-online/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e141b003-5524-40d1-9616-0e8efa84c5ea

📥 Commits

Reviewing files that changed from the base of the PR and between 00c3e2a and 6a5e724.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (3)

go.mod
internal/aws/ec2/cleanup.go
internal/services/clustervpc/service.go

coderabbitai · 2026-06-27T20:19:01Z

+func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error {
+	out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
+		Filters: []types.Filter{
+			{Name: aws.String("vpc-id"), Values: []string{vpcID}},
+		},
+	})
+	if err != nil {
+		return fmt.Errorf("describing ENIs: %w", err)
+	}
+
+	for _, eni := range out.NetworkInterfaces {
+		eniID := aws.ToString(eni.NetworkInterfaceId)
+
+		if eni.Attachment != nil && eni.Attachment.AttachmentId != nil {
+			// Only attempt detach for ENIs we can force-detach.
+			// ELB/Lambda-managed attachments will be cleaned up by their
+			// owning service; skip them to avoid API errors.
+			if eni.Attachment.DeleteOnTermination != nil && *eni.Attachment.DeleteOnTermination {
+				log.Printf("  skipping ENI %s (managed attachment, will be cleaned by owner)", eniID)
+				continue
+			}
+
+			log.Printf("  detaching ENI %s", eniID)
+			_, err := client.DetachNetworkInterface(ctx, &ec2.DetachNetworkInterfaceInput{
+				AttachmentId: eni.Attachment.AttachmentId,
+				Force:        aws.Bool(true),
+			})
+			if err != nil {
+				log.Printf("  warning: failed to detach ENI %s: %v (will still try delete)", eniID, err)
+			} else {
+				waitForENIAvailable(ctx, client, eniID)
+			}
+		}
+
+		log.Printf("  deleting ENI %s", eniID)
+		_, err := client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{
+			NetworkInterfaceId: aws.String(eniID),
+		})
+		if err != nil {
+			log.Printf("  warning: failed to delete ENI %s: %v", eniID, err)
+		}
+	}
+
+	return nil


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Return partial cleanup failures to the caller.

These helpers only return an error when the initial Describe* call fails. If the one blocking ENI or security group fails to detach/delete, CleanVPCForDeletion still returns nil, so DeleteVPC proceeds as if pre-cleanup succeeded and the operator loses the real reason the first stack delete can still time out.

Possible fix

+import "errors" + func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error { + var cleanupErr error + out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ Filters: []types.Filter{ {Name: aws.String("vpc-id"), Values: []string{vpcID}}, @@ _, err := client.DetachNetworkInterface(ctx, &ec2.DetachNetworkInterfaceInput{ AttachmentId: eni.Attachment.AttachmentId, Force: aws.Bool(true), }) if err != nil { log.Printf(" warning: failed to detach ENI %s: %v (will still try delete)", eniID, err) + cleanupErr = errors.Join(cleanupErr, fmt.Errorf("detach ENI %s: %w", eniID, err)) } else { waitForENIAvailable(ctx, client, eniID) } } @@ _, err := client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{ NetworkInterfaceId: aws.String(eniID), }) if err != nil { log.Printf(" warning: failed to delete ENI %s: %v", eniID, err) + cleanupErr = errors.Join(cleanupErr, fmt.Errorf("delete ENI %s: %w", eniID, err)) } } - return nil + return cleanupErr }

Apply the same pattern in deleteNonDefaultSecurityGroups.

Also applies to: 93-145

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/aws/ec2/cleanup.go` around lines 32 - 75, Surface partial cleanup failures from deleteOrphanedENIs so CleanVPCForDeletion can fail when detach/delete operations do not succeed, instead of swallowing the errors and returning nil after only logging warnings. Track any detach or delete failure in deleteOrphanedENIs and return a wrapped error at the end while preserving the existing per-ENI logging and continue-on-error behavior; apply the same pattern in deleteNonDefaultSecurityGroups so both cleanup helpers report blocking failures back to DeleteVPC.

coderabbitai · 2026-06-27T20:19:01Z

+	out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
+		Filters: []types.Filter{
+			{Name: aws.String("vpc-id"), Values: []string{vpcID}},
+		},
+	})


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Map the target file and inspect the relevant sections. ast-grep outline internal/aws/ec2/cleanup.go --view expanded printf '\n--- cleanup.go (selected lines) ---\n' sed -n '1,220p' internal/aws/ec2/cleanup.go printf '\n--- search for paginator usage in internal/aws/ec2 ---\n' rg -n "New.*Paginator|Paginator\(" internal/aws/ec2 internal -g '*.go' printf '\n--- inspect callers / surrounding cleanup flow ---\n' rg -n "CleanVPCForDeletion|deleteOrphanedENIs|deleteNonDefaultSecurityGroups" internal -g '*.go'

Repository: openshift-online/rosa-hyperfleet-cli

Length of output: 6172

🌐 Web query:

AWS EC2 DescribeNetworkInterfaces DescribeSecurityGroups pagination NextToken paginator Go SDK v2

💡 Result:

In the AWS SDK for Go v2, you handle pagination for operations like DescribeNetworkInterfaces and DescribeSecurityGroups using the service-specific paginator helpers rather than manually managing the NextToken [1][2]. The SDK provides dedicated constructors for these paginators, which simplify the process by abstracting the request/response cycle [1][2]. To implement pagination, follow this pattern: 1. Use the appropriate constructor: ec2.NewDescribeNetworkInterfacesPaginator or ec2.NewDescribeSecurityGroupsPaginator [3][4]. 2. Create the paginator by passing your client and the input parameters [1][2]. You can optionally provide a functional argument to configure settings like the page limit [3][1]. 3. Iterate using the paginator's HasMorePages and NextPage methods [3][4][1]. Example for DescribeNetworkInterfaces: paginator:= ec2.NewDescribeNetworkInterfacesPaginator(ec2Client, &ec2.DescribeNetworkInterfacesInput{ MaxResults: aws.Int32(100), }) for paginator.HasMorePages { output, err:= paginator.NextPage(ctx) if err!= nil { log.Fatalf("failed to get page: %v", err) } for _, eni:= range output.NetworkInterfaces { fmt.Println(*eni.NetworkInterfaceId) } } The paginator automatically handles the NextToken internally, storing it and applying it to subsequent calls until no more pages remain [5][1]. This is the recommended approach over manual token handling, as it reduces error-prone boilerplate and ensures standard request lifecycle management [6][1][2].

Citations:

1: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/using.html

2: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/migrate-gosdk.html

3: https://stackoverflow.com/questions/78258095/parsing-comparing-values-from-paginator-in-aws-golang-sdk

4: https://github.com/cilium/cilium/blob/8fade44c/pkg/aws/ec2/ec2.go

5: https://awscli.amazonaws.com/v2/documentation/api/2.13.38/reference/ec2/describe-security-groups.html

6: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeNetworkInterfaces.html

Use EC2 paginators for both listings. DescribeNetworkInterfaces and DescribeSecurityGroups are paginated, so a VPC with enough leaked ENIs or security groups can leave later pages untouched and still block DeleteVPC.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/aws/ec2/cleanup.go` around lines 33 - 37, The cleanup flow in describeVPCResources/deleteVPC only fetches the first page of EC2 results, so leaked ENIs or security groups on later pages can be missed. Update the DescribeNetworkInterfaces and DescribeSecurityGroups paths to use the EC2 paginator APIs in the cleanup logic, and ensure the full result set is collected before DeleteVPC proceeds.

ravitri · 2026-06-28T09:56:54Z

+func deleteOrphanedENIs(ctx context.Context, client *ec2.Client, vpcID string) error {
+	out, err := client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{
+		Filters: []types.Filter{
+			{Name: aws.String("vpc-id"), Values: []string{vpcID}},


Should we add additional filter for status as well to ensure that the ENI being filtered is "available" only and not "in-use". That might simplify some of the additional checks being added.

Reference - https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeNetworkInterfaces.html#API_DescribeNetworkInterfaces_RequestParameters

status - The status of the network interface. If the network interface is not attached to an instance, the status is available; if a network interface is attached to an instance the status is in-use.

ravitri · 2026-06-28T11:14:07Z

+			// Only attempt detach for ENIs we can force-detach.
+			// ELB/Lambda-managed attachments will be cleaned up by their
+			// owning service; skip them to avoid API errors.
+			if eni.Attachment.DeleteOnTermination != nil && *eni.Attachment.DeleteOnTermination {
+				log.Printf("  skipping ENI %s (managed attachment, will be cleaned by owner)", eniID)
+				continue
+			}


Trying to understand this part better. So for ELB/Lamba ENI we will as such not have DeleteOnTermination since it's a EC2 property so we will not be able to filter out ELB and other ENI through this check right?

I came across requester-managed: true (link) which would cover the list of ENI/resources that will be automatically handled.

So in the previous comment , we could have the following filters instead?

"Name=status,Values=available"

"Name=requester-managed,Values=false"

This would filter out the ENI that can be cleaned up and are not eligible for automatic deletion. Thoughts?

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2026

jmelis mentioned this pull request Jun 27, 2026

Remove VPC delete retry loop from e2e tests openshift-online/rosa-hyperfleet-api#115

Open

3 tasks

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

ravitri reviewed Jun 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pre-clean orphaned ENIs and SGs before VPC stack delete#85

Pre-clean orphaned ENIs and SGs before VPC stack delete#85
jmelis wants to merge 1 commit into
mainfrom
jmelis/vpc-preclean-before-delete

jmelis commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci Bot commented Jun 27, 2026

Uh oh!

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Uh oh!

coderabbitai Bot Jun 27, 2026

Uh oh!

ravitri Jun 28, 2026

Uh oh!

ravitri Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jmelis commented Jun 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented Jun 27, 2026

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

ravitri Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

ravitri Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jmelis commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading