fix: reconcile loop to watch migrations if earlier submission had failed #3344

hasethuraman · 2025-09-22T17:31:21Z

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:
If the migration label was never applied to a PersistentVolume, the migration is not monitored. This PR does not add retries for the case where starting a migration fails because of a transient Kubernetes issue.

This PR adds a label-driven recovery loop that re-establishes migration monitoring (emitting SKUMigrationStarted, SKUMigrationProgress, and SKUMigrationCompleted events) for Premium_LRS → PremiumV2_LRS disk migrations after a controller restart, or after an earlier transient failure prevented the in-memory monitor from being created.

With this change, a background goroutine (controller instances only, i.e. NodeID == "") sleeps 30s after startup and then calls recoverMigrationMonitorsFromLabels(...) every 10 minutes.
Any PersistentVolume that carries the label disk.csi.azure.com/migration-in-progress=true and whose CSI driver is this driver gets a new in-memory monitoring task, unless one is already active.

Manual recovery path:
If users see no events due to a transient start failure, they can add the label themselves and the next scan will attach monitoring.
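
As an illustration of the manual path, the label can be added with `kubectl label pv <pv-name> disk.csi.azure.com/migration-in-progress=true`, or programmatically with a JSON merge patch. The sketch below only builds the patch body using the standard library; `labelMergePatch` is a hypothetical helper, and how the patch is sent to the API server is left to whichever client you use.

```go
package main

import "encoding/json"

// labelMergePatch builds a JSON merge patch body that sets a single label
// on a Kubernetes object. Hypothetical helper, not part of this PR.
func labelMergePatch(key, value string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{key: value},
		},
	}
	return json.Marshal(patch)
}
```

Calling `labelMergePatch("disk.csi.azure.com/migration-in-progress", "true")` yields `{"metadata":{"labels":{"disk.csi.azure.com/migration-in-progress":"true"}}}`, which leaves the volume's other labels untouched when applied as a merge patch.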

Which issue(s) this PR fixes:

Fixes #
The migration monitor failed to start because of a transient issue, so migration events do not show up.
In the test result below, the recovery scan found a labeled volume after 10 minutes. I added the label manually to test the behavior after a transient failure.

Requirements:

uses conventional commit messages
includes documentation
adds unit tests
tested upgrade from previous version

Special notes for your reviewer:

Release note:

none

k8s-ci-robot · 2025-09-22T17:31:26Z

@hasethuraman: The label(s) kind/resilience cannot be applied, because the repository doesn't have them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-09-22T17:31:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hasethuraman
Once this PR has been reviewed and has the lgtm label, please assign nearora-msft for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-09-22T17:31:32Z

Hi @hasethuraman. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


andyzhangx

/ok-to-test

pkg/azuredisk/azuredisk.go

hasethuraman · 2025-09-25T17:37:25Z

/retest

fix: reconcile loop to watch migrations if earlier submission had failed

3e7744b

k8s-ci-robot added the do-not-merge/work-in-progress, kind/bug, and kind/feature labels Sep 22, 2025

k8s-ci-robot added the cncf-cla: yes label Sep 22, 2025

k8s-ci-robot requested review from cvvz and landreasyan September 22, 2025 17:31

k8s-ci-robot added the needs-ok-to-test label Sep 22, 2025

k8s-ci-robot added the size/L label Sep 22, 2025

hasethuraman marked this pull request as ready for review September 23, 2025 12:19

k8s-ci-robot removed the do-not-merge/work-in-progress label Sep 23, 2025

k8s-ci-robot requested a review from nearora-msft September 23, 2025 12:19

andyzhangx reviewed Sep 23, 2025

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Sep 23, 2025

landreasyan reviewed Sep 24, 2025

pkg/azuredisk/azuredisk.go (resolved)

k8s-ci-robot added the size/XL label and removed the size/L label Sep 25, 2025

fix: pr comments and remove event logs from migration monitor

b15d419

hasethuraman force-pushed the hari/retry-recover-migrations branch from 2b8069d to b15d419 Compare October 1, 2025 05:26
