@hasethuraman hasethuraman commented Sep 22, 2025

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:
If the label was never applied, the migration is not monitored. This PR does not add retries for a migration-start failure caused by a transient Kubernetes issue.

Adds label-driven recovery that runs in a loop and re-establishes migration monitoring (emitting SKUMigrationStarted, SKUMigrationProgress, and SKUMigrationCompleted events) for Premium_LRS → PremiumV2_LRS disk migrations after a controller restart or an earlier transient failure that prevented the in-memory monitor from being created.

With this change, a background goroutine (controller instances only, i.e. NodeID == "") sleeps 30s after startup and then calls recoverMigrationMonitorsFromLabels(...) every 10 minutes.
Any PersistentVolume that carries the label disk.csi.azure.com/migration-in-progress=true and whose CSI driver is this driver gets a new in-memory monitoring task, unless one is already active.

Manual recovery path:
If users see no events because of a transient start failure, they can add the label themselves, and the next scan will attach monitoring.
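As a rough illustration of the manual path, the label can be added with a JSON merge patch against the PersistentVolume. The helper below only builds the patch body; `migrationLabelPatch` is a hypothetical name, not something this PR adds, and how the patch is sent (for example through a Kubernetes client that accepts the `application/merge-patch+json` content type) is left to the caller.

```go
package main

import "encoding/json"

// migrationLabelPatch returns a JSON merge patch body that sets the
// migration-in-progress label from the PR description on an object.
// Hypothetical helper for illustration; it does not contact the API server.
func migrationLabelPatch() ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{
				"disk.csi.azure.com/migration-in-progress": "true",
			},
		},
	}
	return json.Marshal(patch)
}
```

A merge patch only touches the keys it names, so other labels already on the volume are left unchanged.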

Which issue(s) this PR fixes:

Fixes #
The migration monitor failed to start due to a transient issue, so events don't show up.
The test result below shows that after 10 minutes the recovery scan found a volume with the label, which I added manually to test the behaviour after a transient failure.
image

Requirements:

Special notes for your reviewer:

Release note:

none

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 22, 2025
@k8s-ci-robot
Contributor

@hasethuraman: The label(s) kind/resilience cannot be applied, because the repository doesn't have them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 22, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hasethuraman
Once this PR has been reviewed and has the lgtm label, please assign nearora-msft for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 22, 2025
@k8s-ci-robot
Contributor

Hi @hasethuraman. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 22, 2025
@hasethuraman hasethuraman marked this pull request as ready for review September 23, 2025 12:19
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 23, 2025
Member

@andyzhangx andyzhangx left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 23, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 25, 2025
@hasethuraman
Contributor Author

/retest

@hasethuraman hasethuraman force-pushed the hari/retry-recover-migrations branch from 2b8069d to b15d419 Compare October 1, 2025 05:26
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.