Skip to content

Enable CNPG in-place updates and add operator upgrade test in CI workflow#272

Open
WentingWu666666 wants to merge 11 commits intodocumentdb:mainfrom
WentingWu666666:users/wentingwu/enable-instance-manager-inplace-updates
Open

Enable CNPG in-place updates and add operator upgrade test in CI workflow#272
WentingWu666666 wants to merge 11 commits intodocumentdb:mainfrom
WentingWu666666:users/wentingwu/enable-instance-manager-inplace-updates

Conversation

@WentingWu666666
Copy link
Collaborator

@WentingWu666666 WentingWu666666 commented Feb 26, 2026

  • Set ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=true in cloudnative-pg Helm values to enable instance manager in-place updates
  • Add operator upgrade test in CI workflow

@WentingWu666666 WentingWu666666 marked this pull request as ready for review February 26, 2026 19:53
Copilot AI review requested due to automatic review settings February 26, 2026 19:53
@WentingWu666666 WentingWu666666 changed the title feat: enable CNPG in-place updates and add CI verification step Enable CNPG in-place updates and add CI verification step Feb 26, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables CNPG (CloudNative-PG) instance manager in-place updates by setting the ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES environment variable to true in the CNPG operator deployment. This is accomplished by adding the environment variable to the Helm chart values, and includes a CI verification step to ensure the configuration is correctly propagated.

Changes:

  • Configure CNPG operator to enable instance manager in-place updates via Helm values
  • Add CI verification step to validate the environment variable is correctly set in the deployed CNPG operator

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
operator/documentdb-helm-chart/values.yaml Adds additionalEnv configuration to pass ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=true to the CNPG operator
.github/actions/setup-test-environment/action.yml Adds verification step to confirm the environment variable is correctly set in the CNPG deployment after operator installation

- Set ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=true in cloudnative-pg
  Helm values to enable instance manager in-place updates
- Add CI step in setup-test-environment to verify the env var is
  correctly propagated to the CNPG operator deployment

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/enable-instance-manager-inplace-updates branch from 0d4e1d2 to b27f5c8 Compare February 27, 2026 14:40
- Add DOCUMENTDB_COMBINED_IMAGE env var for released operator (no ImageVolume)
- Add 'Determine initial DocumentDB image' step that checks released chart CRD
  for postgresImage field to decide between combined vs extension image
- Update setup-test-environment to use dynamically resolved DOCUMENTDB_INITIAL_IMAGE
- Fix Step 2 baseline checks to use INITIAL_IMAGE and flexible schema version check
- Fix sidecar-injector verification to query cnpg-system namespace
- Fix CRD name from db.documentdb.io to dbs.documentdb.io
- Add design docs for dual-mode deployment, rollback support, and unified upgrade

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/enable-instance-manager-inplace-updates branch from bdc7888 to 8d7715e Compare February 27, 2026 15:27
Replace CRD postgresImage field check with simple semver comparison.
Versions <= 0.1.3 use combined image, versions > 0.1.3 use extension image.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
When using a combined image with older released operator versions,
status.schemaVersion may not be populated. Remove the hard failure
and just log the value.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ed combined mode

When the released operator version <= 0.1.3 deploys in combined mode,
the cluster must be recreated under the upgraded operator to switch to
ImageVolume mode before testing extension/gateway upgrades.

- Persist USE_COMBINED to GITHUB_ENV for later step branching
- Add 4 conditional steps (gated by USE_COMBINED == true):
  1. Recreate cluster: delete combined-mode CR, create fresh with extension image
  2. Setup port forwarding for re-seeding
  3. Re-seed test data (2 documents)
  4. Cleanup port forwarding
- Update DOCUMENTDB_INITIAL_IMAGE after recreation so Step 2 baseline passes
- All new steps marked with TODO: Remove once released version > 0.1.3

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/enable-instance-manager-inplace-updates branch 2 times, most recently from f2974a4 to 79ea1bf Compare February 27, 2026 20:16
@WentingWu666666 WentingWu666666 changed the title Enable CNPG in-place updates and add CI verification step Enable CNPG in-place updates and add operator upgrade CI test Feb 27, 2026
@WentingWu666666 WentingWu666666 changed the title Enable CNPG in-place updates and add operator upgrade CI test Enable CNPG in-place updates and add operator upgrade test in CI workflow Feb 27, 2026
…Version validation

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/enable-instance-manager-inplace-updates branch from 79ea1bf to 1c65ed9 Compare February 27, 2026 20:49
# Conflicts:
#	.github/actions/setup-test-environment/action.yml
#	.github/workflows/test-upgrade-and-rollback.yml
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/enable-instance-manager-inplace-updates branch from 7d31aa3 to 88939d8 Compare March 18, 2026 15:37
…mode

When the released operator version <= 0.1.3 uses combined mode, the gateway
image must be a publicly available image (the combined image), not the
CI-built gateway image which only exists locally.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The helm chart defaults to imagePullPolicy=Always, but during the upgrade
test the CI-built operator and sidecar images are only available locally
in the Kind cluster (loaded via kind load docker-image). Set pullPolicy
to IfNotPresent so pods use the local images instead of trying to pull
from GHCR where they don't exist.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Collaborator

@xgerman xgerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #272: Enable CNPG in-place updates and add operator upgrade test

Files: 3 changed (+404, −22) — action.yml, test-upgrade-and-rollback.yml, values.yaml

Summary

This PR does two things:

  1. Enables CNPG in-place updates by setting ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=true in the Helm chart's CNPG sub-chart values. This allows CNPG's instance manager to update pods without rolling restarts.

  2. Adds an operator control-plane upgrade test (Step 1) to the upgrade/rollback CI workflow. The test installs the operator from the public Helm repo ("released" version), deploys a DocumentDB cluster, then upgrades the operator to the locally-built version via helm upgrade. It verifies DB pods aren't restarted and data persists.

The version-gating logic for combined→ImageVolume transition (≤0.1.3) is well thought out with clear TODO markers.

Overall: Well-structured approach. Several issues need attention before merge.


🔴 Critical

1. Recreated cluster spec is missing documentDbCredentialSecret

In the "Recreate cluster for ImageVolume mode" step, the inline DocumentDB YAML doesn't include documentDbCredentialSecret:

spec:
  nodeCount: ...
  documentDBImage: $OLD_EXTENSION
  gatewayImage: $OLD_GATEWAY
  resource:
    storage:
      pvcSize: 5Gi
      storageClass: csi-hostpath-sc
  exposeViaService:
    serviceType: ClusterIP
  # Missing: documentDbCredentialSecret

The setup action creates a secret named documentdb-credentials. If the operator requires this reference to configure auth, the re-seed mongosh step will fail with auth errors. Verify the operator picks up documentdb-credentials by convention, or add the field explicitly.

2. PVC conflict risk during cluster recreation

When deleting and immediately recreating the cluster with the same name/namespace, orphaned PVCs may interfere. The step waits for pods to terminate but doesn't clean up PVCs. CNPG uses OwnerReference for PVC management — if PVCs aren't garbage-collected before the new cluster starts, it may fail with binding errors. Add:

echo "Cleaning up old PVCs..."
kubectl delete pvc -n $DB_NS -l cnpg.io/cluster=$DB_NAME --wait=true --timeout=60s || true

🟠 Major

3. Hardcoded DOCUMENTDB_COMBINED_IMAGE uses Microsoft internal registry

DOCUMENTDB_COMBINED_IMAGE: ghcr.io/microsoft/documentdb/documentdb-local:16

This is from ghcr.io/microsoft/documentdb — not the public ghcr.io/documentdb registry. If this registry requires authentication, the CI will fail on external contributor PRs (which don't get GITHUB_TOKEN with package read access to Microsoft's org). Either verify it's publicly pullable or mirror to ghcr.io/documentdb.

4. RELEASED_CHART_VERSION has no workflow_dispatch override

Unlike IMAGE_TAG which has an input override, RELEASED_CHART_VERSION is hardcoded to 'latest'. For testing specific release upgrade scenarios (e.g., from 0.1.2), add a workflow_dispatch input:

workflow_dispatch:
  inputs:
    released_chart_version:
      description: 'Released chart version to upgrade from (default: latest)'
      required: false
      type: string
      default: 'latest'

5. Version comparison may fail on pre-release strings

The sort -V comparison works for clean semver but may produce incorrect results for pre-release versions. 0.1.3-rc1 would sort as > 0.1.3 with sort -V, incorrectly choosing extension mode when combined mode is needed. Consider stripping pre-release suffixes before comparison.

6. CRD name bug fix is buried in refactoring

The step changed kubectl get crd db.documentdb.iokubectl get crd dbs.documentdb.io. This is a real bug fix (the CRD is dbs.documentdb.io), but it's mixed into the verify step refactoring. Please call this out in the PR description since it changes behavior for all install paths.


🟡 Minor

7. Inconsistent conditional syntax

The Cleanup port forwarding after re-seeding step uses:

if: ${{ always() && env.USE_COMBINED == 'true' }}

But Re-seed test data uses bare:

if: env.USE_COMBINED == 'true'

Pick one style. ${{ }} is preferred when using functions like always().

8. TODO comments should reference issue numbers

# TODO: Remove this step once release versions > 0.1.3
# TODO: Remove once we deprecate combined mode

Reference a tracking issue: # TODO(#271): Remove once... so these don't get forgotten.

9. Shell variable interpolation in single-quoted bash -c strings

timeout 300 bash -c '
while true; do
  DB_STATUS=$(kubectl get documentdb '$DB_NAME' -n '$DB_NS' ...)

The '$DB_NAME' pattern breaks out of single quotes and re-enters, relying on concatenation. This works but is fragile. Consider passing variables as positional arguments:

timeout 300 bash -c '
while true; do
  DB_STATUS=$(kubectl get documentdb "$1" -n "$2" ...)
done
' -- "$DB_NAME" "$DB_NS"

🟢 Nitpicks

10. helm repo add is called twice (once in "Determine initial DocumentDB image", once in action's "Install released chart"). Acceptable for robustness but noting the duplication.

11. Step renumbering (1→2, 2→3, 3→4) is correct. ✅

12. The values.yaml additionalEnv addition is correct per CNPG chart 0.27.0's values schema. ✅


Positive Feedback

  • Pod UID stability check — Verifying pod UIDs don't change during operator upgrade is a smart test that catches unintended restarts.
  • Combined→extension transition handling — The version-gating logic with clear TODO markers for removal is well-designed.
  • Verify step refactoring — Extracting verification into a standalone unconditional step benefits all install paths and fixes the stale CRD name bug.
  • Clear step summary — The GITHUB_STEP_SUMMARY includes all relevant image/version info.

Questions

  1. Is ghcr.io/microsoft/documentdb/documentdb-local:16 publicly pullable without auth?
  2. Does the operator default to the documentdb-credentials secret when documentDbCredentialSecret is omitted from the spec?
  3. When the released chart cluster is deleted/recreated, are orphaned PVCs cleaned up by the garbage collector quickly enough?

/request-changes

@WentingWu666666
Copy link
Collaborator Author

Thanks for the thorough review @xgerman! I've pushed commit 39d4b52 addressing your feedback. Here's the item-by-item response:

Critical

1. Missing documentDbCredentialSecret By design, the operator defaults to documentdb-credentials when omitted (see cnpg_cluster.go:31-34 and constants.go:32). The setup-test-environment action creates a secret with exactly that name, and the existing test cluster YAML in the action also omits this field. So the recreated cluster spec is consistent with the existing pattern.

2. PVC conflict risk ✅ Fixed. Added explicit PVC cleanup (kubectl delete pvc -l cnpg.io/cluster=$DB_NAME) between deletion and recreation.

Major

3. DOCUMENTDB_COMBINED_IMAGE Microsoft registry ghcr.io/microsoft/documentdb/documentdb-local:16 is publicly pullable (no auth required). CI has been passing on external PRs with this image. This is the official published release image for v0.1.3.

4. RELEASED_CHART_VERSION input Fixed. Added released_chart_version input to both workflow_dispatch and workflow_call triggers, wired to the env var.

5. Pre-release version comparison Fixed. Added sed 's/-.*//' to strip pre-release suffixes before sort -V comparison.

6. CRD name bug fix Fixed. Also caught that action.yml had the same typo (db.documentdb.io dbs.documentdb.io) in the non-external-images install path. Will call this out in the PR description.

Minor

7. Inconsistent conditional syntax Fixed. Standardized to bare always() && env.USE_COMBINED == 'true' (without unnecessary ${{ }}).

8. TODO issue references Will create a tracking issue for combined mode deprecation and update the TODOs.

9. bash -c variable passing Fixed. All bash -c blocks now use ' -- "$VAR1" "$VAR2" positional argument pattern instead of single-quote concatenation.

Nitpicks

10-12: Acknowledged. The duplicate helm repo add is intentional for robustness.

Answers to Questions

  1. Yes, ghcr.io/microsoft/documentdb/documentdb-local:16 is publicly pullable.
  2. Yes, the operator defaults to documentdb-credentials when documentDbCredentialSecret is omitted.
  3. Added explicit PVC cleanup to ensure no orphaned PVCs interfere.

- Add workflow_dispatch input for released_chart_version override
- Strip pre-release suffixes before semver comparison with sort -V
- Add PVC cleanup before cluster recreation to prevent binding conflicts
- Fix CRD name typo: db.documentdb.io to dbs.documentdb.io in verify step
- Refactor bash -c blocks to use positional args instead of fragile
  single-quote concatenation for shell variable passing
- Standardize conditional syntax (remove unnecessary expression brackets)
- Wire released_chart_version input to RELEASED_CHART_VERSION env var

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the users/wentingwu/enable-instance-manager-inplace-updates branch from 39d4b52 to 863f818 Compare March 19, 2026 14:17
Copy link
Collaborator

@xgerman xgerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3 (still open): DOCUMENTDB_COMBINED_IMAGE uses Microsoft internal registry

DOCUMENTDB_COMBINED_IMAGE: ghcr.io/microsoft/documentdb/documentdb-local:16

This is from ghcr.io/microsoft/documentdb, not the public ghcr.io/documentdb registry. If it requires auth, external
PRs will fail. Please either:

  • Confirm it's publicly pullable (and add a comment noting this), or
  • Mirror to ghcr.io/documentdb and reference from there

New: Local chart path now has double verification

The refactoring extracted the verification logic into a standalone Verify operator installation step (unconditional),
but the Install DocumentDB Operator (local chart) step still has its own inline verification block (~20 lines). For
the local chart path, both run sequentially — redundant but not harmful. Consider removing the inline verification
from the local chart step to keep things DRY, or leave it with a comment noting the redundancy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants