Enable CNPG in-place updates and add operator upgrade test in CI workflow#272
Conversation
There was a problem hiding this comment.
Pull request overview
This PR enables CNPG (CloudNative-PG) instance manager in-place updates by setting the ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES environment variable to true in the CNPG operator deployment. This is accomplished by adding the environment variable to the Helm chart values, and includes a CI verification step to ensure the configuration is correctly propagated.
Changes:
- Configure CNPG operator to enable instance manager in-place updates via Helm values
- Add CI verification step to validate the environment variable is correctly set in the deployed CNPG operator
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| operator/documentdb-helm-chart/values.yaml | Adds additionalEnv configuration to pass ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=true to the CNPG operator |
| .github/actions/setup-test-environment/action.yml | Adds verification step to confirm the environment variable is correctly set in the CNPG deployment after operator installation |
- Set ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=true in cloudnative-pg Helm values to enable instance manager in-place updates - Add CI step in setup-test-environment to verify the env var is correctly propagated to the CNPG operator deployment Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
0d4e1d2 to
b27f5c8
Compare
- Add DOCUMENTDB_COMBINED_IMAGE env var for released operator (no ImageVolume) - Add 'Determine initial DocumentDB image' step that checks released chart CRD for postgresImage field to decide between combined vs extension image - Update setup-test-environment to use dynamically resolved DOCUMENTDB_INITIAL_IMAGE - Fix Step 2 baseline checks to use INITIAL_IMAGE and flexible schema version check - Fix sidecar-injector verification to query cnpg-system namespace - Fix CRD name from db.documentdb.io to dbs.documentdb.io - Add design docs for dual-mode deployment, rollback support, and unified upgrade Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
bdc7888 to
8d7715e
Compare
Replace CRD postgresImage field check with simple semver comparison. Versions <= 0.1.3 use combined image, versions > 0.1.3 use extension image. Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
When using a combined image with older released operator versions, status.schemaVersion may not be populated. Remove the hard failure and just log the value. Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ed combined mode When the released operator version <= 0.1.3 deploys in combined mode, the cluster must be recreated under the upgraded operator to switch to ImageVolume mode before testing extension/gateway upgrades. - Persist USE_COMBINED to GITHUB_ENV for later step branching - Add 4 conditional steps (gated by USE_COMBINED == true): 1. Recreate cluster: delete combined-mode CR, create fresh with extension image 2. Setup port forwarding for re-seeding 3. Re-seed test data (2 documents) 4. Cleanup port forwarding - Update DOCUMENTDB_INITIAL_IMAGE after recreation so Step 2 baseline passes - All new steps marked with TODO: Remove once released version > 0.1.3 Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
f2974a4 to
79ea1bf
Compare
…Version validation Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
79ea1bf to
1c65ed9
Compare
# Conflicts: # .github/actions/setup-test-environment/action.yml # .github/workflows/test-upgrade-and-rollback.yml
7d31aa3 to
88939d8
Compare
…mode When the released operator version <= 0.1.3 uses combined mode, the gateway image must be a publicly available image (the combined image), not the CI-built gateway image which only exists locally. Signed-off-by: Wenting Wu <wentingwu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The helm chart defaults to imagePullPolicy=Always, but during the upgrade test the CI-built operator and sidecar images are only available locally in the Kind cluster (loaded via kind load docker-image). Set pullPolicy to IfNotPresent so pods use the local images instead of trying to pull from GHCR where they don't exist. Signed-off-by: Wenting Wu <wentingwu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xgerman
left a comment
There was a problem hiding this comment.
Code Review — PR #272: Enable CNPG in-place updates and add operator upgrade test
Files: 3 changed (+404, −22) — action.yml, test-upgrade-and-rollback.yml, values.yaml
Summary
This PR does two things:
-
Enables CNPG in-place updates by setting
ENABLE_INSTANCE_MANAGER_INPLACE_UPDATES=truein the Helm chart's CNPG sub-chart values. This allows CNPG's instance manager to update pods without rolling restarts. -
Adds an operator control-plane upgrade test (Step 1) to the upgrade/rollback CI workflow. The test installs the operator from the public Helm repo ("released" version), deploys a DocumentDB cluster, then upgrades the operator to the locally-built version via
helm upgrade. It verifies DB pods aren't restarted and data persists.
The version-gating logic for combined→ImageVolume transition (≤0.1.3) is well thought out with clear TODO markers.
Overall: Well-structured approach. Several issues need attention before merge.
🔴 Critical
1. Recreated cluster spec is missing documentDbCredentialSecret
In the "Recreate cluster for ImageVolume mode" step, the inline DocumentDB YAML doesn't include documentDbCredentialSecret:
spec:
nodeCount: ...
documentDBImage: $OLD_EXTENSION
gatewayImage: $OLD_GATEWAY
resource:
storage:
pvcSize: 5Gi
storageClass: csi-hostpath-sc
exposeViaService:
serviceType: ClusterIP
# Missing: documentDbCredentialSecretThe setup action creates a secret named documentdb-credentials. If the operator requires this reference to configure auth, the re-seed mongosh step will fail with auth errors. Verify the operator picks up documentdb-credentials by convention, or add the field explicitly.
2. PVC conflict risk during cluster recreation
When deleting and immediately recreating the cluster with the same name/namespace, orphaned PVCs may interfere. The step waits for pods to terminate but doesn't clean up PVCs. CNPG uses OwnerReference for PVC management — if PVCs aren't garbage-collected before the new cluster starts, it may fail with binding errors. Add:
echo "Cleaning up old PVCs..."
kubectl delete pvc -n $DB_NS -l cnpg.io/cluster=$DB_NAME --wait=true --timeout=60s || true🟠 Major
3. Hardcoded DOCUMENTDB_COMBINED_IMAGE uses Microsoft internal registry
DOCUMENTDB_COMBINED_IMAGE: ghcr.io/microsoft/documentdb/documentdb-local:16This is from ghcr.io/microsoft/documentdb — not the public ghcr.io/documentdb registry. If this registry requires authentication, the CI will fail on external contributor PRs (which don't get GITHUB_TOKEN with package read access to Microsoft's org). Either verify it's publicly pullable or mirror to ghcr.io/documentdb.
4. RELEASED_CHART_VERSION has no workflow_dispatch override
Unlike IMAGE_TAG which has an input override, RELEASED_CHART_VERSION is hardcoded to 'latest'. For testing specific release upgrade scenarios (e.g., from 0.1.2), add a workflow_dispatch input:
workflow_dispatch:
inputs:
released_chart_version:
description: 'Released chart version to upgrade from (default: latest)'
required: false
type: string
default: 'latest'5. Version comparison may fail on pre-release strings
The sort -V comparison works for clean semver but may produce incorrect results for pre-release versions. 0.1.3-rc1 would sort as > 0.1.3 with sort -V, incorrectly choosing extension mode when combined mode is needed. Consider stripping pre-release suffixes before comparison.
6. CRD name bug fix is buried in refactoring
The step changed kubectl get crd db.documentdb.io → kubectl get crd dbs.documentdb.io. This is a real bug fix (the CRD is dbs.documentdb.io), but it's mixed into the verify step refactoring. Please call this out in the PR description since it changes behavior for all install paths.
🟡 Minor
7. Inconsistent conditional syntax
The Cleanup port forwarding after re-seeding step uses:
if: ${{ always() && env.USE_COMBINED == 'true' }}But Re-seed test data uses bare:
if: env.USE_COMBINED == 'true'Pick one style. ${{ }} is preferred when using functions like always().
8. TODO comments should reference issue numbers
# TODO: Remove this step once release versions > 0.1.3
# TODO: Remove once we deprecate combined modeReference a tracking issue: # TODO(#271): Remove once... so these don't get forgotten.
9. Shell variable interpolation in single-quoted bash -c strings
timeout 300 bash -c '
while true; do
DB_STATUS=$(kubectl get documentdb '$DB_NAME' -n '$DB_NS' ...)The '$DB_NAME' pattern breaks out of single quotes and re-enters, relying on concatenation. This works but is fragile. Consider passing variables as positional arguments:
timeout 300 bash -c '
while true; do
DB_STATUS=$(kubectl get documentdb "$1" -n "$2" ...)
done
' -- "$DB_NAME" "$DB_NS"🟢 Nitpicks
10. helm repo add is called twice (once in "Determine initial DocumentDB image", once in action's "Install released chart"). Acceptable for robustness but noting the duplication.
11. Step renumbering (1→2, 2→3, 3→4) is correct. ✅
12. The values.yaml additionalEnv addition is correct per CNPG chart 0.27.0's values schema. ✅
Positive Feedback
- Pod UID stability check — Verifying pod UIDs don't change during operator upgrade is a smart test that catches unintended restarts.
- Combined→extension transition handling — The version-gating logic with clear TODO markers for removal is well-designed.
- Verify step refactoring — Extracting verification into a standalone unconditional step benefits all install paths and fixes the stale CRD name bug.
- Clear step summary — The
GITHUB_STEP_SUMMARYincludes all relevant image/version info.
Questions
- Is
ghcr.io/microsoft/documentdb/documentdb-local:16publicly pullable without auth? - Does the operator default to the
documentdb-credentialssecret whendocumentDbCredentialSecretis omitted from the spec? - When the released chart cluster is deleted/recreated, are orphaned PVCs cleaned up by the garbage collector quickly enough?
/request-changes
|
Thanks for the thorough review @xgerman! I've pushed commit 39d4b52 addressing your feedback. Here's the item-by-item response: Critical1. Missing 2. PVC conflict risk ✅ Fixed. Added explicit PVC cleanup ( Major3. 4. 5. Pre-release version comparison Fixed. Added 6. CRD name bug fix Fixed. Also caught that Minor7. Inconsistent conditional syntax Fixed. Standardized to bare 8. TODO issue references Will create a tracking issue for combined mode deprecation and update the TODOs. 9. bash -c variable passing Fixed. All Nitpicks10-12: Acknowledged. The duplicate Answers to Questions
|
- Add workflow_dispatch input for released_chart_version override - Strip pre-release suffixes before semver comparison with sort -V - Add PVC cleanup before cluster recreation to prevent binding conflicts - Fix CRD name typo: db.documentdb.io to dbs.documentdb.io in verify step - Refactor bash -c blocks to use positional args instead of fragile single-quote concatenation for shell variable passing - Standardize conditional syntax (remove unnecessary expression brackets) - Wire released_chart_version input to RELEASED_CHART_VERSION env var Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
39d4b52 to
863f818
Compare
xgerman
left a comment
There was a problem hiding this comment.
#3 (still open): DOCUMENTDB_COMBINED_IMAGE uses Microsoft internal registry
DOCUMENTDB_COMBINED_IMAGE: ghcr.io/microsoft/documentdb/documentdb-local:16
This is from ghcr.io/microsoft/documentdb, not the public ghcr.io/documentdb registry. If it requires auth, external
PRs will fail. Please either:
- Confirm it's publicly pullable (and add a comment noting this), or
- Mirror to ghcr.io/documentdb and reference from there
New: Local chart path now has double verification
The refactoring extracted the verification logic into a standalone Verify operator installation step (unconditional),
but the Install DocumentDB Operator (local chart) step still has its own inline verification block (~20 lines). For
the local chart path, both run sequentially — redundant but not harmful. Consider removing the inline verification
from the local chart step to keep things DRY, or leave it with a comment noting the redundancy.
Uh oh!
There was an error while loading. Please reload this page.