Support Kubernetes node upgrade and maintenance operations #305

@WentingWu666666

Summary

The DocumentDB operator does not currently expose configuration options needed for safe Kubernetes node maintenance (OS updates, hardware replacement, K8s version upgrades). This enhancement requests adding first-class support for node maintenance scenarios through CRD configuration and documentation.

Problem

When users need to drain nodes running DocumentDB pods, they may encounter the following issues:

  • Single-instance clusters blocked by PDBs: With instancesPerNode: 1, the operator-created PodDisruptionBudgets (PDBs) block kubectl drain, causing it to hang indefinitely. There is no way to disable or override PDBs from the DocumentDB CRD.
  • No enablePDB configuration: Users cannot toggle PDB creation from the DocumentDB custom resource, which is necessary for single-instance dev/test clusters where node drains are common.
  • No nodeMaintenanceWindow settings: There is no way to signal to the operator that planned maintenance is in progress. Such a signal would disable self-healing and allow pods to be safely evicted and rescheduled.
  • No documented guidance: Users performing node maintenance have no official documentation on how to safely drain nodes running DocumentDB workloads.
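
To make the first point concrete, here is a rough sketch of the kind of PDB that blocks a drain. The resource name and the selector label are assumptions for illustration, not copied from the operator's output. The relevant part is minAvailable: 1 applied to a single matching pod, which leaves zero allowed disruptions.

```yaml
# Illustrative sketch only: the name and label below are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-documentdb-primary          # hypothetical name
spec:
  minAvailable: 1                      # at least one matching pod must stay up
  selector:
    matchLabels:
      cnpg.io/cluster: my-documentdb   # assumed label
```

With exactly one matching pod, every eviction request issued by kubectl drain is refused and retried, which is why the drain appears to hang.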

What CNPG Provides (Reference)

CloudNativePG has comprehensive support for node maintenance via:

  • spec.enablePDB: Toggles PodDisruptionBudgets on/off (needed for single-instance dev clusters)
  • spec.nodeMaintenanceWindow.inProgress: Disables self-healing during planned maintenance
  • spec.nodeMaintenanceWindow.reusePVC: Controls whether to wait for the node to come back (true, reuses existing PVCs) or re-clone data to a new node (false)

Reference: https://github.com/cloudnative-pg/cloudnative-pg/blob/main/docs/src/kubernetes_upgrade.md
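
For reference, a minimal sketch of how the three fields listed above appear on a CNPG Cluster resource. The cluster name, instance count, and storage size are placeholder values chosen for illustration.

```yaml
# Sketch of a CNPG Cluster using the fields above; name and sizes are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-cluster
spec:
  instances: 1
  enablePDB: false          # do not create PodDisruptionBudgets
  nodeMaintenanceWindow:
    inProgress: true        # planned maintenance: self-healing is paused
    reusePVC: true          # wait for the node to return and reuse its PVCs
  storage:
    size: 1Gi
```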

Proposed Scope

  1. Expose enablePDB in the DocumentDB CRD: Allow users to disable PDB creation, or handle it automatically based on instancesPerNode (e.g., skip PDB when instancesPerNode: 1).
  2. Consider exposing nodeMaintenanceWindow settings: For advanced use cases such as bare-metal clusters or local storage setups, allow users to set inProgress and reusePVC flags.
  3. Add documentation for Kubernetes node maintenance procedures: Document the recommended steps for safely performing node maintenance (drain, cordon, upgrade, uncordon) with DocumentDB clusters, covering both single-instance and multi-instance configurations.
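
As a strawman for items 1 and 2, the DocumentDB spec could look roughly like the fragment below. None of the new fields exist in the CRD today. The names enablePDB and nodeMaintenanceWindow simply mirror CNPG and are hypothetical here, and the final shape is up to the maintainers.

```yaml
# Hypothetical DocumentDB spec fragment; only instancesPerNode exists today.
spec:
  instancesPerNode: 1
  enablePDB: false          # proposed: skip PDB creation for this cluster
  nodeMaintenanceWindow:    # proposed: passed through to the underlying CNPG Cluster
    inProgress: true
    reusePVC: true
```

One possible default, in line with item 1, would be to treat an unset enablePDB as false when instancesPerNode is 1 and as true otherwise.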

Current Workaround

  • Multi-instance clusters (2+ instances): Users can drain nodes safely since PDBs allow one pod to be evicted at a time with automatic failover. The operator handles rescheduling automatically.
  • Single-instance clusters: There is no workaround without manually editing the underlying CNPG Cluster resource to set enablePDB: false or configure the maintenance window, bypassing the DocumentDB operator's reconciliation.

Additional Context

This is particularly important for:

  • AKS/EKS/GKE users who need to perform regular node pool upgrades
  • Dev/test environments running single-instance DocumentDB clusters
  • Bare-metal / on-prem deployments where node maintenance is manually managed

Labels: enhancement
