| 1 | +# ADR-012: MongoDB Migration from Bitnami |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +The NVSentinel platform relies on MongoDB as its primary data store for persisting health events. Previously, this was deployed via a heavily customized Bitnami Helm chart, vendored as a subchart within our `mongodb-store` component. |
| 6 | + |
| 7 | +While functional, this approach presented significant operational challenges: |
| 8 | +- **Upstream Image Deprecations:** As announced in the [official Bitnami containers issue #83267](https://github.com/bitnami/containers/issues/83267), effective August 28th, 2025, Bitnami moved most versioned container images to a read-only `docker.io/bitnamilegacy` registry, with no further updates or security patches. Our deployment depended on these now-archived images, materially increasing long-term maintenance and security risk. See also: [Appsmith's guidance on Bitnami image deprecation](https://docs.appsmith.com/getting-started/setup/instance-management/bitnami-image-deprecation). |
| 9 | +- **Manual Lifecycle Management:** The Bitnami chart directly created a Kubernetes `StatefulSet`. All "Day 2" operations—such as database version upgrades, scaling, member recovery, and configuration changes—were manual, brittle, and required direct `kubectl` intervention. |
| 10 | +- **Lack of Integrated Features:** Critical production features like automated backups, point-in-time recovery (PITR), and advanced monitoring were not part of the chart. Implementing them would require building and maintaining separate, complex solutions. |
| 11 | +- **Complex Templating:** To achieve our security requirements (TLS/mTLS with `cert-manager`, X.509 authentication), we had to write complex Go template logic within our `mongodb-store` chart. This included loops to generate per-replica certificates, which was fragile and hard to maintain. |
| 12 | + |
| 13 | +The goal was to modernize our database layer by adopting a Kubernetes Operator, which automates lifecycle management and provides a declarative API for managing the database. |
| 14 | + |
| 15 | +## Decision: Adopt the Percona Operator for MongoDB |
| 16 | + |
| 17 | +We decided to migrate from the Bitnami Helm chart to the **Percona Operator for MongoDB**. We evaluated the available open-source options for running MongoDB on Kubernetes, including the official MongoDB Community Operator and a self-maintained `StatefulSet` chart. Among fully open-source solutions, Percona offered the best combination (and, for our requirements, effectively the only viable one) of enterprise-grade features such as backups and PITR, operational flexibility, and compatibility with our existing infrastructure. |
| 18 | + |
| 19 | +### Options Considered |
| 20 | + |
| 21 | +#### 1. Official MongoDB Community Operator |
| 22 | + |
| 23 | +- **Pros:** It is the official operator from MongoDB. |
| 24 | +- **Cons:** |
| 25 | + - **Replica Set Only (no documented sharding):** The Community Operator’s primary CRD is `MongoDBCommunity`, which targets replica set deployments; there is no documented sharded-cluster management in the Community Operator. Sharding orchestration is documented for MongoDB’s Enterprise/Atlas tooling, not the Community Operator. See: MongoDB Community Operator Helm chart and operator repo overviews. |
| 26 | + - **No Integrated Backup/PITR:** The Community Operator does not provide a built-in backup controller or PITR. MongoDB’s integrated backup automation is tied to Ops Manager/Cloud Manager (Enterprise/Atlas); there is no community-equivalent controller comparable to Percona’s PBM integration. |
| 27 | + - **More Basic CRD Surface:** The Community CRD does not document first-class support for pod `sidecars` or a direct `cert-manager` Issuer binding (e.g., no `tls.issuerConf` equivalent). TLS is typically provided via Secrets you create and manage yourself, which is workable but less integrated than Percona’s model for our use case. |
| 28 | + - **Operational Gaps:** Rolling upgrades, observability, and user management can be done, but require more manual integration and/or external systems compared to Percona’s operator. |
| 29 | + |
| 30 | + References: |
| 31 | + - MongoDB Helm charts (Community Operator): https://github.com/mongodb/helm-charts/tree/main/charts/community-operator |
| 32 | + - MongoDB Kubernetes Operator (Community) repository: https://github.com/mongodb/mongodb-kubernetes-operator |
| 33 | + - MongoDB Ops Manager (backup is an Enterprise feature): https://www.mongodb.com/docs/ops-manager/current/backup/ |
| 34 | + |
| 35 | +#### 2. Percona Operator for MongoDB |
| 36 | + |
| 37 | +- **Pros:** |
| 38 | + - **Full Topology Support:** Provides native, declarative support for both replica sets and sharded clusters, ensuring a future-proof growth path. |
| 39 | + - **Integrated, Open-Source "Day 2" Features:** Comes with built-in, declarative APIs for **Percona Backup for MongoDB (PBM)** for automated backups with **Point-in-Time Recovery (PITR)**. PITR is achieved through continuous oplog archival to remote storage, allowing restoration to any specific timestamp. The operator's CRD includes native support for **sidecar containers**, which we use to deploy `mongodb_exporter` for Prometheus metrics collection directly alongside each MongoDB pod. |
| 40 | +  - **Excellent Flexibility:** The custom resource API (defined by the operator's CRD) is highly configurable and designed for integration. Beyond `sidecars`, it exposes an explicit `tls.issuerConf` block for referencing a `cert-manager` issuer. |
| 41 | + - **Open-Source Operator & Tooling:** The operator and backup tooling (PBM) are Apache 2.0; the database server (Percona Server for MongoDB) is SSPL. See Licensing for details. |
| 42 | + |
| 43 | + References: |
| 44 | + - Percona Operator docs (features, sharding, automation): https://docs.percona.com/percona-operator-for-mongodb/ |
| 45 | + - Backups & PITR with PBM (Operator docs): https://docs.percona.com/percona-operator-for-mongodb/backups.html |
| 46 | + - PITR configuration (oplog archival): https://docs.percona.com/percona-operator-for-mongodb/backups.html#store-operations-logs-for-point-in-time-recovery |
| 47 | + |
| 48 | +- **Cons:** |
| 49 | + - Requires another controller (the operator) running in the cluster. This is an acceptable trade-off for the automation benefits gained. |
| 50 | + |
| 51 | +#### 3. Maintain Our Own Custom Helm Chart |
| 52 | + |
| 53 | +- **Pros:** Maximum control over every configuration detail. |
| 54 | +- **Cons:** |
| 55 | + - **Highest Operational Burden:** We would have to manually implement and maintain: |
| 56 | + - Our own Helm chart templates for `StatefulSet`, `Service`, `PersistentVolumeClaim`, and networking |
| 57 | + - Scripted logic for all "Day 2" operations: version upgrades (requiring manual rolling restart strategies), horizontal scaling (adding/removing replica set members), configuration changes, and pod recovery |
| 58 | + - TLS certificate lifecycle management (generation, rotation, distribution to each pod) |
| 59 | + - MongoDB-specific operational knowledge embedded in scripts rather than leveraged from a battle-tested operator |
| 60 | + - **Container Image Management:** We would need to either: |
| 61 | + - Build and maintain our own MongoDB container images with security patches and updates, including a CI/CD pipeline for image builds, vulnerability scanning, and registry hosting |
| 62 | + - Or depend on upstream images (MongoDB Community, Percona Server for MongoDB, or similar) without operator-level lifecycle automation, requiring manual intervention for breaking changes or version migrations |
| 63 | + - **Ongoing Maintenance Costs:** Every MongoDB version upgrade, security patch, or operational pattern change would require custom chart updates, testing, and rollout procedures. This essentially means re-implementing operator functionality piecemeal, without the benefit of community testing, documentation, or support that comes with established operators like Percona's. |
| 64 | + |
| 65 | +### Licensing and Open Source Model |
| 66 | + |
| 67 | +A key factor in this decision was Percona's commitment to open source, which differs significantly from MongoDB's own licensing strategy. |
| 68 | + |
| 69 | +- **Percona Operator & Tools (Apache 2.0):** The |
| 70 | +`percona-server-mongodb-operator` and key ecosystem |
| 71 | +tools like `percona-backup-mongodb` are licensed under the permissive |
| 72 | +Apache 2.0 license. This provides flexibility and avoids vendor |
| 73 | +lock-in for the management layer. References: |
| 74 | + - Operator license: https://github.com/percona/percona-server-mongodb-operator/blob/main/LICENSE |
| 75 | + - PBM license: https://github.com/percona/percona-backup-mongodb/blob/main/LICENSE |
| 76 | +- **Percona Server for MongoDB (SSPL):** The underlying database server, `Percona Server for MongoDB`, is distributed under the **Server Side Public License (SSPL)** (Percona describes it as “source-available”). Reference: |
| 77 | + - PSMDB license: https://github.com/percona/percona-server-mongodb/blob/v8.0/LICENSE-Community.txt |
| 78 | +- **MongoDB's Licensing:** MongoDB's Community Operator is Apache 2.0, but the database it deploys is SSPL. More advanced operator features (e.g., integrated backup orchestration) live behind the Enterprise/Ops Manager/Atlas ecosystem. |
| 79 | + |
| 80 | +Percona's model pairs a permissively licensed, open-source management layer, which provides enterprise-grade features (like backups) at no license cost, with the same SSPL-licensed database core. This avoids the licensing complexities and costs associated with MongoDB's enterprise offerings. |
| 81 | + |
| 82 | +References: |
| 83 | +- Operator docs (features, configuration, TLS, backups, sidecars): https://docs.percona.com/percona-operator-for-mongodb/index.html |
| 84 | +- Operator release notes (active maintenance cadence): https://docs.percona.com/percona-operator-for-mongodb/RN/index.html |
| 85 | + |
| 86 | + |
| 87 | + |
| 88 | +## Architecture & Implementation |
| 89 | + |
| 90 | +The migration was implemented by making our `mongodb-store` chart a dual-backend system, capable of deploying either the old Bitnami chart or the new Percona stack based on two boolean flags (`useBitnami` and `usePerconaOperator`). Because either backend can be selected without changing the chart itself, this de-risked the migration. |
| 91 | + |
| 92 | +### 1. Conditional Dependencies |
| 93 | + |
| 94 | +The `mongodb-store/Chart.yaml` was modified to conditionally include either Bitnami or the two Percona charts (`psmdb-operator` and `psmdb-db`): |
| 95 | + |
| 96 | +```yaml |
| 97 | +# In distros/kubernetes/nvsentinel/charts/mongodb-store/Chart.yaml (abridged; fields such as `version` are omitted) |
| 98 | +dependencies: |
| 99 | + - name: mongodb # Old Bitnami chart |
| 100 | + condition: mongodb-store.useBitnami |
| 101 | + - name: psmdb-operator # Percona Operator |
| 102 | + condition: mongodb-store.usePerconaOperator |
| 103 | + - name: psmdb-db # Percona Database CRD |
| 104 | + condition: mongodb-store.usePerconaOperator |
| 105 | +``` |
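As a rough illustration, switching backends could look like the values override below. The key paths are taken from the `condition` fields above; the file name `values-percona.yaml` and the practice of keeping one override file per backend are assumptions, not something the chart prescribes.

```yaml
# values-percona.yaml (hypothetical override file)
# Key paths mirror the `condition` entries in Chart.yaml.
mongodb-store:
  useBitnami: false          # skip the legacy Bitnami subchart
  usePerconaOperator: true   # render psmdb-operator and psmdb-db
```

The two conditions are independent, so setting both flags to `true` would render both backends; exactly one should be enabled per release.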
| 106 | + |
| 107 | +### 2. Declarative Configuration via Custom Resource |
| 108 | + |
| 109 | +Instead of directly templating a `StatefulSet`, we now create a high-level `PerconaServerMongoDB` resource. This resource captures our intent, and the operator handles the low-level implementation. |
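For orientation, a minimal `PerconaServerMongoDB` resource might look roughly like this. In our setup the `psmdb-db` subchart is expected to render the resource from `values.yaml`, so this is a sketch of the end result rather than a manifest we maintain by hand; the cluster name, image tag, and storage size are placeholders.

```yaml
# Minimal sketch of a PerconaServerMongoDB resource (illustrative values only).
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb-store                 # hypothetical cluster name
spec:
  image: percona/percona-server-mongodb:8.0.12-4   # assumed tag; pin the version you actually deploy
  replsets:
    - name: rs0
      size: 3                         # 3-member replica set
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 10Gi           # placeholder size
```

The operator reconciles this intent into the underlying `StatefulSet`, Services, and volumes.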
| 110 | + |
| 111 | +**Key Integrations from `values.yaml`:** |
| 112 | + |
| 113 | +- **`cert-manager` Integration:** We continue to use `cert-manager` to manage TLS certificates. Client certificates are generated via cert-manager `Certificate` resources that reference our `mongodb-psmdb-issuer`, and they are provided to the pods via Kubernetes Secrets created by cert-manager. The Percona Operator is configured with `tls.mode: requireTLS`, so the server accepts only TLS connections. |
| 114 | + ```yaml |
| 115 | + tls: |
| 116 | + mode: requireTLS |
| 117 | + ``` |
| 118 | + |
| 119 | +- **First-Class Sidecar Support for Metrics:** Our `mongodb_exporter` for Prometheus was cleanly integrated using the operator's native `sidecars` API. |
| 120 | + ```yaml |
| 121 | + replsets: |
| 122 | + rs0: |
| 123 | + sidecars: |
| 124 | + - name: mongodb-exporter |
| 125 | + image: percona/mongodb_exporter:0.40.0 |
| 126 | + args: |
| 127 | + - --discovering-mode |
| 128 | + - --compatible-mode |
| 129 | + - --collect-all |
| 130 | + - --web.listen-address=:9216 |
| 131 | + - --mongodb.direct-connect |
| 132 | + ports: |
| 133 | + - name: metrics |
| 134 | + containerPort: 9216 |
| 135 | + ``` |
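If the cluster runs the Prometheus Operator (an assumption; this ADR does not say how metrics are scraped), the sidecar's named `metrics` port could be picked up with a `PodMonitor` along these lines. The resource name and the pod label used in the selector are hypothetical and should be checked against the labels on the actual MongoDB pods.

```yaml
# Assumes the Prometheus Operator CRDs are installed; not confirmed by this ADR.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: mongodb-exporter              # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: mongodb-store   # assumed pod label; verify before use
  podMetricsEndpoints:
    - port: metrics                   # the sidecar's named containerPort (9216)
      interval: 30s
```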
| 136 | + |
| 137 | +### 3. Shift in TLS Management |
| 138 | + |
| 139 | +**Bitnami Approach (Previous):** |
| 140 | +- Used cert-manager to generate **per-replica server certificates** via a Go template loop (e.g., `mongo-server-cert-0`, `mongo-server-cert-1`, `mongo-server-cert-2`) |
| 141 | +- Also used cert-manager to generate **client certificates** for application connectivity |
| 142 | +- Both server and client certificates referenced our custom `mongo-ca-issuer` |
| 143 | + |
| 144 | +**Percona Approach (Current):** |
| 145 | +- The Percona Operator **auto-generates server certificates** internally when `tls.mode: requireTLS` is set, eliminating the need for per-replica certificate templates |
| 146 | +- We continue using cert-manager to generate **client certificates** (`mongo-app-client-cert`, `mongo-dgxcops-client-cert`) that reference our custom `mongodb-psmdb-issuer` |
| 147 | +- This hybrid approach simplifies our Helm templates by removing the complex per-replica server certificate loop while maintaining cert-manager integration for client authentication |
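A client certificate of the kind described above could be declared roughly as follows. Only the certificate name and the issuer name come from this ADR; the Secret name, subject, lifetimes, and issuer kind are assumptions.

```yaml
# Sketch of a cert-manager client certificate; field values are illustrative.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mongo-app-client-cert
spec:
  secretName: mongo-app-client-cert   # assumed Secret name
  commonName: mongo-app               # hypothetical X.509 subject for the application user
  duration: 2160h                     # 90 days; placeholder
  renewBefore: 360h                   # renew 15 days before expiry; placeholder
  usages:
    - client auth
  issuerRef:
    name: mongodb-psmdb-issuer
    kind: Issuer                      # assumed; could also be a ClusterIssuer
```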
| 148 | + |
| 149 | +**Simplification Achieved:** |
| 150 | +The removal of the per-replica server certificate generation (the `{{- range $i := until $replicaCount }}` loop in `certmanager.yaml`) significantly reduced template complexity. The operator now handles server certificate provisioning and rotation automatically, while we retain full control over client certificate issuance via our existing cert-manager infrastructure. |
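For context, the removed loop had roughly the following shape. This is a reconstruction for illustration, not the original template: the `.Values.replicaCount` path, the DNS name pattern, and the `usages` list are assumptions, while the certificate names and the `mongo-ca-issuer` reference follow the description above.

```yaml
{{- /* Sketch of the removed per-replica loop; fields are illustrative. */}}
{{- $replicaCount := int .Values.replicaCount }}
{{- range $i := until $replicaCount }}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mongo-server-cert-{{ $i }}
spec:
  secretName: mongo-server-cert-{{ $i }}
  dnsNames:
    # hypothetical per-pod DNS name behind a headless Service
    - mongodb-{{ $i }}.mongodb-headless.{{ $.Release.Namespace }}.svc.cluster.local
  usages:
    - server auth
    - client auth
  issuerRef:
    name: mongo-ca-issuer
    kind: Issuer
{{- end }}
```

Each additional replica meant one more rendered `Certificate` and one more Secret to mount, which is the coupling the operator now absorbs.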
| 151 | + |
| 152 | +### 4. Architectural Shift Summary |
| 153 | + |
| 154 | +| Aspect | Old (Bitnami) | New (Percona Operator) | |
| 155 | +| :--- | :--- | :--- | |
| 156 | +| **Control Model** | **Imperative:** Manually define a `StatefulSet`. | **Declarative:** Define a `PerconaServerMongoDB` resource; the operator builds the `StatefulSet`.| |
| 157 | +| **Lifecycle** | **Manual:** Upgrades, scaling, and recovery are manual `kubectl` tasks. | **Automated:** The operator handles rolling upgrades, scaling, and pod self-healing. | |
| 158 | +| **Backups** | **None:** Required a separate, custom-built solution. | **Integrated:** Declarative, scheduled backups and PITR via the `backup` block in the CRD. | |
| 159 | +| **Integration** | **Brittle:** Required complex template logic in our chart to inject features. | **Flexible:** Native support for `sidecars` via a purpose-built CRD API. Server TLS certificates are auto-generated by the operator; client certificates continue to use our existing cert-manager infrastructure. | |
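To make the `backup` row concrete, a scheduled backup with PITR could be declared along these lines if we enable it. This is a sketch under stated assumptions: the storage name, bucket, region, credentials Secret, and schedule are all hypothetical, and backups are not described as enabled in this ADR.

```yaml
# Sketch of the `backup` block; storage details and schedule are placeholders.
backup:
  enabled: true
  storages:
    s3-primary:                       # hypothetical storage name
      type: s3
      s3:
        bucket: nvsentinel-mongodb-backups     # hypothetical bucket
        region: us-east-1                      # placeholder region
        credentialsSecret: mongodb-backup-s3   # hypothetical Secret holding S3 credentials
  pitr:
    enabled: true                     # continuous oplog archival for point-in-time recovery
  tasks:
    - name: daily
      enabled: true
      schedule: "0 2 * * *"           # 02:00 every day (cron syntax)
      storageName: s3-primary
      compressionType: gzip
```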
| 160 | + |
| 161 | +## Consequences |
| 162 | + |
| 163 | +### Positive Outcomes |
| 164 | + |
| 165 | +1. **Reduced Operational Overhead:** |
| 166 | + - **Automated Lifecycle Management:** The operator handles rolling upgrades, scaling (both horizontal and vertical), and pod self-healing without manual `kubectl` intervention. |
| 167 | + - **Simplified Helm Templates:** Removed 50+ lines of complex Go template logic for per-replica certificate generation (the `{{- range $i := until $replicaCount }}` loop in `certmanager.yaml`). |
| 168 | + - **Declarative Configuration:** Changed from imperative `StatefulSet` definitions to declarative `PerconaServerMongoDB` custom resources, making intent clearer and reducing configuration drift. |
| 169 | + |
| 170 | +2. **Enhanced Production Readiness:** |
| 171 | + - **Integrated Backup Solution:** Clear path to enabling automated backups and Point-in-Time Recovery (PITR) via Percona Backup for MongoDB (PBM), eliminating the need to build a custom backup solution. |
| 172 | + - **Native Metrics Export:** `mongodb_exporter` runs as a sidecar on each MongoDB pod, providing per-replica metrics via a clean CRD API rather than requiring a separate Deployment. |
| 173 | + |
| 174 | +3. **Future-Proof Architecture:** |
| 175 | + - **Sharding Support:** While currently using a 3-node replica set, the operator provides native sharding capabilities if we need to scale beyond a single replica set's capacity. |
| 176 | + - **Active Maintenance:** Percona Operator receives regular updates (latest: 1.21.1, released October 2025) with MongoDB 8.0 support, ensuring compatibility with current MongoDB versions. |
| 177 | +   - **Escape from Deprecated Images:** We no longer depend on images that Bitnami archived in the read-only `docker.io/bitnamilegacy` registry in August 2025, which receive no further updates or security patches. |
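Should we ever outgrow a single replica set, enabling sharding would be a configuration change along these lines. The sizes below are placeholders, and we have not tested this configuration.

```yaml
# Sketch of the `sharding` block; sizes are placeholders.
sharding:
  enabled: true
  configsvrReplSet:
    size: 3                           # config server replica set members
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 3Gi              # placeholder size
  mongos:
    size: 2                           # query router instances
```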
| 178 | + |
| 179 | +### Trade-offs and Considerations |
| 180 | + |
| 181 | +1. **MongoDB-Specific Operator Lock-in:** |
| 182 | + - **Consideration:** We're now dependent on Percona's operator for database management. If we want to switch to a different operator in the future, that would require another migration. |
| 183 | + - **Mitigation:** Percona Operator is open source (Apache 2.0), actively maintained, and has a strong community. |
| 184 | + |
| 185 | +## References |
| 186 | + |
| 187 | +- Percona Operator for MongoDB documentation (features, configuration, TLS, backups, sidecars): https://docs.percona.com/percona-operator-for-mongodb/index.html |
| 188 | +- Percona Operator for MongoDB release notes (active maintenance cadence): https://docs.percona.com/percona-operator-for-mongodb/RN/index.html |
| 189 | +- Percona Server for MongoDB 8.0 release notes (source-available build, compatibility): https://docs.percona.com/percona-server-for-mongodb/8.0/release_notes/8.0.12-4.html |
| 190 | +- Percona Operator license (Apache 2.0): https://github.com/percona/percona-server-mongodb-operator/blob/main/LICENSE |
| 191 | +- Percona Backup for MongoDB (PBM) license (Apache 2.0): https://github.com/percona/percona-backup-mongodb/blob/main/LICENSE |
| 192 | +- Percona Server for MongoDB license (SSPL): https://github.com/percona/percona-server-mongodb/blob/v8.0/LICENSE-Community.txt |