Add automated backups via CronJob with S3 storage and retention #29

colinmollenhour wants to merge 1 commit into main
Conversation
Implements spec.backup for MysqlFailoverGroup: the operator reconciles a Kubernetes CronJob that runs xtrabackup or mysqldump against the replica site (non-active per status.activeSite, so the target switches after failover) and streams gzip-compressed output directly to S3. Retention is enforced by count and/or age at the end of each backup job. Backup state is surfaced via status.backup (last backup time, site, duration, result, message) and Prometheus metrics: last success and last attempt timestamps, a duration histogram, and a success/failure counter. ConcurrencyPolicy=Forbid prevents overlapping runs. Restore remains a documented manual procedure (xtrabackup and mysqldump runbooks in backup.go header). PITR (binlog archiving) is reserved in the API but not yet implemented; setting spec.backup.pitr=true emits a warning event. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds automated MySQL backups to MysqlFailoverGroup by having the operator reconcile a Kubernetes CronJob that targets the replica site and streams backups to S3, while also surfacing status and Prometheus metrics for backup outcomes.
Changes:
- Introduces `spec.backup`/`status.backup` API fields (CRD + Go types + deepcopy) and reconciles a per-group backup `CronJob`.
- Adds embedded backup shell script + reconciliation logic to derive backup status from Jobs and emit Prometheus metrics.
- Updates RBAC (operator + Helm chart) to allow managing `batch` CronJobs/Jobs.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal/metrics/metrics.go | Adds Prometheus metrics for backup attempt/success timestamps, duration histogram, and total counter. |
| internal/controller/reconciler.go | Wires backup reconciliation into the main reconciler, adds RBAC markers, and sets the controller to own CronJobs. |
| internal/controller/backup.go | Implements CronJob rendering, replica-site targeting, status derivation from Jobs, and metrics emission. |
| internal/controller/backup_test.go | Adds unit tests covering CronJob creation, replica targeting, cleanup, prefix rendering, image selection, and status updates. |
| internal/controller/backup_script.sh | Embedded backup/retention script that runs xtrabackup/mysqldump and uploads artifacts/metadata to S3. |
| config/rbac/role.yaml | Grants operator ClusterRole access to batch CronJobs/Jobs (and includes policy PDB permissions). |
| config/crd/bases/shipstream.io_mysqlfailovergroups.yaml | CRD schema updates for backup configuration and status reporting. |
| charts/bloodraven/templates/clusterrole.yaml | Helm chart RBAC updated for batch CronJobs/Jobs. |
| api/v1alpha1/zz_generated.deepcopy.go | Generated deepcopy updates for new backup-related API structs/fields. |
| api/v1alpha1/types.go | Adds BackupSpec / S3 storage / retention structs and adds backup status to the CR status. |
```sh
case "$BACKUP_METHOD" in
  xtrabackup)
    xtrabackup --backup --stream=xbstream \
      --host="$MYSQL_HOST" --port="$MYSQL_PORT" \
      --user="$USER" --password="$PASS" \
      | gzip -c \
      | aws $S3_FLAGS s3 cp - "$S3_URI"
    ;;
```
The backup pipeline can report success even if xtrabackup or gzip fails, because POSIX sh pipelines return the exit status of the last command (aws cp). Without pipefail (not available in many /bin/sh), an upstream failure can be masked and produce a corrupted/empty backup while the Job still completes successfully. Consider switching to bash with set -o pipefail, or restructuring the script to reliably propagate failures from all pipeline stages.
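The masking is easy to demonstrate. In the sketch below, `false` stands in for a failing xtrabackup stage (this assumes bash is available in the backup image, which the script would need anyway for `pipefail`):

```shell
# A failing upstream stage is invisible to plain sh (exit status comes from
# the last stage, here "cat"), but propagates under bash's pipefail.
without=$(sh -c 'false | gzip -c | cat > /dev/null; echo $?')
with=$(bash -c 'set -o pipefail; false | gzip -c | cat > /dev/null; echo $?')
echo "without pipefail: $without"   # → without pipefail: 0
echo "with pipefail: $with"        # → with pipefail: 1
```

With `set -o pipefail` the pipeline's exit status becomes the rightmost non-zero status of any stage, so a failed `xtrabackup` or `gzip` fails the Job even when `aws s3 cp` succeeds.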
```sh
  mysqldump)
    mysqldump --single-transaction --routines --triggers --events \
      --set-gtid-purged=ON \
      -h "$MYSQL_HOST" -P "$MYSQL_PORT" -u "$USER" -p"$PASS" \
      --all-databases \
      | gzip -c \
      | aws $S3_FLAGS s3 cp - "$S3_URI"
    ;;
```
Same pipeline-exit-status issue applies to the mysqldump path: if mysqldump/gzip fails, the pipeline may still exit 0 if aws s3 cp succeeds, leading to a "successful" Job with an invalid backup artifact. Use a shell with pipefail support or otherwise ensure failures from mysqldump/gzip propagate to the script exit code.
```sh
# Parse user:password@tcp(host:port)/db style DSN into discrete pieces for the
# backup tools. Extremely simple parse — assumes the DSN was produced by the
# operator's conventions and contains user:pass@tcp(host:port).
USER=$(printf '%s' "$MYSQL_DSN" | sed -E 's|^([^:]+):.*|\1|')
PASS=$(printf '%s' "$MYSQL_DSN" | sed -E 's|^[^:]+:([^@]*)@.*|\1|')
```
Parsing MYSQL_DSN with sed is brittle and will break for valid DSNs where the password contains @ or : (and for other DSN variants). This can cause backups to fail even though the Secret’s DSN works for the sidecar/operator. Prefer deriving MYSQL_USER/MYSQL_PASSWORD in Go using a DSN parser and passing them as separate env vars (or store them explicitly), instead of attempting to parse the DSN in shell.
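The suggested Go-side fix can be sketched as follows. A real implementation would more likely use the driver's own parser (`mysql.ParseDSN` from go-sql-driver), but even a small stdlib-only parser anchored on the last `@tcp(` avoids the password pitfalls; the function name and DSN shape here are illustrative assumptions, not code from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// splitDSN extracts user and password from a DSN following the operator's
// "user:pass@tcp(host:port)/db" convention. Anchoring on the LAST "@tcp("
// and the FIRST ":" keeps passwords containing '@' or ':' intact, which the
// sed-based shell parse cannot do.
func splitDSN(dsn string) (user, pass string, err error) {
	at := strings.LastIndex(dsn, "@tcp(")
	if at < 0 {
		return "", "", fmt.Errorf("dsn missing @tcp( segment")
	}
	cred := dsn[:at]
	colon := strings.Index(cred, ":")
	if colon < 0 {
		return "", "", fmt.Errorf("dsn missing user:password separator")
	}
	return cred[:colon], cred[colon+1:], nil
}

func main() {
	// Password contains both ':' and '@' — the shell parse would mangle this.
	u, p, err := splitDSN("backup:p@ss:w0rd@tcp(db.example:3306)/mysql")
	fmt.Println(u, p, err) // → backup p@ss:w0rd <nil>
}
```

The operator would then inject the results as separate `MYSQL_USER`/`MYSQL_PASSWORD` env vars on the CronJob's container, and the script never touches the DSN.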
```sh
# aws-cli endpoint flag (optional, for MinIO / non-AWS S3)
S3_FLAGS=""
if [ -n "${S3_ENDPOINT:-}" ]; then
  S3_FLAGS="--endpoint-url $S3_ENDPOINT"
fi
```
S3_FLAGS is constructed via string concatenation and later expanded unquoted in aws $S3_FLAGS ..., which allows unintended argument injection if S3_ENDPOINT contains spaces or leading dashes (and can also mis-handle endpoints with special characters). Prefer passing --endpoint-url as its own quoted argument (e.g., conditionally adding it as a separate word) rather than building an interpolated flag string.
Suggested change:

```diff
-# aws-cli endpoint flag (optional, for MinIO / non-AWS S3)
-S3_FLAGS=""
-if [ -n "${S3_ENDPOINT:-}" ]; then
-  S3_FLAGS="--endpoint-url $S3_ENDPOINT"
-fi
+# Invoke aws-cli with an optional endpoint URL, passing each CLI argument as
+# its own shell word so S3_ENDPOINT cannot be split or interpreted as extra
+# flags.
+aws_cli() {
+  if [ -n "${S3_ENDPOINT:-}" ]; then
+    aws --endpoint-url "$S3_ENDPOINT" "$@"
+  else
+    aws "$@"
+  fi
+}
```
```go
// Emit metrics (idempotent for completed Jobs — gauges and counters are
// label-scoped; re-setting is safe and counter increments are guarded by
// the semantic-equality check above).
emitBackupMetrics(fg, &latest, newStatus)

return k8sretry.RetryOnConflict(k8sretry.DefaultRetry, func() error {
	var fresh v1alpha1.MysqlFailoverGroup
	if err := r.Get(ctx, types.NamespacedName{Namespace: fg.Namespace, Name: fg.Name}, &fresh); err != nil {
		return err
	}
	if fresh.Status.Backup != nil && equality.Semantic.DeepEqual(fresh.Status.Backup, newStatus) {
		return nil
	}
	fresh.Status.Backup = newStatus
	if err := r.Status().Update(ctx, &fresh); err != nil {
		logger.Error(err, "update backup status")
		return err
	}
```
Metrics are emitted before the status update is persisted. If Status().Update fails (conflict/transient error) after emitBackupMetrics runs, the next reconcile will re-emit metrics for the same completed Job and double-increment BackupTotal/re-observe the histogram. To keep metrics idempotent, emit metrics only after a successful status write, or include a stable “observed job” identifier in status and gate metrics emission on that.
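The ordering fix can be modeled in a few lines. This is a toy simulation, not operator code, and the `ObservedJobUID` field is a hypothetical name for the "observed job identifier" the review proposes:

```go
package main

import "fmt"

// status mimics status.backup with a stable identifier of the Job whose
// completion has already been counted.
type status struct{ ObservedJobUID string }

var backupTotal int // stands in for the BackupTotal counter

// reconcile persists status first (the write may fail) and emits metrics
// only after a successful write, gated on the observed-job identifier, so a
// retry after a failed update cannot double-increment the counter.
func reconcile(st *status, jobUID string, writeOK bool) bool {
	if st.ObservedJobUID == jobUID {
		return true // this Job was already observed; nothing to emit
	}
	if !writeOK {
		return false // status write failed: emit nothing, retry next reconcile
	}
	st.ObservedJobUID = jobUID
	backupTotal++ // metrics only after the status write succeeded
	return true
}

func main() {
	st := &status{}
	reconcile(st, "job-1", false) // transient conflict on Status().Update
	reconcile(st, "job-1", true)  // retry succeeds, counter incremented once
	reconcile(st, "job-1", true)  // later reconcile of the same Job: no-op
	fmt.Println("backupTotal =", backupTotal) // → backupTotal = 1
}
```

In the controller this means moving `emitBackupMetrics` inside (or after) the `RetryOnConflict` closure, behind a check that the Job's UID differs from the one recorded in the freshly read status.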
```go
// emitBackupMetrics updates Prometheus metrics based on the latest Job state.
func emitBackupMetrics(fg *v1alpha1.MysqlFailoverGroup, job *batchv1.Job, status *v1alpha1.BackupStatus) {
	labels := map[string]string{"failover_group": fg.Name}

	if status.LastBackupTime != nil {
		metrics.BackupLastAttemptTimestamp.With(labels).Set(float64(status.LastBackupTime.Unix()))
	}

	switch status.LastBackupResult {
	case "Success":
		metrics.BackupTotal.WithLabelValues(fg.Name, "success").Inc()
		if status.LastSuccessfulBackup != nil {
			metrics.BackupLastSuccessTimestamp.With(labels).Set(float64(status.LastSuccessfulBackup.Unix()))
		}
		if status.LastBackupDurationSeconds != nil {
			metrics.BackupDurationSeconds.WithLabelValues(fg.Name, backupMethod(fg)).
				Observe(float64(*status.LastBackupDurationSeconds))
		}
	case "Failure":
		metrics.BackupTotal.WithLabelValues(fg.Name, "failure").Inc()
	}
	_ = job
}
```
emitBackupMetrics labels duration observations using backupMethod(fg) (current spec) rather than the method that actually ran for the Job. If the method is changed while/after a Job runs, the duration metric can be mislabeled. Since job is already passed in (but currently unused), consider deriving the method label from the Job (e.g., a label/annotation or the container env in the Job template) to ensure metrics reflect the executed backup method.
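One way to implement this is to stamp the executed method onto the Job when the CronJob template is rendered and read it back when labeling metrics. The annotation key and helper names below are illustrative assumptions, not code from the PR:

```go
package main

import "fmt"

// Hypothetical annotation key the operator could stamp on each backup Job.
const methodAnnotation = "bloodraven.shipstream.io/backup-method"

// renderJobAnnotations records the method that will actually run, at the
// moment the Job template is rendered.
func renderJobAnnotations(method string) map[string]string {
	return map[string]string{methodAnnotation: method}
}

// methodForMetrics prefers the Job's stamped method over the current spec,
// so duration observations stay correct even if spec.backup.method changes
// after the Job ran.
func methodForMetrics(jobAnnotations map[string]string, specMethod string) string {
	if m, ok := jobAnnotations[methodAnnotation]; ok && m != "" {
		return m
	}
	return specMethod // fallback for Jobs created before the annotation existed
}

func main() {
	ann := renderJobAnnotations("xtrabackup") // Job rendered under xtrabackup
	// spec later switched to mysqldump; this Job's metrics keep "xtrabackup"
	fmt.Println(methodForMetrics(ann, "mysqldump")) // → xtrabackup
}
```

In `emitBackupMetrics` this would replace `backupMethod(fg)` with a lookup on the already-passed `job` (which also removes the `_ = job` placeholder).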
Closed in favor of #30
Summary
Implements wishlist item #5 (automated backup and restore) for MysqlFailoverGroup. The operator now reconciles a Kubernetes `CronJob` that runs xtrabackup or mysqldump against the replica site and streams gzip-compressed output directly to S3, with retention by count and/or age enforced at the end of each job.

- `spec.backup` — schedule, method (xtrabackup/mysqldump), S3 storage, retention, resources
- `status.backup` — last backup time, site, duration, result, message
- Backups target the replica site (non-active per `status.activeSite`)
- Prometheus metrics: `bloodraven_backup_last_success_timestamp_seconds`, `_last_attempt_timestamp_seconds`, `_duration_seconds` (histogram), `_total{result}` (counter)
- `ConcurrencyPolicy: Forbid` to prevent overlapping runs
- PITR is reserved in the API (`spec.backup.pitr`) but only emits a warning event — binlog archiving is not yet implemented
- Restore is a documented manual procedure (runbooks in the `backup.go` header comment)

Note: this takes a different approach than #27 (which uses mysqlsh). This one uses a CronJob + embedded shell script with xtrabackup/mysqldump streaming directly to S3.
Test plan
- `go build ./...`
- `go vet ./...`
- `go test -race ./...` (all existing tests plus new `TestReconcile_CreatesBackupCronJob`, `TestReconcile_BackupTargetsReplicaSite`, `TestReconcile_BackupCleanupWhenRemoved`, `TestReconcile_BackupInvalidS3Config`, `TestRenderBackupPrefix`, `TestReplicaSite`, `TestBackupImage`, `TestBackupStatusFromJob_Success`, `TestBackupStatusFromJob_Failure`, `TestReconcileBackupStatus_UpdatesFromJob`, `TestBackupCronJobOwnerReference`)
- `make generate && make manifests` — CRD and RBAC regenerated from markers

🤖 Generated with Claude Code