
Add automated backups via CronJob with S3 storage and retention #29

Closed
colinmollenhour wants to merge 1 commit into main from worktree-wishlist

Conversation

@colinmollenhour
Collaborator

Summary

Implements wishlist item #5 (automated backup and restore) for MysqlFailoverGroup. The operator now reconciles a Kubernetes CronJob that runs xtrabackup or mysqldump against the replica site and streams gzip-compressed output directly to S3, with retention by count and/or age enforced at the end of each job.

  • spec.backup — schedule, method (xtrabackup/mysqldump), S3 storage, retention, resources
  • status.backup — last backup time, site, duration, result, message
  • Backup target switches automatically on failover (reads status.activeSite)
  • Prometheus metrics: bloodraven_backup_last_success_timestamp_seconds, _last_attempt_timestamp_seconds, _duration_seconds (histogram), _total{result} (counter)
  • CronJob uses ConcurrencyPolicy: Forbid to prevent overlapping runs
  • PITR is reserved in the API (spec.backup.pitr) but emits a warning event — binlog archiving is not yet implemented
  • Restore is a documented manual procedure (xtrabackup and mysqldump runbooks in backup.go header comment)

Note: this takes a different approach than #27 (which uses mysqlsh). This one uses a CronJob + embedded shell script with xtrabackup/mysqldump streaming directly to S3.
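For illustration, a `spec.backup` stanza might look like the sketch below. Only the top-level concepts (schedule, method, S3 storage, retention) are confirmed by the summary above; the nested field names (`bucket`, `prefix`, `endpoint`, `count`, `maxAge`) are hypothetical placeholders, not the actual CRD schema:

```yaml
apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
  name: example-group
spec:
  # ... existing failover group fields ...
  backup:
    schedule: "0 3 * * *"      # standard CronJob schedule expression
    method: xtrabackup         # or: mysqldump
    s3:                        # nested field names below are illustrative
      bucket: my-backups
      prefix: prod/example-group
      endpoint: https://minio.internal:9000   # optional, for non-AWS S3
    retention:
      count: 14                # keep at most N backups
      maxAge: 720h             # and/or prune backups older than this
```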

Test plan

  • go build ./...
  • go vet ./...
  • go test -race ./... (all existing tests plus new TestReconcile_CreatesBackupCronJob, TestReconcile_BackupTargetsReplicaSite, TestReconcile_BackupCleanupWhenRemoved, TestReconcile_BackupInvalidS3Config, TestRenderBackupPrefix, TestReplicaSite, TestBackupImage, TestBackupStatusFromJob_Success, TestBackupStatusFromJob_Failure, TestReconcileBackupStatus_UpdatesFromJob, TestBackupCronJobOwnerReference)
  • make generate && make manifests — CRD and RBAC regenerated from markers
  • Apply to a dev cluster and verify: CronJob is created, backup job runs against replica, S3 object appears, status.backup is populated, retention cleanup prunes old backups
  • Failover mid-run: verify next scheduled backup targets the new replica

🤖 Generated with Claude Code

Implements spec.backup for MysqlFailoverGroup: the operator reconciles a
Kubernetes CronJob that runs xtrabackup or mysqldump against the replica
site (non-active per status.activeSite, so the target switches after
failover) and streams gzip-compressed output directly to S3. Retention
is enforced by count and/or age at the end of each backup job.

Backup state is surfaced via status.backup (last backup time, site,
duration, result, message) and Prometheus metrics: last success and
last attempt timestamps, a duration histogram, and a success/failure
counter. ConcurrencyPolicy=Forbid prevents overlapping runs. Restore
remains a documented manual procedure (xtrabackup and mysqldump runbooks
in backup.go header).

PITR (binlog archiving) is reserved in the API but not yet implemented;
setting spec.backup.pitr=true emits a warning event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 10, 2026 15:56
Contributor

Copilot AI left a comment


Pull request overview

Adds automated MySQL backups to MysqlFailoverGroup by having the operator reconcile a Kubernetes CronJob that targets the replica site and streams backups to S3, while also surfacing status and Prometheus metrics for backup outcomes.

Changes:

  • Introduces spec.backup / status.backup API fields (CRD + Go types + deepcopy) and reconciles a per-group backup CronJob.
  • Adds embedded backup shell script + reconciliation logic to derive backup status from Jobs and emit Prometheus metrics.
  • Updates RBAC (operator + Helm chart) to allow managing batch CronJobs/Jobs.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 6 comments.

Summary per file:

  • internal/metrics/metrics.go: Adds Prometheus metrics for backup attempt/success timestamps, duration histogram, and total counter.
  • internal/controller/reconciler.go: Wires backup reconciliation into the main reconciler, adds RBAC markers, and sets the controller to own CronJobs.
  • internal/controller/backup.go: Implements CronJob rendering, replica-site targeting, status derivation from Jobs, and metrics emission.
  • internal/controller/backup_test.go: Adds unit tests covering CronJob creation, replica targeting, cleanup, prefix rendering, image selection, and status updates.
  • internal/controller/backup_script.sh: Embedded backup/retention script that runs xtrabackup/mysqldump and uploads artifacts/metadata to S3.
  • config/rbac/role.yaml: Grants operator ClusterRole access to batch CronJobs/Jobs (and includes policy PDB permissions).
  • config/crd/bases/shipstream.io_mysqlfailovergroups.yaml: CRD schema updates for backup configuration and status reporting.
  • charts/bloodraven/templates/clusterrole.yaml: Helm chart RBAC updated for batch CronJobs/Jobs.
  • api/v1alpha1/zz_generated.deepcopy.go: Generated deepcopy updates for new backup-related API structs/fields.
  • api/v1alpha1/types.go: Adds BackupSpec / S3 storage / retention structs and adds backup status to the CR status.


Comment on lines +66 to +73
case "$BACKUP_METHOD" in
  xtrabackup)
    xtrabackup --backup --stream=xbstream \
      --host="$MYSQL_HOST" --port="$MYSQL_PORT" \
      --user="$USER" --password="$PASS" \
      | gzip -c \
      | aws $S3_FLAGS s3 cp - "$S3_URI"
    ;;

Copilot AI Apr 10, 2026


The backup pipeline can report success even if xtrabackup or gzip fails, because POSIX sh pipelines return the exit status of the last command (aws cp). Without pipefail (not available in many /bin/sh), an upstream failure can be masked and produce a corrupted/empty backup while the Job still completes successfully. Consider switching to bash with set -o pipefail, or restructuring the script to reliably propagate failures from all pipeline stages.
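A minimal sketch of the masking behavior and a portable fix (not the PR's actual script; the status-file approach shown is one option that works even where `set -o pipefail` is unavailable):

```shell
#!/bin/sh
# In plain /bin/sh, a pipeline reports the exit status of its LAST command,
# so an upstream failure is silently lost:
false | cat > /dev/null
echo "plain sh pipeline status: $?"   # prints 0 even though `false` failed

# Portable workaround without bash's `set -o pipefail`: record the upstream
# stage's failure in a status file, then check it after the pipeline.
err=$(mktemp)
{ false || echo failed > "$err"; } | cat > /dev/null
if [ -s "$err" ]; then
  echo "upstream stage failed"
fi
rm -f "$err"
```

With bash available in the backup image, `set -o pipefail` is the simpler route; the status-file pattern above is the fallback when the image only ships a POSIX `/bin/sh`.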

Comment on lines +74 to +81
  mysqldump)
    mysqldump --single-transaction --routines --triggers --events \
      --set-gtid-purged=ON \
      -h "$MYSQL_HOST" -P "$MYSQL_PORT" -u "$USER" -p"$PASS" \
      --all-databases \
      | gzip -c \
      | aws $S3_FLAGS s3 cp - "$S3_URI"
    ;;

Copilot AI Apr 10, 2026


Same pipeline-exit-status issue applies to the mysqldump path: if mysqldump/gzip fails, the pipeline may still exit 0 if aws s3 cp succeeds, leading to a "successful" Job with an invalid backup artifact. Use a shell with pipefail support or otherwise ensure failures from mysqldump/gzip propagate to the script exit code.

Comment on lines +25 to +30
# Parse user:password@tcp(host:port)/db style DSN into discrete pieces for the
# backup tools. Extremely simple parse — assumes the DSN was produced by the
# operator's conventions and contains user:pass@tcp(host:port).
USER=$(printf '%s' "$MYSQL_DSN" | sed -E 's|^([^:]+):.*|\1|')
PASS=$(printf '%s' "$MYSQL_DSN" | sed -E 's|^[^:]+:([^@]*)@.*|\1|')


Copilot AI Apr 10, 2026


Parsing MYSQL_DSN with sed is brittle and will break for valid DSNs where the password contains @ or : (and for other DSN variants). This can cause backups to fail even though the Secret’s DSN works for the sidecar/operator. Prefer deriving MYSQL_USER/MYSQL_PASSWORD in Go using a DSN parser and passing them as separate env vars (or store them explicitly), instead of attempting to parse the DSN in shell.
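Passing discrete env vars from Go is the cleaner fix, but if the split must stay in shell, POSIX parameter expansion is considerably more robust than the `sed` approach: splitting at the *last* `@` and the *first* `:` tolerates passwords containing both characters (a sketch with a made-up DSN; it still assumes the user name itself contains no colon):

```shell
#!/bin/sh
# Split user:pass@tcp(host:port)/db at the LAST '@' and FIRST ':' so the
# password may contain '@' and ':'. Example DSN is illustrative only.
MYSQL_DSN='alice:p@ss:w0rd@tcp(db.example:3306)/app'

userpass=${MYSQL_DSN%@*}   # strip the shortest '@...' suffix => split at last '@'
USER=${userpass%%:*}       # everything before the first ':'
PASS=${userpass#*:}        # everything after the first ':'

echo "USER=$USER"          # USER=alice
echo "PASS=$PASS"          # PASS=p@ss:w0rd
```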

Comment on lines +31 to +35
# aws-cli endpoint flag (optional, for MinIO / non-AWS S3)
S3_FLAGS=""
if [ -n "${S3_ENDPOINT:-}" ]; then
  S3_FLAGS="--endpoint-url $S3_ENDPOINT"
fi

Copilot AI Apr 10, 2026


S3_FLAGS is constructed via string concatenation and later expanded unquoted in aws $S3_FLAGS ..., which allows unintended argument injection if S3_ENDPOINT contains spaces or leading dashes (and can also mis-handle endpoints with special characters). Prefer passing --endpoint-url as its own quoted argument (e.g., conditionally adding it as a separate word) rather than building an interpolated flag string.

Suggested change
# aws-cli endpoint flag (optional, for MinIO / non-AWS S3)
S3_FLAGS=""
if [ -n "${S3_ENDPOINT:-}" ]; then
  S3_FLAGS="--endpoint-url $S3_ENDPOINT"
fi
# Invoke aws-cli with an optional endpoint URL, passing each CLI argument as
# its own shell word so S3_ENDPOINT cannot be split or interpreted as extra
# flags.
aws_cli() {
  if [ -n "${S3_ENDPOINT:-}" ]; then
    aws --endpoint-url "$S3_ENDPOINT" "$@"
  else
    aws "$@"
  fi
}
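The hazard is easy to demonstrate with a stand-in `aws` function that simply counts the arguments it receives (a sketch; the endpoint value is deliberately contrived):

```shell
#!/bin/sh
# Mock `aws` as a function that prints its argument count, so we can see how
# an unquoted flag string word-splits versus a quoted per-argument call.
aws() { echo "$#"; }

S3_ENDPOINT='http://minio:9000 --debug'   # endpoint value containing a space
S3_FLAGS="--endpoint-url $S3_ENDPOINT"

aws $S3_FLAGS s3 cp                       # unquoted: splits into 5 words,
                                          # '--debug' becomes an extra flag
aws --endpoint-url "$S3_ENDPOINT" s3 cp   # quoted: 4 words, endpoint intact
```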

Comment on lines +343 to +360
// Emit metrics (idempotent for completed Jobs — gauges and counters are
// label-scoped; re-setting is safe and counter increments are guarded by
// the semantic-equality check above).
emitBackupMetrics(fg, &latest, newStatus)

return k8sretry.RetryOnConflict(k8sretry.DefaultRetry, func() error {
    var fresh v1alpha1.MysqlFailoverGroup
    if err := r.Get(ctx, types.NamespacedName{Namespace: fg.Namespace, Name: fg.Name}, &fresh); err != nil {
        return err
    }
    if fresh.Status.Backup != nil && equality.Semantic.DeepEqual(fresh.Status.Backup, newStatus) {
        return nil
    }
    fresh.Status.Backup = newStatus
    if err := r.Status().Update(ctx, &fresh); err != nil {
        logger.Error(err, "update backup status")
        return err
    }

Copilot AI Apr 10, 2026


Metrics are emitted before the status update is persisted. If Status().Update fails (conflict/transient error) after emitBackupMetrics runs, the next reconcile will re-emit metrics for the same completed Job and double-increment BackupTotal/re-observe the histogram. To keep metrics idempotent, emit metrics only after a successful status write, or include a stable “observed job” identifier in status and gate metrics emission on that.

Comment on lines +442 to +464
// emitBackupMetrics updates Prometheus metrics based on the latest Job state.
func emitBackupMetrics(fg *v1alpha1.MysqlFailoverGroup, job *batchv1.Job, status *v1alpha1.BackupStatus) {
    labels := map[string]string{"failover_group": fg.Name}

    if status.LastBackupTime != nil {
        metrics.BackupLastAttemptTimestamp.With(labels).Set(float64(status.LastBackupTime.Unix()))
    }

    switch status.LastBackupResult {
    case "Success":
        metrics.BackupTotal.WithLabelValues(fg.Name, "success").Inc()
        if status.LastSuccessfulBackup != nil {
            metrics.BackupLastSuccessTimestamp.With(labels).Set(float64(status.LastSuccessfulBackup.Unix()))
        }
        if status.LastBackupDurationSeconds != nil {
            metrics.BackupDurationSeconds.WithLabelValues(fg.Name, backupMethod(fg)).
                Observe(float64(*status.LastBackupDurationSeconds))
        }
    case "Failure":
        metrics.BackupTotal.WithLabelValues(fg.Name, "failure").Inc()
    }
    _ = job
}

Copilot AI Apr 10, 2026


emitBackupMetrics labels duration observations using backupMethod(fg) (current spec) rather than the method that actually ran for the Job. If the method is changed while/after a Job runs, the duration metric can be mislabeled. Since job is already passed in (but currently unused), consider deriving the method label from the Job (e.g., a label/annotation or the container env in the Job template) to ensure metrics reflect the executed backup method.

@colinmollenhour
Collaborator Author

Closed in favor of #30

@colinmollenhour colinmollenhour deleted the worktree-wishlist branch April 13, 2026 19:19