
Add automated backups via CronJob with S3 storage and retention #29

Closed
colinmollenhour wants to merge 1 commit into main from worktree-wishlist

Conversation

@colinmollenhour
Collaborator

Summary

Implements wishlist item #5 (automated backup and restore) for MysqlFailoverGroup. The operator now reconciles a Kubernetes CronJob that runs xtrabackup or mysqldump against the replica site and streams gzip-compressed output directly to S3, with retention by count and/or age enforced at the end of each job.

  • spec.backup — schedule, method (xtrabackup/mysqldump), S3 storage, retention, resources
  • status.backup — last backup time, site, duration, result, message
  • Backup target switches automatically on failover (reads status.activeSite)
  • Prometheus metrics: bloodraven_backup_last_success_timestamp_seconds, _last_attempt_timestamp_seconds, _duration_seconds (histogram), _total{result} (counter)
  • CronJob uses ConcurrencyPolicy: Forbid to prevent overlapping runs
  • PITR is reserved in the API (spec.backup.pitr) but emits a warning event — binlog archiving is not yet implemented
  • Restore is a documented manual procedure (xtrabackup and mysqldump runbooks in backup.go header comment)

Note: this takes a different approach than #27 (which uses mysqlsh). This one uses a CronJob + embedded shell script with xtrabackup/mysqldump streaming directly to S3.
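For illustration, a `spec.backup` stanza might look like the sketch below. Only the top-level concepts (schedule, method, S3 storage, retention) are confirmed by the summary above; the nested field names (`bucket`, `prefix`, `endpoint`, `count`, `maxAge`) are hypothetical placeholders, not the actual CRD schema:

```yaml
apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
  name: example-group
spec:
  # ... existing failover group fields ...
  backup:
    schedule: "0 3 * * *"      # standard CronJob schedule expression
    method: xtrabackup         # or: mysqldump
    s3:                        # nested field names below are illustrative
      bucket: my-backups
      prefix: prod/example-group
      endpoint: https://minio.internal:9000   # optional, for non-AWS S3
    retention:
      count: 14                # keep at most N backups
      maxAge: 720h             # and/or prune backups older than this
```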

Test plan

  • go build ./...
  • go vet ./...
  • go test -race ./... (all existing tests plus new TestReconcile_CreatesBackupCronJob, TestReconcile_BackupTargetsReplicaSite, TestReconcile_BackupCleanupWhenRemoved, TestReconcile_BackupInvalidS3Config, TestRenderBackupPrefix, TestReplicaSite, TestBackupImage, TestBackupStatusFromJob_Success, TestBackupStatusFromJob_Failure, TestReconcileBackupStatus_UpdatesFromJob, TestBackupCronJobOwnerReference)
  • make generate && make manifests — CRD and RBAC regenerated from markers
  • Apply to a dev cluster and verify: CronJob is created, backup job runs against replica, S3 object appears, status.backup is populated, retention cleanup prunes old backups
  • Failover mid-run: verify next scheduled backup targets the new replica

🤖 Generated with Claude Code

Implements spec.backup for MysqlFailoverGroup: the operator reconciles a
Kubernetes CronJob that runs xtrabackup or mysqldump against the replica
site (non-active per status.activeSite, so the target switches after
failover) and streams gzip-compressed output directly to S3. Retention
is enforced by count and/or age at the end of each backup job.

Backup state is surfaced via status.backup (last backup time, site,
duration, result, message) and Prometheus metrics: last success and
last attempt timestamps, a duration histogram, and a success/failure
counter. ConcurrencyPolicy=Forbid prevents overlapping runs. Restore
remains a documented manual procedure (xtrabackup and mysqldump runbooks
in backup.go header).

PITR (binlog archiving) is reserved in the API but not yet implemented;
setting spec.backup.pitr=true emits a warning event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 10, 2026 15:56
Contributor

Copilot AI left a comment


Pull request overview

Adds automated MySQL backups to MysqlFailoverGroup by having the operator reconcile a Kubernetes CronJob that targets the replica site and streams backups to S3, while also surfacing status and Prometheus metrics for backup outcomes.

Changes:

  • Introduces spec.backup / status.backup API fields (CRD + Go types + deepcopy) and reconciles a per-group backup CronJob.
  • Adds embedded backup shell script + reconciliation logic to derive backup status from Jobs and emit Prometheus metrics.
  • Updates RBAC (operator + Helm chart) to allow managing batch CronJobs/Jobs.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 6 comments.

Summary per file:

  • internal/metrics/metrics.go: Adds Prometheus metrics for backup attempt/success timestamps, duration histogram, and total counter.
  • internal/controller/reconciler.go: Wires backup reconciliation into the main reconciler, adds RBAC markers, and sets the controller to own CronJobs.
  • internal/controller/backup.go: Implements CronJob rendering, replica-site targeting, status derivation from Jobs, and metrics emission.
  • internal/controller/backup_test.go: Adds unit tests covering CronJob creation, replica targeting, cleanup, prefix rendering, image selection, and status updates.
  • internal/controller/backup_script.sh: Embedded backup/retention script that runs xtrabackup/mysqldump and uploads artifacts/metadata to S3.
  • config/rbac/role.yaml: Grants operator ClusterRole access to batch CronJobs/Jobs (and includes policy PDB permissions).
  • config/crd/bases/shipstream.io_mysqlfailovergroups.yaml: CRD schema updates for backup configuration and status reporting.
  • charts/bloodraven/templates/clusterrole.yaml: Helm chart RBAC updated for batch CronJobs/Jobs.
  • api/v1alpha1/zz_generated.deepcopy.go: Generated deepcopy updates for new backup-related API structs/fields.
  • api/v1alpha1/types.go: Adds BackupSpec / S3 storage / retention structs and adds backup status to the CR status.


Comment on lines +66 to +73
case "$BACKUP_METHOD" in
  xtrabackup)
    xtrabackup --backup --stream=xbstream \
      --host="$MYSQL_HOST" --port="$MYSQL_PORT" \
      --user="$USER" --password="$PASS" \
      | gzip -c \
      | aws $S3_FLAGS s3 cp - "$S3_URI"
    ;;

Copilot AI Apr 10, 2026


The backup pipeline can report success even if xtrabackup or gzip fails, because POSIX sh pipelines return the exit status of the last command (aws cp). Without pipefail (not available in many /bin/sh), an upstream failure can be masked and produce a corrupted/empty backup while the Job still completes successfully. Consider switching to bash with set -o pipefail, or restructuring the script to reliably propagate failures from all pipeline stages.
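A minimal sketch of the masking behavior and a portable fix (not the PR's actual script; the status-file approach shown is one option that works even where `set -o pipefail` is unavailable):

```shell
#!/bin/sh
# In plain /bin/sh, a pipeline reports the exit status of its LAST command,
# so an upstream failure is silently lost:
false | cat > /dev/null
echo "plain sh pipeline status: $?"   # prints 0 even though `false` failed

# Portable workaround without bash's `set -o pipefail`: record the upstream
# stage's failure in a status file, then check it after the pipeline.
err=$(mktemp)
{ false || echo failed > "$err"; } | cat > /dev/null
if [ -s "$err" ]; then
  echo "upstream stage failed"
fi
rm -f "$err"
```

With bash available in the backup image, `set -o pipefail` is the simpler route; the status-file pattern above is the fallback when the image only ships a POSIX `/bin/sh`.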

Comment on lines +74 to +81
  mysqldump)
    mysqldump --single-transaction --routines --triggers --events \
      --set-gtid-purged=ON \
      -h "$MYSQL_HOST" -P "$MYSQL_PORT" -u "$USER" -p"$PASS" \
      --all-databases \
      | gzip -c \
      | aws $S3_FLAGS s3 cp - "$S3_URI"
    ;;

Copilot AI Apr 10, 2026


Same pipeline-exit-status issue applies to the mysqldump path: if mysqldump/gzip fails, the pipeline may still exit 0 if aws s3 cp succeeds, leading to a "successful" Job with an invalid backup artifact. Use a shell with pipefail support or otherwise ensure failures from mysqldump/gzip propagate to the script exit code.

Comment on lines +25 to +30
# Parse user:password@tcp(host:port)/db style DSN into discrete pieces for the
# backup tools. Extremely simple parse — assumes the DSN was produced by the
# operator's conventions and contains user:pass@tcp(host:port).
USER=$(printf '%s' "$MYSQL_DSN" | sed -E 's|^([^:]+):.*|\1|')
PASS=$(printf '%s' "$MYSQL_DSN" | sed -E 's|^[^:]+:([^@]*)@.*|\1|')


Copilot AI Apr 10, 2026


Parsing MYSQL_DSN with sed is brittle and will break for valid DSNs where the password contains @ or : (and for other DSN variants). This can cause backups to fail even though the Secret’s DSN works for the sidecar/operator. Prefer deriving MYSQL_USER/MYSQL_PASSWORD in Go using a DSN parser and passing them as separate env vars (or store them explicitly), instead of attempting to parse the DSN in shell.
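Passing discrete env vars from Go is the cleaner fix, but if the split must stay in shell, POSIX parameter expansion is considerably more robust than the `sed` approach: splitting at the *last* `@` and the *first* `:` tolerates passwords containing both characters (a sketch with a made-up DSN; it still assumes the user name itself contains no colon):

```shell
#!/bin/sh
# Split user:pass@tcp(host:port)/db at the LAST '@' and FIRST ':' so the
# password may contain '@' and ':'. Example DSN is illustrative only.
MYSQL_DSN='alice:p@ss:w0rd@tcp(db.example:3306)/app'

userpass=${MYSQL_DSN%@*}   # strip the shortest '@...' suffix => split at last '@'
USER=${userpass%%:*}       # everything before the first ':'
PASS=${userpass#*:}        # everything after the first ':'

echo "USER=$USER"          # USER=alice
echo "PASS=$PASS"          # PASS=p@ss:w0rd
```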

Comment on lines +31 to +35
# aws-cli endpoint flag (optional, for MinIO / non-AWS S3)
S3_FLAGS=""
if [ -n "${S3_ENDPOINT:-}" ]; then
  S3_FLAGS="--endpoint-url $S3_ENDPOINT"
fi

Copilot AI Apr 10, 2026


S3_FLAGS is constructed via string concatenation and later expanded unquoted in aws $S3_FLAGS ..., which allows unintended argument injection if S3_ENDPOINT contains spaces or leading dashes (and can also mis-handle endpoints with special characters). Prefer passing --endpoint-url as its own quoted argument (e.g., conditionally adding it as a separate word) rather than building an interpolated flag string.

Suggested change
# aws-cli endpoint flag (optional, for MinIO / non-AWS S3)
S3_FLAGS=""
if [ -n "${S3_ENDPOINT:-}" ]; then
  S3_FLAGS="--endpoint-url $S3_ENDPOINT"
fi
# Invoke aws-cli with an optional endpoint URL, passing each CLI argument as
# its own shell word so S3_ENDPOINT cannot be split or interpreted as extra
# flags.
aws_cli() {
  if [ -n "${S3_ENDPOINT:-}" ]; then
    aws --endpoint-url "$S3_ENDPOINT" "$@"
  else
    aws "$@"
  fi
}
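The hazard is easy to demonstrate with a stand-in `aws` function that simply counts the arguments it receives (a sketch; the endpoint value is deliberately contrived):

```shell
#!/bin/sh
# Mock `aws` as a function that prints its argument count, so we can see how
# an unquoted flag string word-splits versus a quoted per-argument call.
aws() { echo "$#"; }

S3_ENDPOINT='http://minio:9000 --debug'   # endpoint value containing a space
S3_FLAGS="--endpoint-url $S3_ENDPOINT"

aws $S3_FLAGS s3 cp                       # unquoted: splits into 5 words,
                                          # '--debug' becomes an extra flag
aws --endpoint-url "$S3_ENDPOINT" s3 cp   # quoted: 4 words, endpoint intact
```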

Comment on lines +343 to +360
// Emit metrics (idempotent for completed Jobs — gauges and counters are
// label-scoped; re-setting is safe and counter increments are guarded by
// the semantic-equality check above).
emitBackupMetrics(fg, &latest, newStatus)

return k8sretry.RetryOnConflict(k8sretry.DefaultRetry, func() error {
    var fresh v1alpha1.MysqlFailoverGroup
    if err := r.Get(ctx, types.NamespacedName{Namespace: fg.Namespace, Name: fg.Name}, &fresh); err != nil {
        return err
    }
    if fresh.Status.Backup != nil && equality.Semantic.DeepEqual(fresh.Status.Backup, newStatus) {
        return nil
    }
    fresh.Status.Backup = newStatus
    if err := r.Status().Update(ctx, &fresh); err != nil {
        logger.Error(err, "update backup status")
        return err
    }

Copilot AI Apr 10, 2026


Metrics are emitted before the status update is persisted. If Status().Update fails (conflict/transient error) after emitBackupMetrics runs, the next reconcile will re-emit metrics for the same completed Job and double-increment BackupTotal/re-observe the histogram. To keep metrics idempotent, emit metrics only after a successful status write, or include a stable “observed job” identifier in status and gate metrics emission on that.

Comment on lines +442 to +464
// emitBackupMetrics updates Prometheus metrics based on the latest Job state.
func emitBackupMetrics(fg *v1alpha1.MysqlFailoverGroup, job *batchv1.Job, status *v1alpha1.BackupStatus) {
    labels := map[string]string{"failover_group": fg.Name}

    if status.LastBackupTime != nil {
        metrics.BackupLastAttemptTimestamp.With(labels).Set(float64(status.LastBackupTime.Unix()))
    }

    switch status.LastBackupResult {
    case "Success":
        metrics.BackupTotal.WithLabelValues(fg.Name, "success").Inc()
        if status.LastSuccessfulBackup != nil {
            metrics.BackupLastSuccessTimestamp.With(labels).Set(float64(status.LastSuccessfulBackup.Unix()))
        }
        if status.LastBackupDurationSeconds != nil {
            metrics.BackupDurationSeconds.WithLabelValues(fg.Name, backupMethod(fg)).
                Observe(float64(*status.LastBackupDurationSeconds))
        }
    case "Failure":
        metrics.BackupTotal.WithLabelValues(fg.Name, "failure").Inc()
    }
    _ = job
}

Copilot AI Apr 10, 2026


emitBackupMetrics labels duration observations using backupMethod(fg) (current spec) rather than the method that actually ran for the Job. If the method is changed while/after a Job runs, the duration metric can be mislabeled. Since job is already passed in (but currently unused), consider deriving the method label from the Job (e.g., a label/annotation or the container env in the Job template) to ensure metrics reflect the executed backup method.

@colinmollenhour
Collaborator Author

Closed in favor of #30

@colinmollenhour colinmollenhour deleted the worktree-wishlist branch April 13, 2026 19:19