Add restore validation feature: restore to a special keyspace, allowing validation of backup/restore in a single cluster (space willing) #12573
Open
saintstack wants to merge 15 commits into apple:main from saintstack:restore-validation-simple
Conversation
Implements restore validation using audit_storage to verify backup/restore correctness. Includes a minimal fix for the backup gap bug.

Key components:
- ValidateRestore audit type: compares source keys against restored keys at the \xff\x02/rlog/ prefix in the storage server
- DD audit fixes: propagate validation errors, handle DD failover correctly
- RestoreValidation and BackupAndRestoreValidation workloads for testing
- Simplified backup gap fix: prevent the snapshot from finishing in the same iteration it dispatches the last tasks (single flag + one check)

Backup gap bug fix (FileBackupAgent.actor.cpp): the original dispatcher marks ranges as DONE when selecting them for dispatch, then immediately checks whether all ranges are done. This causes snapshots to finish before the dispatched tasks complete, creating gaps in backup coverage. The fix adds a dispatchedInThisIteration flag: if tasks were dispatched in this iteration, the completion check is skipped, ensuring at least one full loop between dispatch and completion (see the sketch below). This minimal change prevents premature snapshot completion without complex state tracking.
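A minimal standalone sketch of the flag pattern described above (plain C++ with hypothetical names, not the actual FileBackupAgent.actor.cpp code): if any range was dispatched during the current pass, the completion check is skipped so the snapshot cannot finish in the same iteration that dispatched its last tasks.

```cpp
#include <vector>

struct Range {
    bool done = false;        // range has been handed to a backup task
    bool dispatched = false;  // hypothetical per-range bookkeeping for this sketch
};

// One pass of a simplified snapshot dispatcher loop.
// Returns true only when the snapshot may be marked complete.
bool dispatchPass(std::vector<Range>& ranges) {
    bool dispatchedInThisIteration = false;

    for (auto& r : ranges) {
        if (!r.done) {
            r.dispatched = true;              // dispatch a backup task for this range
            r.done = true;                    // the original code marks it DONE right here...
            dispatchedInThisIteration = true;
        }
    }

    // ...so without this flag the completion check below could succeed in the
    // very iteration that dispatched the last tasks, finishing the snapshot
    // before those tasks have run and leaving gaps in coverage.
    if (dispatchedInThisIteration)
        return false;                         // force at least one more full loop

    for (const auto& r : ranges)
        if (!r.done)
            return false;
    return true;
}
```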
When too many wrong_shard_server errors occur (stale shard location data), throw audit_storage_failed instead of audit_storage_cancelled. This ensures the audit is properly marked as Failed in the database rather than staying stuck in Running state. Also add a delay before retrying to let data distribution stabilize.
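A hedged sketch of the error-handling change (plain C++, hypothetical names standing in for FDB's error codes): once the retry budget for wrong_shard_server is exhausted, a terminal audit_storage_failed-style error is thrown so the audit is recorded as Failed, and each retry is preceded by a short delay.

```cpp
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical exception types standing in for FDB error codes in this sketch.
struct WrongShardServer : std::runtime_error {
    WrongShardServer() : std::runtime_error("wrong_shard_server") {}
};
struct AuditStorageFailed : std::runtime_error {
    AuditStorageFailed() : std::runtime_error("audit_storage_failed") {}
};

// Retry the audit a bounded number of times on stale shard locations; when the
// budget is exhausted, surface a terminal failure instead of a "cancelled"
// error so the audit does not stay stuck in the Running state.
template <typename AuditFn>
void runAuditWithRetry(AuditFn audit, int maxRetries) {
    for (int attempt = 0;; ++attempt) {
        try {
            audit();
            return;
        } catch (const WrongShardServer&) {
            if (attempt >= maxRetries)
                throw AuditStorageFailed();  // terminal: gets marked Failed in the database
            // brief pause so data distribution can stabilize before the retry
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }
}
```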
The restore API can return success before all restored data is fully committed and visible to readers. Add a 5-second delay after restore completes before setting the completion marker. This prevents the validation audit from running too early and finding false mismatches due to in-flight commits.
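A small sketch of the ordering fix (standalone C++, hypothetical stand-ins for the workload steps): the completion marker that triggers the validation audit is only written after a pause, so in-flight commits from the restore become visible first.

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-ins for the workload steps around restore completion.
void waitForRestoreToReturn() { std::cout << "restore API reported success\n"; }
void setCompletionMarker()    { std::cout << "completion marker set; validation may start\n"; }

int main() {
    waitForRestoreToReturn();
    // The restore API can return before all restored data is visible, so wait
    // before signaling completion (the 5-second delay described above).
    std::this_thread::sleep_for(std::chrono::seconds(5));
    setCompletionMarker();
}
```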
The delay in the error path could interfere with actor cleanup or cause issues in other audit types. The retry itself should be sufficient to allow data distribution to stabilize.
When rangeLocations[].servers is empty, we were breaking out of the inner loop but continuing execution, which led to using the uninitialized targetServer variable at line 4538. This caused crashes/undefined behavior. Fix: set taskRangeBegin to skip the entire range and continue the loop, avoiding use of the uninitialized targetServer.
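A sketch of the control-flow fix (standalone C++, hypothetical names; the real code lives in the DD audit path): a location with no servers advances taskRangeBegin past the whole range and continues, rather than falling through and reading a targetServer that was never initialized.

```cpp
#include <string>
#include <vector>

struct RangeLocation {
    std::string rangeEnd;              // hypothetical: end key of this location's range
    std::vector<std::string> servers;  // storage servers holding the range
};

void dispatchAuditTasks(const std::vector<RangeLocation>& rangeLocations) {
    std::string taskRangeBegin;
    for (const auto& loc : rangeLocations) {
        if (loc.servers.empty()) {
            taskRangeBegin = loc.rangeEnd;  // skip the entire range...
            continue;                       // ...instead of breaking out and falling through
        }
        const std::string& targetServer = loc.servers.front();
        // ... issue the audit task for [taskRangeBegin, loc.rangeEnd) against targetServer ...
        taskRangeBegin = loc.rangeEnd;
        (void)targetServer;
    }
}
```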
The actor compiler was confused by using 'state' as a loop variable name since 'state' is a keyword in actor code. Renamed to 'auditState' to avoid the conflict.
Instead of adding recursive retry actors that can multiply and cause hangs, let wrong_shard_server errors propagate up to be handled by the higher-level error handlers. This prevents concurrent actors from all incrementing retryCount simultaneously and creating retry storms.
Even if the servers map is non-empty, individual DC server vectors could be empty. This would cause randomInt(0, 0) and out-of-bounds access. Skip empty DC server vectors to prevent crashes.
After skipping empty dcServers vectors, if storageServersToCheck is still empty, it means all DC server lists were empty. In this case, targetServer would never be initialized. Skip the entire shard to prevent using uninitialized targetServer.
When all audit states are Running or Failed and skipped, totalCount remains 0. The CompleteRatio calculation then divides by zero, causing a floating point exception (SIGFPE) and a process crash with exit code -2. This was the root cause of the -2 crashes in general test runs: the crashes occurred when ValidateHA or ValidateReplica audits (used in general tests) hit DD failovers and temporarily had all states in Running/Failed status.
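A sketch of the guard (standalone C++, hypothetical names): the ratio is only computed when totalCount is non-zero, so the division that previously raised SIGFPE can no longer execute.

```cpp
#include <cstdio>

// Returns 0 when nothing was counted (all audit states Running/Failed and skipped),
// instead of dividing by a zero totalCount.
double completeRatio(long completeCount, long totalCount) {
    if (totalCount == 0)
        return 0.0;
    return static_cast<double>(completeCount) / static_cast<double>(totalCount);
}

int main() {
    std::printf("CompleteRatio = %.2f\n", completeRatio(0, 0));  // previously crashed the process
    std::printf("CompleteRatio = %.2f\n", completeRatio(3, 4));
}
```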
The targetServer was only set when dcid == 0, but dcid gets incremented even for empty DC server lists (via continue). So if the first DC had an empty server list, dcid would be 1 when we encounter the first non-empty DC, and targetServer would never be set, causing a crash when accessed. Fixed by using a targetServerSet flag instead of checking dcid == 0. Now targetServer is set on the FIRST non-empty DC, regardless of index. This was the root cause of -2 crashes in general test runs.
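A sketch of the flag-based selection (standalone C++, hypothetical names): targetServer is taken from the first non-empty DC server list via an explicit targetServerSet flag, empty DC lists are skipped, and the shard is skipped entirely when every list is empty.

```cpp
#include <string>
#include <vector>

// Returns the servers to check for one shard; an empty result means the shard
// should be skipped because no DC had any servers listed.
std::vector<std::string> pickServersToCheck(const std::vector<std::vector<std::string>>& serversByDc) {
    std::vector<std::string> storageServersToCheck;
    std::string targetServer;
    bool targetServerSet = false;  // replaces the fragile dcid == 0 check

    for (const auto& dcServers : serversByDc) {
        if (dcServers.empty())
            continue;  // empty DC list: skip it rather than index out of bounds

        if (!targetServerSet) {
            targetServer = dcServers.front();  // first NON-EMPTY DC wins, whatever its index
            targetServerSet = true;
        }
        storageServersToCheck.insert(storageServersToCheck.end(), dcServers.begin(), dcServers.end());
    }

    if (!targetServerSet)
        return {};  // all DC lists were empty: skip the shard, never touch targetServer
    // ... targetServer would drive the comparison against the other servers ...
    return storageServersToCheck;
}
```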
Force-pushed: dcca4b7 to 77618b6
saintstack (Author) commented:
Looks like current origin/main is crashing in joshua. Seems unrelated to this PR (at least, I've tried main three times now -- gcc and clang -- and I get the below). Will come back here after we figure out what's going on in main. (Need joshua to log seed, test name, and whether buggify, at a very minimum... even on crash.)
This is an implementation of a Neethu design (design is included in the PR).
Here is the result of running the new simulation included here 100k times:

20251121-184138-stack-1458b890ad727389 compressed=True data_size=55635346 duration=3717292 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:44:53 sanity=False started=100000 stopped=20251121-192631 submitted=20251121-184138 timeout=5400 username=stack

I ran all tests 100k times and it looks like it hangs at the end, at 99975 or so. Looking to see if that is related.
Also verified the feature manually.