test: validate nearcore#15872 against back-migration flake (DO NOT MERGE)#3537
Draft
barakeinav1 wants to merge 16 commits into
Draft
test: validate nearcore#15872 against back-migration flake (DO NOT MERGE)#3537barakeinav1 wants to merge 16 commits into
barakeinav1 wants to merge 16 commits into
Conversation
Back-migration (A -> B -> A) can leave the destination's stored on-chain attestation past expiry by the contract's `current_time_seconds` at conclude time. `reverify_participants` then rejects `conclude_node_migration` with `InvalidTeeRemoteAttestation`, and `retry_conclude_onboarding` loops forever on the same stale state. Adds `submit_attestation_before_concluding_migration` in `tee::remote_attestation`. `execute_onboarding` calls it between keyshare import and `retry_conclude_onboarding`; the helper generates a fresh quote via `TeeAuthority` and submits it via `submit_remote_attestation`, which waits for `TransactionStatus::Executed`. By the time conclude is attempted the contract holds a fresh attestation under the destination's TLS key. Helper failures are logged but non-fatal so the non-back- migration path is unaffected. Drives the recovery path pinned by the companion contract-sandbox test `conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`. Plumbing: `TeeAuthority` and `account_public_key` are threaded `run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard` -> `execute_onboarding`.
… upstream-bug docs Two companion docs split by audience: - 2121-back-migration-e2e-flake.md (internal): full chronological investigation log — campaign-by-campaign findings, focused-repro 2^4 matrix, what we ruled out, hypothesis, recommended follow-ups. ~317 lines. - nearcore-indexer-sigkill-restart-panic.md (external): self-contained upstream bug report formatted for filing as an issue on near/nearcore. TL;DR, operational impact, affected versions, when the bug is reachable, two panic-mode stack traces, reproduction recipe pointing at this branch + commit pair, what we ruled out (incl. SIGTERM-handler + indexer-drain campaign data from #3410/#3486 confirming graceful shutdown doesn't prevent the panic), hypothesis. ~415 lines.
INTERNAL doc (2121-back-migration-e2e-flake.md): - Add a 'Latest state (as of 2026-06-08)' callout right after the title so returning readers see the current understanding without re-reading the chronology. - Fix the misleading TL;DR claim 'mpc-node has no SIGTERM handler' — it had one as of #3410, and a full indexer drain as of #3486. The panic still fires at ~80% on top of that, which is now the cleanest evidence we have that the fix has to be in nearcore. - Add the 5/5 PASS data point on PR #3486 + no #3362 trigger (re- confirming PR #3373's revert experiment) to headline statistics. - Prune 'Recommended follow-ups' — items 1, 4, 5, 6 are done; mark them as such with PR/issue links. Items 2, 3 still open with a clear recommendation on #2. EXTERNAL doc (nearcore-indexer-sigkill-restart-panic.md): - Update the 'Quick repro' branch-shape note. The reproducer commit is no longer 607e3e2 — after the rebase onto PR #3486, the trigger is at 0c02d27. nearcore is now resolved at 2.12.0 (final) via main's Cargo.lock, not 2.12.0-rc.2.
Three small, focused fixes turned up by the post-rebase CI: 1. crates/node/src/tests/common.rs — MockTransactionSender::send_and_wait was unimplemented!(). The trigger added by 1bcbb43 routes through submit_remote_attestation, which calls send_and_wait — so test_onboarding panicked at common.rs:28 'not implemented'. Mock now forwards to send() (so the transaction is still observed on the channel) and reports Executed. Real root-cause fix. 2. crates/node/src/tee/remote_attestation.rs — wrap the submit_remote_attestation call inside submit_attestation_before_concluding_migration with a 30s tokio::time::timeout. The helper is documented as best-effort and non-fatal (the caller swallows its errors); without a cap it could in principle wait MAX_RETRY_DURATION = 12 hours on an unresponsive TransactionSender. Defensive belt-and-suspenders to the mock fix above; also limits production failure-mode blast radius. 3. deny.toml — add RUSTSEC-2026-0173 to the advisories ignore list. proc-macro-error2 is the maintained successor to proc-macro-error (RUSTSEC-2024-0370, already ignored) and is unmaintained for the same reason; pulled in transitively, not directly used by us.
…ilures When the back-migration e2e fails on the upstream nearcore restart panic, the test framework's panic message previously just said 'node 0 indexer did not advance past height X within 60s'. The actual nearcore panic stack lived in node 0's per-node stderr.log on disk and was never surfaced to CI, so the only way to read it was to grep through the runner's artifact tree manually. Three changes to fix that: 1. crates/e2e-tests/src/mpc_node.rs: rotate stdout.log/stderr.log to .previous on each restart. Without rotation, File::create truncates and we lose the pre-restart context — including any panic from the pre-kill process and the last few seconds of mpc-node tracing. 2. crates/e2e-tests/tests/common.rs: in wait_for_node_indexer_height_above, on timeout dump the last 16 KB of all four log files (stdout/stderr, pre/post-restart) inline in the panic message. The post-restart stderr.log is where the upstream nearcore panic stack typically lives. Plus a read_log_tail helper that fails open with a placeholder string if a file isn't readable. 3. crates/e2e-tests/src/cluster.rs: add MpcNodeState::home_dir() so the common.rs dump can resolve a node's home directory whether the node is in Running or Stopped state at the moment of failure. Net effect: a failing CI run now prints the upstream panic stack right next to the test failure, so reviewers (and the nearcore team, once we file the upstream bug) can read it without digging through saved artifacts.
The earlier 'Affected versions' section in the external doc hedged that we hadn't run on the 2.12.0 final tag because the trigger was on a branch not yet rebased onto it. That's no longer true — PR #3362 is now stacked on PR #3486 (which is based on main = nearcore 2.12.0, commit 1144e31). Latest CI on commit 3a2ceaf captured the upstream mode-A panic inline in the test failure message via the new diagnostic. External doc: rewrite 'Affected versions' to lead with the 2.12.0 final confirmation, cite the resolved commit (1144e31) and the CI run on mpc commit 3a2ceaf. Internal doc: add the same data point to 'Latest state' as a follow-up to the earlier 607e3e2 campaign. Note that the panic-site code is unchanged between rc.1, rc.2, and final, so a single confirming run on final is enough to establish 'still affected'.
Three edits to the external doc requested for the upstream filing: 1. 'When this bug is reachable in practice' — add a 'Reproduction rate' subsection contrasting main (rare, no consistent stack) with PR #3362 (~70-80%, every failure has the same mode-A stack). Establishes the trigger as load-bearing evidence, and warns nearcore reviewers that once #3362 merges, the panic will dominate failures of this test on main. 2. 'Reproduction' — promote the failing test name to a dedicated 'Failing test' subsection at the top: migration_service__should_handle_back_migration_a_to_b_to_a in crates/e2e-tests/tests/migration_service.rs plus a short description of what the test does, what symptom the bug produces, and typical run times so reviewers know what 'normal' vs 'failing' looks like. 3. 'Reproduction' — add a 'Reproduction via CI (recommended)' section with a gh-cli recipe + how to find the panic stack in the job log via the new diagnostic, plus a 'Campaign protocol' subsection for getting N parallel data points by pushing to sister branches. Keep the pre-existing local-repro recipe alongside, retitled 'Local reproduction'.
…commended)' qualifier Both per review: 'Reproduction via CI' shouldn't editorialize about which recipe to prefer (the doc presents both equally), and the campaign- protocol detail isn't load-bearing for an external reader — anyone who needs N data points can push N times via the basic recipe.
Add a short paragraph at the bottom of the TL;DR explaining that the bug was found during e2e testing of NEAR's MPC node back-migration (A -> B -> A) flow, which inherently requires stopping and restarting the old participant. Gives nearcore reviewers immediate context for why the kill+restart pattern isn't a contrived test setup.
When the external doc is pasted into a near/nearcore issue, bare #NNNN auto-links to nearcore issues/PRs by the same number — not to the near/mpc refs we mean. Replace every bare prose 'PR #NNNN' and the one inline-code '#NNNN' with 'near/mpc#NNNN' (GitHub's canonical cross-repo ref syntax, which auto-links to the right repo regardless of where the doc is rendered). Markdown-linked refs ([text](URL)) and references inside fenced code blocks were already safe — only the bare prose mentions needed the qualifier. The one remaining bare '#2121' on line 239 is inside an inline-code quote of the actual commit title; inline code doesn't auto-link, so left as-is.
…tream-bug doc The remote_attestation.rs reference used ../../crates/... which only resolves correctly when the doc is read from docs/investigation/ in near/mpc. When pasted into a near/nearcore issue, that path 404s. Replace with an absolute github.com URL pinned to PR #3362's branch (barak/2121-contract-stale-attestation-test), which is alive for the lifetime of that PR.
Filed the upstream bug as near/nearcore#15867. Three back-references on this branch: - External doc: header banner under the title pointing at the upstream issue, so anyone landing on the doc via search sees the filed-status immediately. - Internal doc 'Latest state': replace 'ready to file against near/nearcore' with the actual issue number. - Internal doc 'Recommended follow-ups' item 4: mark the open-an-upstream-issue item as done with the issue number.
Filed-as-near/nearcore#15867 backlinks now also point at the public Zulip discussion thread so the doc carries both the upstream issue (canonical) and the discussion context (where triage and follow-up happen day-to-day). External doc: extend the header banner. Internal doc: extend the 'Latest state' bullet.
…fails The test asserts mpc-node exits with code 0 after SIGTERM. When the process aborts mid-shutdown instead (which we just saw in CI: signal=6 SIGABRT), the only info we had was the exit signal — mpc-node's logs live in the test's tempdir which gets cleaned up before we can read it. Inline the last 16 KB of node 0's stdout.log (tracing output) and stderr.log (panic backtraces) into the panic message so future failures carry the pre-abort context directly in the CI job log. Also adds MpcNodeState::home_dir() — the existing accessor pattern (account_id, p2p_public_key, ...) over the Running/Stopped variants — since the test needs to read logs from the node's home dir after terminate consumed the MpcNode into the Stopped variant.
… fix Points all near-* / nearcore deps at the slavas/fix-no-chunk-extra-at-memtrie-load branch (PR near/nearcore#15872) so CI exercises the proposed upstream fix for the back-migration restart panic tracked in near/nearcore#15867.
nearcore PR #15872's branch (post-2.12.0 master) added an actor_system parameter to Indexer::start_near_node, which previously constructed ActorSystem::new() internally. Pass it explicitly to keep behavior identical.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Do not merge. Test branch to validate upstream near/nearcore#15872 against the back-migration flake tracked in near/nearcore#15867.
Composition
fix(node): refresh attestation before concluding back-migration), rebased onto currentmain. Drops the two stale commits superseded by merged fix(node): handle SIGTERM for graceful shutdown on operator stop #3410 + fix(node): drain embedded near-indexer thread on graceful shutdown #3486.stdout.log/stderr.logtails.near-*/nearcoreCargo workspace dep fromtag = "2.12.0"tobranch = "slavas/fix-no-chunk-extra-at-memtrie-load"(PR #15872's branch, currently HEADb6d5378a).Test plan
CI campaign across 20 sister branches
barak/nc15872-test-run-{1..20}at this branch's SHA. Watchingmigration_service__should_handle_back_migration_a_to_b_to_a. Historical failure rate without the fix is ~80%; if #15872 lands the fix, expect drastically fewer.