Skip to content

test: validate nearcore#15872 against back-migration flake (DO NOT MERGE)#3537

Draft
barakeinav1 wants to merge 16 commits into
mainfrom
barak/3362-test-with-nearcore-15872
Draft

test: validate nearcore#15872 against back-migration flake (DO NOT MERGE)#3537
barakeinav1 wants to merge 16 commits into
mainfrom
barak/3362-test-with-nearcore-15872

Conversation

@barakeinav1

Copy link
Copy Markdown
Contributor

Do not merge. Test branch to validate upstream near/nearcore#15872 against the back-migration flake tracked in near/nearcore#15867.

Composition

Test plan

CI campaign across 20 sister branches barak/nc15872-test-run-{1..20} at this branch's SHA. Watching migration_service__should_handle_back_migration_a_to_b_to_a. Historical failure rate without the fix is ~80%; if #15872 lands the fix, expect drastically fewer.

Back-migration (A -> B -> A) can leave the destination's stored on-chain
attestation past expiry by the contract's `current_time_seconds` at
conclude time. `reverify_participants` then rejects
`conclude_node_migration` with `InvalidTeeRemoteAttestation`, and
`retry_conclude_onboarding` loops forever on the same stale state.

Adds `submit_attestation_before_concluding_migration` in
`tee::remote_attestation`. `execute_onboarding` calls it between keyshare
import and `retry_conclude_onboarding`; the helper generates a fresh
quote via `TeeAuthority` and submits it via `submit_remote_attestation`,
which waits for `TransactionStatus::Executed`. By the time conclude is
attempted the contract holds a fresh attestation under the destination's
TLS key. Helper failures are logged but non-fatal so the non-back-
migration path is unaffected.

Drives the recovery path pinned by the companion contract-sandbox test
`conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`.

Plumbing: `TeeAuthority` and `account_public_key` are threaded
`run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard`
-> `execute_onboarding`.
… upstream-bug docs

Two companion docs split by audience:

- 2121-back-migration-e2e-flake.md (internal): full chronological
  investigation log — campaign-by-campaign findings, focused-repro
  2^4 matrix, what we ruled out, hypothesis, recommended follow-ups.
  ~317 lines.

- nearcore-indexer-sigkill-restart-panic.md (external): self-contained
  upstream bug report formatted for filing as an issue on near/nearcore.
  TL;DR, operational impact, affected versions, when the bug is
  reachable, two panic-mode stack traces, reproduction recipe pointing
  at this branch + commit pair, what we ruled out (incl. SIGTERM-handler
  + indexer-drain campaign data from #3410/#3486 confirming graceful
  shutdown doesn't prevent the panic), hypothesis. ~415 lines.
INTERNAL doc (2121-back-migration-e2e-flake.md):
- Add a 'Latest state (as of 2026-06-08)' callout right after the
  title so returning readers see the current understanding without
  re-reading the chronology.
- Fix the misleading TL;DR claim 'mpc-node has no SIGTERM handler' —
  it had one as of #3410, and a full indexer drain as of #3486. The
  panic still fires at ~80% on top of that, which is now the cleanest
  evidence we have that the fix has to be in nearcore.
- Add the 5/5 PASS data point on PR #3486 + no #3362 trigger (re-
  confirming PR #3373's revert experiment) to headline statistics.
- Prune 'Recommended follow-ups' — items 1, 4, 5, 6 are done; mark
  them as such with PR/issue links. Items 2, 3 still open with a
  clear recommendation on #2.

EXTERNAL doc (nearcore-indexer-sigkill-restart-panic.md):
- Update the 'Quick repro' branch-shape note. The reproducer commit
  is no longer 607e3e2 — after the rebase onto PR #3486, the trigger
  is at 0c02d27. nearcore is now resolved at 2.12.0 (final) via
  main's Cargo.lock, not 2.12.0-rc.2.
Three small, focused fixes turned up by the post-rebase CI:

1. crates/node/src/tests/common.rs — MockTransactionSender::send_and_wait
   was unimplemented!(). The trigger added by 1bcbb43 routes through
   submit_remote_attestation, which calls send_and_wait — so test_onboarding
   panicked at common.rs:28 'not implemented'. Mock now forwards to send()
   (so the transaction is still observed on the channel) and reports
   Executed. Real root-cause fix.

2. crates/node/src/tee/remote_attestation.rs — wrap the
   submit_remote_attestation call inside submit_attestation_before_concluding_migration
   with a 30s tokio::time::timeout. The helper is documented as best-effort
   and non-fatal (the caller swallows its errors); without a cap it could
   in principle wait MAX_RETRY_DURATION = 12 hours on an unresponsive
   TransactionSender. Defensive belt-and-suspenders to the mock fix
   above; also limits production failure-mode blast radius.

3. deny.toml — add RUSTSEC-2026-0173 to the advisories ignore list.
   proc-macro-error2 is the maintained successor to proc-macro-error
   (RUSTSEC-2024-0370, already ignored) and is unmaintained for the same
   reason; pulled in transitively, not directly used by us.
…ilures

When the back-migration e2e fails on the upstream nearcore restart
panic, the test framework's panic message previously just said
'node 0 indexer did not advance past height X within 60s'. The
actual nearcore panic stack lived in node 0's per-node stderr.log
on disk and was never surfaced to CI, so the only way to read it
was to grep through the runner's artifact tree manually.

Three changes to fix that:

1. crates/e2e-tests/src/mpc_node.rs: rotate stdout.log/stderr.log to
   .previous on each restart. Without rotation, File::create
   truncates and we lose the pre-restart context — including any
   panic from the pre-kill process and the last few seconds of
   mpc-node tracing.

2. crates/e2e-tests/tests/common.rs: in wait_for_node_indexer_height_above,
   on timeout dump the last 16 KB of all four log files (stdout/stderr,
   pre/post-restart) inline in the panic message. The post-restart
   stderr.log is where the upstream nearcore panic stack typically
   lives. Plus a read_log_tail helper that fails open with a
   placeholder string if a file isn't readable.

3. crates/e2e-tests/src/cluster.rs: add MpcNodeState::home_dir() so
   the common.rs dump can resolve a node's home directory whether
   the node is in Running or Stopped state at the moment of failure.

Net effect: a failing CI run now prints the upstream panic stack
right next to the test failure, so reviewers (and the nearcore
team, once we file the upstream bug) can read it without digging
through saved artifacts.
The earlier 'Affected versions' section in the external doc hedged
that we hadn't run on the 2.12.0 final tag because the trigger was on
a branch not yet rebased onto it. That's no longer true — PR #3362 is
now stacked on PR #3486 (which is based on main = nearcore 2.12.0,
commit 1144e31). Latest CI on commit 3a2ceaf captured the upstream
mode-A panic inline in the test failure message via the new diagnostic.

External doc: rewrite 'Affected versions' to lead with the 2.12.0 final
confirmation, cite the resolved commit (1144e31) and the CI run on
mpc commit 3a2ceaf.

Internal doc: add the same data point to 'Latest state' as a follow-up
to the earlier 607e3e2 campaign. Note that the panic-site code is
unchanged between rc.1, rc.2, and final, so a single confirming run
on final is enough to establish 'still affected'.
Three edits to the external doc requested for the upstream filing:

1. 'When this bug is reachable in practice' — add a 'Reproduction rate'
   subsection contrasting main (rare, no consistent stack) with PR #3362
   (~70-80%, every failure has the same mode-A stack). Establishes the
   trigger as load-bearing evidence, and warns nearcore reviewers that
   once #3362 merges, the panic will dominate failures of this test on
   main.

2. 'Reproduction' — promote the failing test name to a dedicated
   'Failing test' subsection at the top:

     migration_service__should_handle_back_migration_a_to_b_to_a
     in crates/e2e-tests/tests/migration_service.rs

   plus a short description of what the test does, what symptom the bug
   produces, and typical run times so reviewers know what 'normal' vs
   'failing' looks like.

3. 'Reproduction' — add a 'Reproduction via CI (recommended)' section
   with a gh-cli recipe + how to find the panic stack in the job log via
   the new diagnostic, plus a 'Campaign protocol' subsection for getting
   N parallel data points by pushing to sister branches. Keep the
   pre-existing local-repro recipe alongside, retitled 'Local reproduction'.
…commended)' qualifier

Both per review: 'Reproduction via CI' shouldn't editorialize about which
recipe to prefer (the doc presents both equally), and the campaign-
protocol detail isn't load-bearing for an external reader — anyone who
needs N data points can push N times via the basic recipe.
Add a short paragraph at the bottom of the TL;DR explaining that the
bug was found during e2e testing of NEAR's MPC node back-migration
(A -> B -> A) flow, which inherently requires stopping and restarting
the old participant. Gives nearcore reviewers immediate context for
why the kill+restart pattern isn't a contrived test setup.
When the external doc is pasted into a near/nearcore issue, bare
#NNNN auto-links to nearcore issues/PRs by the same number — not
to the near/mpc refs we mean. Replace every bare prose 'PR #NNNN'
and the one inline-code '#NNNN' with 'near/mpc#NNNN' (GitHub's
canonical cross-repo ref syntax, which auto-links to the right
repo regardless of where the doc is rendered).

Markdown-linked refs ([text](URL)) and references inside fenced
code blocks were already safe — only the bare prose mentions
needed the qualifier. The one remaining bare '#2121' on line 239
is inside an inline-code quote of the actual commit title; inline
code doesn't auto-link, so left as-is.
…tream-bug doc

The remote_attestation.rs reference used ../../crates/... which only
resolves correctly when the doc is read from docs/investigation/ in
near/mpc. When pasted into a near/nearcore issue, that path 404s.
Replace with an absolute github.com URL pinned to PR #3362's branch
(barak/2121-contract-stale-attestation-test), which is alive for the
lifetime of that PR.
Filed the upstream bug as near/nearcore#15867. Three back-references
on this branch:

- External doc: header banner under the title pointing at the
  upstream issue, so anyone landing on the doc via search sees the
  filed-status immediately.
- Internal doc 'Latest state': replace 'ready to file against
  near/nearcore' with the actual issue number.
- Internal doc 'Recommended follow-ups' item 4: mark the
  open-an-upstream-issue item as done with the issue number.
Filed-as-near/nearcore#15867 backlinks now also point at the
public Zulip discussion thread so the doc carries both the
upstream issue (canonical) and the discussion context (where
triage and follow-up happen day-to-day).

External doc: extend the header banner.
Internal doc: extend the 'Latest state' bullet.
…fails

The test asserts mpc-node exits with code 0 after SIGTERM. When the
process aborts mid-shutdown instead (which we just saw in CI: signal=6
SIGABRT), the only info we had was the exit signal — mpc-node's logs
live in the test's tempdir which gets cleaned up before we can read it.

Inline the last 16 KB of node 0's stdout.log (tracing output) and
stderr.log (panic backtraces) into the panic message so future failures
carry the pre-abort context directly in the CI job log.

Also adds MpcNodeState::home_dir() — the existing accessor pattern
(account_id, p2p_public_key, ...) over the Running/Stopped variants —
since the test needs to read logs from the node's home dir after
terminate consumed the MpcNode into the Stopped variant.
… fix

Points all near-* / nearcore deps at the slavas/fix-no-chunk-extra-at-memtrie-load
branch (PR near/nearcore#15872) so CI exercises the proposed upstream fix for
the back-migration restart panic tracked in near/nearcore#15867.
nearcore PR #15872's branch (post-2.12.0 master) added an actor_system
parameter to Indexer::start_near_node, which previously constructed
ActorSystem::new() internally. Pass it explicitly to keep behavior identical.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant