v2: rewrite#125
Conversation
013145d to
6a9c2c6
Compare
c4d335d to
a623a58
Compare
a623a58 to
c055e29
Compare
|
|
|
It’d be good to have at least some tests for the migration path (v1 -> v2) and a few /status + helius /unverify cases, since the current CI is mostly cargo test on a dummy DB and may not catch regressions from this rewrite. |
7b70bcb to
50c0131
Compare
Dropping verified_programs / program_authority / solana_program_builds in 0001 -- which runs automatically at app startup -- made a bad v2 deploy unrecoverable without restoring from backup. Keep the v1 tables so the data survives the cutover (a rollback to the v1 binary still has it), and document the deploy steps and the deferred manual drop in the migration file itself. Addresses review feedback on otter-sec#125. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dropping verified_programs / program_authority / solana_program_builds in 0001 -- which runs automatically at app startup -- made a bad v2 deploy unrecoverable without restoring from backup. Keep the v1 tables so the data survives the cutover (a rollback to the v1 binary still has it), and document the deploy steps and the deferred manual drop in the migration file itself. Addresses review feedback on otter-sec#125. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7aac517 to
ed86bbd
Compare
Dropping verified_programs / program_authority / solana_program_builds in 0001 -- which runs automatically at app startup -- made a bad v2 deploy unrecoverable without restoring from backup. Keep the v1 tables so the data survives the cutover (a rollback to the v1 binary still has it), and document the deploy steps and the deferred manual drop in the migration file itself. Addresses review feedback on otter-sec#125. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ed86bbd to
c7f92e7
Compare
|
@meumar-osec updated the PR descriptions and opened the remaining PRs in the stack (#126, #127 , #128). They're still targeting main in order to get the CI to run properly. I updated the migration to retain the old tables instead of deleting, so worst case we should be able to rollback still. I also added a bunch of integration tests in #128 , including for the migration itself. Let me know if theres anything else you'd like to change! I'd be happy to hop in a call and walk you through the changes if you'd like too. |
Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Leftover from v1's global Lazy pattern; v2 injects config via AppState. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
The handler used to do two sequential queries — `get_program_state` for the cached on-chain hash + frozen/closed flags, then `best_build` for the matching completed build — and stitch the result through `VerificationResponse::from_state_and_build`. Replace both with one `LEFT JOIN LATERAL` that returns just the seven columns the response needs, render the `ExtendedStatusResponse` inline, and serialize once. Drops `best_build` (its only caller is gone) and shapes the handler to send the pre-encoded body via `(StatusCode, [(CONTENT_TYPE, "application/json")], json)` so axum doesn't re-encode anything. One round-trip per `/status` request, no struct→JSON shuffle, same external response shape. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Adds a moka cache (`Cache<Address, String>`) of pre-serialized `/status` bodies. Hits skip the LATERAL join and the JSON encode; misses run the existing path and write back the rendered body. Every mutating method invalidates the affected program's entry: - `upsert_program_state` (sweep + post-build snapshot) - `unverify_program` (upgrade webhook) - `mark_closed` (close webhook + sweep) - `mark_build_completed` (build finished — added `program_id` arg so we have the key without an extra lookup) TTL is bound to the sweep interval. Every sweep cycle upserts every `program_state` row, which evicts the matching cache entry — so a longer TTL would never fire, and a shorter one just adds DB load between sweeps. Capacity capped at 10k entries (LRU eviction beyond that, which the verified-program set will not reach in any realistic timeframe). Restart self-heals: the cache is empty on boot, the sweep job fires immediately at startup and refreshes `program_state` from chain before the first cached entry is written. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Replaces the manual get → DB query → insert pattern with `try_get_with`. moka's recommended path for "fetch or compute" coalesces concurrent calls on the same cold key into one computation — without it, 100 concurrent /status requests for an uncached program would each run the LATERAL join independently before any could write back. Same external behaviour, fewer DB hits under contention. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
`check_is_verified` (/status) filters builds to trusted signers, but `get_verification_status_all` (/verified-programs-status) didn't -- it returned any completed build's repo/commit, so an untrusted signer could surface its metadata there. v1 routed this endpoint through check_is_verified per program, so it was trust-filtered; this restores that. Extracts `onchain::trusted_signers()` (system_program::ID + SIGNER_KEYS) now that two callers need it, and uses it in both. The live upgrade authority stays matched in SQL via program_state.authority. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
NewBuild::from(&OtterBuildParams) left signer None, so every caller (both verify handlers, pda_worker) had to set it afterward -- a footgun where forgetting yields a null signer. The PDA is derived from its signer and the on-chain program stores that same key in the params, so it's authoritative: From now sets it directly. Drops the redundant overrides and the separately-threaded signer: setup_verification returns just the NewBuild (no more VerificationSetup struct), and process_verification / process_verification_sync lose their signer parameter. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
The sweep now backstops missed /pda webhooks. When a refresh observes a program's on-chain hash change, upsert_program_state flags pending_reverify; the sweep then drains up to max_reverifies_per_sweep flagged programs (config, default 3), fetches each one's current Otter Verify PDA, and kicks a build through the same execute path as the verify endpoints -- unless an identical build already exists (any status, so failures aren't retried). The flag persists across cycles, so a capped burst drains over several sweeps rather than being dropped, and is cleared once a program is handled so stuck programs aren't re-examined until they drift again. unverify_program also sets the flag, since it advances the stored hash itself and the sweep's drift check would otherwise miss it. Adds the pending_reverify column to program_state, a has_build_for_params guard (any-status sibling of find_duplicate, sharing one query via an include_failed toggle), and the AppState plumbing to give the sweep loop what execute needs. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Two small correctness fixes carried over from the v2 rewrite: - Hash the program-data bytecode when the account is exactly the header size (`>=` not `>`), matching `solana-verify`'s `data.get(45..)`. Only affects a zero-bytecode account, but it was an off-by-one. - Trim the `/verified-programs` search term before building the ILIKE pattern, so a space-padded (but valid) query still matches. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Two v1-parity fixes for behavior the rewrite changed silently: - /unverify: skip programs with no completed build. Helius watches every program upgrade, so the handler fires for programs we never verified; the upsert in unverify_program would otherwise create a junk program_state row for each (and the sweep would then chase it). v1 bailed here implicitly -- its get_verified_build errored for unknown programs and its UPDATE was a no-op -- so restore that gate explicitly. - /verified-programs and /verified-programs-status: stop excluding frozen programs (keep excluding closed). /status already reports a frozen-but-matching program as verified, so the lists now agree. Immutable (legacy-loader) programs are frozen and shouldn't be hidden from the directory. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Return None for a program with no program_state row (or NULL hash) rather than an empty string. The "" sentinel made "untracked" and "tracked, hash drifted" indistinguishable, which different callers want to treat differently. Each caller now handles None explicitly: - pda_worker / unverify: None (no cached hash) compares unequal to a real fetched hash, same as before -- behavior unchanged. - job_status: falls back to "" for the response field. Pure refactor; no behavior change. Co-Authored-By: stegaBOB <41593264+stegaBOB@users.noreply.github.com>
Dropping verified_programs / program_authority / solana_program_builds in 0001 -- which runs automatically at app startup -- made a bad v2 deploy unrecoverable without restoring from backup. Keep the v1 tables so the data survives the cutover (a rollback to the v1 binary still has it), and document the deploy steps and the deferred manual drop in the migration file itself. Addresses review feedback on otter-sec#125. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c7f92e7 to
8e75e22
Compare
v2 rewrite. Same
api/src/...directory layout as master so per-file diffs read in place; #126 stacks on top to flatten, #127 adds/resolve-hash, #128 adds integration tests.Schema
solana_program_builds(job) +verified_programs(result) merge into onebuildstable per attempt, withexecutable_hash,error_message, and abuild_statusENUM.verified_programs.on_chain_hash+program_authoritymerge into oneprogram_staterow per program;is_verifiedis now a derived expression.program_statealso gains apending_reverifyflag with no v1 equivalent: the sweep sets it when a program's on-chain hash drifts and drains it by reverifying (capped per cycle).build_logs.program_addressrenamedprogram_id;idretyped VARCHAR -> UUID.Migration runs inline at startup via
db.migrate(), idempotent. It copies the v1 data into the new tables but retains them -- the drop is deferred to a follow-up so the cutover stays reversible (a bad deploy can be rolled back by redeploying the v1 binary instead of restoring from backup). See the deploy note at the end of0001_init.sql.Runtime
/status/{program_id}was Redis-cached (5min TTL) with a per-request RPC call to fetch program authority on cache miss. v2 caches in-process via moka (invalidated on every write) and is DB-only at request time. OneRpcClient, no rotation;rpc_manager.rsgone.State mutation moves off the read path entirely. v1's
check_is_verifiedspawnedreverify_programin the background on hash mismatch — a fullsolana-verifyDocker build kicked off from a read endpoint. v2'scheck_is_verifiedis strictly read-only; rebuilds fire from write paths and the background sweep, never from a read:/pdawebhook: refetch hash, unverify if changed, rebuild against fresh PDA params./unverifywebhook: record new hash, no rebuild (PDA still points at old commit).program_statevia batchedgetMultipleAccounts, and reverifies programs whose on-chain hash drifted (capped per cycle, skipping params already built) -- the backstop for upgrades a webhook missed./verify-with-signer: manual fallback when a webhook is missed and the developer notices stale/status.Trade-off: between a missed webhook and the next sweep tick,
/statuscan showis_verified=trueagainst stale metadata, bounded bysweep_interval_seconds-- the sweep then reverifies the drifted program and self-heals. Previous behavior had the inverse trade-off: every drifted-program cache miss triggered a Docker build whether anyone needed it or not.Other
solana-sdkumbrella -> split crates.AppStateinstead of globalLazys.db-initcontainer. Migrations run at startup, so take a DB backup and deploy in a maintenance window; the v1 tables are kept for rollback and dropped via a manual follow-up once v2 is healthy.Perf
Local bench, 336 seeded programs, 4-cores, rate limits dropped.
/status/{program_id}warm path:/verified-programs-status: v1 timed out at 60s+ (per-program fan-out); v2 ~5ms./statuscold path: v2 ~218ms; v1 worst case seconds.