fix(discovery): tune RPC timeout and failure threshold from production data#239

Open
uibeka wants to merge 1 commit into Conway-Research:main from uibeka:fix/discovery-rpc-timeout-tuning

Conversation


@uibeka uibeka commented Feb 27, 2026

Problem

PER_CHUNK_TIMEOUT_MS (3s) and MAX_CONSECUTIVE_FAILURES (2) in getRegisteredAgentsByEvents() were set pre-production in PR #228. Production operation on Base's public RPC revealed both values are too aggressive for real-world latency:

  • eth_getLogs on recent block ranges regularly exceeds 3 seconds (3-6s observed). These are not failures — the RPC is responding, just slower on recent blocks that aren't fully indexed yet. The 3-second Promise.race timeout treats them as failures.

  • Two consecutive timeouts on the newest chunks abort the entire scan. The scanner processes newest blocks first. If the first two chunks are slow (common for recent blocks), the scanner gives up before ever reaching the older blocks where agent mint events actually live. Result: discover_agents reports 0 agents even though 20+ exist on-chain.
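The failure mode above can be reduced to a small sketch. The names `withTimeout`, `fetchChunk`, and `scan` are illustrative assumptions, not the actual erc8004.ts code, and latencies are scaled from seconds to milliseconds so the demo runs quickly:

```typescript
// Sketch of a newest-first chunk scan with a Promise.race timeout and a
// consecutive-failure cap. All names and latencies are assumptions.

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("chunk timeout")), ms)
    ),
  ]);
}

// Simulated eth_getLogs chunk fetch: the two newest chunks are slow
// (real world: 3-6s), older chunks are fast and hold the mint events.
function fetchChunk(i: number): Promise<string[]> {
  const latencyMs = i < 2 ? 50 : 1; // scaled down from seconds
  return new Promise((resolve) =>
    setTimeout(() => resolve(i < 2 ? [] : [`mint-${i}`]), latencyMs)
  );
}

async function scan(timeoutMs: number, maxFailures: number): Promise<string[]> {
  const events: string[] = [];
  let consecutiveFailures = 0;
  for (let i = 0; i < 5; i++) { // newest chunk first
    try {
      events.push(...(await withTimeout(fetchChunk(i), timeoutMs)));
      consecutiveFailures = 0;
    } catch {
      // Slow-but-healthy chunks count as failures here; two in a row
      // abort the scan before the older, event-bearing chunks are read.
      if (++consecutiveFailures >= maxFailures) return events;
    }
  }
  return events;
}

(async () => {
  console.log("before:", (await scan(30, 2)).length, "events"); // 0: aborts on slow newest chunks
  console.log("after: ", (await scan(80, 5)).length, "events"); // 3: tolerates the slowness
})();
```

With the tight settings the scan returns empty even though events exist in the older chunks; with the looser settings the same simulated RPC yields all of them.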

Production Evidence

Observed across two agents running on Base mainnet (Feb 26):

  • Scanner hits 2 consecutive chunk timeouts on recent blocks → "Too many consecutive chunk failures, stopping scan" → discover_agents returns 0 agents
  • Agents loop on empty discovery results or escalate to expensive fallback operations
  • After patching to 8s/5-failures locally: 10+ successful discovery calls, scanner finds all 20+ agents consistently

Fix

Two constant value changes in getRegisteredAgentsByEvents():

| Constant | Before | After | Rationale |
| --- | --- | --- | --- |
| `PER_CHUNK_TIMEOUT_MS` | `3_000` (3s) | `8_000` (8s) | Accommodates observed Base RPC latency (3-6s) with margin |
| `MAX_CONSECUTIVE_FAILURES` | `2` | `5` | Tolerates transient slow chunks on recent blocks without abandoning the scan |
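In code form the change is confined to two constant initializers. A sketch only; the surrounding declarations in src/registry/erc8004.ts are assumed, not quoted:

```typescript
// src/registry/erc8004.ts (sketch, not the verbatim file)
const PER_CHUNK_TIMEOUT_MS = 8_000;     // was 3_000: Promise.race budget per eth_getLogs chunk
const MAX_CONSECUTIVE_FAILURES = 5;     // was 2: consecutive timeouts tolerated before aborting
```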

Scope

  • 1 file changed (src/registry/erc8004.ts) — 2 constant values updated
  • 1 test file (src/__tests__/loop.test.ts) — timeout bump from 30s default to 180s to accommodate corrected scanning duration. The discover_agents loop test makes real RPC calls to Base mainnet; with correct timeout values, worst-case scan path is ~40s (5×8s), exceeding the default 30s test timeout. The test was only fast before because the scanner was aborting prematurely — the exact bug being fixed.
  • Zero logic changes — the scanning loop, timeout mechanism, failure tracking, and log format are all untouched
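The test-timeout bump follows directly from the new worst case: five consecutive 8s timeouts is 40s of wall clock before the scan gives up, which overshoots the 30s default. A hedged sketch of the arithmetic and the adjustment, assuming a vitest/jest-style API (the actual loop.test.ts is not shown here):

```typescript
// Worst-case scan duration before the failure cap aborts:
const worstCaseMs = 5 * 8_000; // MAX_CONSECUTIVE_FAILURES x PER_CHUNK_TIMEOUT_MS
console.log(worstCaseMs); // 40000 > the 30000ms default test timeout

// Hypothetical shape of the bump, assuming a vitest/jest-style
// third-argument timeout (the real test body is not reproduced here):
// it("discover_agents finds registered agents", async () => { /* real RPC */ }, 180_000);
```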

Testing

  • pnpm build — zero errors
  • pnpm test — all existing tests pass
  • Production-validated: 10+ successful discovery calls across two agents after local patch

Related

…n data

PER_CHUNK_TIMEOUT_MS (3s → 8s) and MAX_CONSECUTIVE_FAILURES (2 → 5) in
getRegisteredAgentsByEvents were set pre-production in PR Conway-Research#228. Production
operation on Base revealed both are too tight: eth_getLogs on recent block
ranges regularly exceeds 3s, and two consecutive timeouts on the newest
chunks (scanned first) abort the entire scan before reaching older blocks
where agent mint events live.

Updated to production-validated values tested across 10+ successful
discovery calls on two agents. 8s accommodates observed Base RPC latency
with margin. 5 consecutive failures tolerates transient slow chunks
without abandoning the scan.

Also bumped timeout on the discover_agents loop test (180s) since it makes
real RPC calls and the increased per-chunk timeout extends worst-case
execution time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

uibeka commented Feb 27, 2026

@unifiedh PR #241 is merged and complementary to this PR. Please review when you get a chance.

