fix(telegram): retry polling on transient network errors #976

Open
snemesh wants to merge 1 commit into anthropics:main from snemesh:fix/telegram-polling-retry-on-network-errors

snemesh commented Mar 25, 2026

Problem

The Telegram plugin's polling loop exits permanently on any error that isn't a 409 Conflict (lines 991-992 in server.ts). A single transient network issue (ECONNRESET, ETIMEDOUT, a DNS failure, or a Telegram 5xx) silently kills the bot: messages stop arriving and the user must restart Claude Code to recover.

This is the root cause behind several open issues:

Fix

  • Added isTransientError() helper that classifies errors as retryable: Node.js network codes (ECONNRESET, ECONNREFUSED, ETIMEDOUT, ENETUNREACH, EAI_AGAIN, EPIPE), fetch/TLS failures, and Telegram 5xx server errors.
  • The polling loop now retries with exponential backoff (up to 30s) on transient errors, matching the existing 409 Conflict retry behavior.
  • Permanent errors (401 Unauthorized, invalid token) still cause immediate exit — no infinite retry on bad credentials.
  • The backoff counter resets after a successful onStart callback, so intermittent flakes don't accumulate penalty across reconnections.
  • Added shuttingDown guard to avoid retrying during graceful shutdown.
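For readers without the diff at hand, the classification described above could look roughly like the sketch below. It is not the code from server.ts: the message patterns, the `ErrorLike` shape, and the assumption that Telegram API errors carry a numeric `error_code` field are all illustrative guesses.

```typescript
// Hypothetical sketch of a transient-error classifier; the real
// isTransientError() in server.ts may be structured differently.

// Node.js network error codes the PR lists as retryable.
const TRANSIENT_CODES = [
  "ECONNRESET", "ECONNREFUSED", "ETIMEDOUT",
  "ENETUNREACH", "EAI_AGAIN", "EPIPE",
];

// Assumed message fragments for fetch/TLS failures.
const TRANSIENT_MESSAGE = /socket hang up|fetch failed|network error|tls handshake/i;

// Assumed error shape: `code` as on Node system errors,
// `error_code` as the HTTP-like status of a Telegram API error.
interface ErrorLike {
  code?: unknown;
  error_code?: unknown;
  message?: unknown;
}

function isTransientError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as ErrorLike;

  // Telegram API errors: only 5xx is treated as transient, so
  // 400, 401, 403, 409 and 429 all fall through to "not transient".
  if (typeof e.error_code === "number") {
    return e.error_code >= 500 && e.error_code < 600;
  }

  const code = typeof e.code === "string" ? e.code : "";
  const message = typeof e.message === "string" ? e.message : "";

  // Node often embeds the code in the message ("read ECONNRESET"),
  // so check both the code property and the message text.
  if (TRANSIENT_CODES.some((c) => code === c || message.includes(c))) {
    return true;
  }
  return TRANSIENT_MESSAGE.test(message);
}
```

Defaulting to "not transient" for anything unrecognized keeps the exit-on-unknown behavior shown in the unit test output, so a new kind of permanent failure cannot put the bot into an endless retry loop.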

Changes

  • external_plugins/telegram/server.ts — 43 insertions, 12 deletions

Test plan

  • bun build passes (1.1MB bundle, 0 errors)
  • Unit tests: isTransientError() correctly classifies 19/19 error types (ECONNRESET, ETIMEDOUT, EAI_AGAIN, fetch failed, 5xx → retry; 401, 400, 403, 409, unknown → exit)
  • Integration tests: simulated polling loop verifies retry+recovery for 7 scenarios (network flakes, DNS failure, Telegram 500, 409 conflict, mixed transient→permanent)
  • Live test: bot started with bun server.ts, process suspended with SIGSTOP for 35s to force Telegram to drop the long-poll connection, resumed with SIGCONT — bot recovered via retry loop and resumed polling
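The retry behavior that the simulated polling loop exercises can be sketched as follows. This is a minimal illustration, not the loop from server.ts: `pollWithRetry`, `backoffDelay`, `StartFn`, and `RetryOptions` are hypothetical names, the base delay is an assumption, and the doubling schedule follows the "exponential backoff (up to 30s)" wording. The simulated log further down shows 100 ms, 200 ms, 300 ms steps, so the actual schedule may differ.

```typescript
// Hypothetical sketch of a retry-with-backoff polling loop.

// Starts polling; calls onStart once connected, resolves on clean stop,
// rejects if polling fails.
type StartFn = (onStart: () => void) => Promise<void>;

interface RetryOptions {
  baseDelayMs: number;                    // assumed first delay
  maxDelayMs: number;                     // cap (the PR states 30s)
  isShuttingDown: () => boolean;          // stands in for the shuttingDown guard
  sleep?: (ms: number) => Promise<void>;  // injectable so tests need no real timers
}

// Delay before retry number `attempt` (1-based): doubles each time, capped.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
}

async function pollWithRetry(
  start: StartFn,
  isRetryable: (err: unknown) => boolean,
  opts: RetryOptions,
): Promise<"stopped" | "permanent-error"> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let attempt = 0;

  while (!opts.isShuttingDown()) {
    try {
      // A successful connection resets the counter, so a later flake
      // starts again from the base delay.
      await start(() => { attempt = 0; });
      return "stopped";
    } catch (err) {
      if (opts.isShuttingDown()) return "stopped";      // no retry during shutdown
      if (!isRetryable(err)) return "permanent-error";  // e.g. 401 Unauthorized
      attempt += 1;
      await sleep(backoffDelay(attempt, opts.baseDelayMs, opts.maxDelayMs));
    }
  }
  return "stopped";
}
```

Passing the classifier in as `isRetryable` and the timer in as `sleep` is a design choice made for this sketch: it lets scenarios such as "ECONNRESET x3, then success" or "transient, then 401" run against a fake `start` function without touching the network or waiting on real delays.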

Unit test output (19/19 passed)

Testing isTransientError() classification:

  ✅ ECONNRESET           → RETRY
  ✅ ETIMEDOUT            → RETRY
  ✅ ECONNREFUSED         → RETRY
  ✅ EAI_AGAIN            → RETRY
  ✅ ENETUNREACH          → RETRY
  ✅ EPIPE                → RETRY
  ✅ socket hang up       → RETRY
  ✅ fetch failed         → RETRY
  ✅ network error        → RETRY
  ✅ TLS handshake        → RETRY
  ✅ Telegram 500         → RETRY
  ✅ Telegram 502         → RETRY
  ✅ Telegram 429         → EXIT
  ✅ Telegram 401         → EXIT
  ✅ Telegram 400         → EXIT
  ✅ Telegram 403         → EXIT
  ✅ Telegram 409         → EXIT
  ✅ random Error         → EXIT
  ✅ Aborted delay        → EXIT

19/19 tests passed

Integration test output (7/7 passed)

═══ Integration test: polling loop retry behavior ═══

✅ ECONNRESET x3 → recovers on 4th attempt
      0ms  bot.start() threw: read ECONNRESET
      0ms  transient error (attempt 1), retry in 100ms
    101ms  bot.start() threw: read ECONNRESET
    101ms  transient error (attempt 2), retry in 200ms
    302ms  bot.start() threw: read ECONNRESET
    302ms  transient error (attempt 3), retry in 300ms
    603ms  bot.start() succeeded — connected!
    654ms  clean exit (bot.stop called)

✅ DNS EAI_AGAIN x2 → recovers on 3rd attempt
      0ms  bot.start() threw: getaddrinfo EAI_AGAIN api.telegram.org
      0ms  transient error (attempt 1), retry in 100ms
    102ms  bot.start() threw: getaddrinfo EAI_AGAIN api.telegram.org
    102ms  transient error (attempt 2), retry in 200ms
    303ms  bot.start() succeeded — connected!
    354ms  clean exit (bot.stop called)

✅ Telegram 500 x1 → recovers on 2nd attempt
      0ms  bot.start() threw: ISE (500: Internal Server Error)
      0ms  transient error (attempt 1), retry in 100ms
    101ms  bot.start() succeeded — connected!
    152ms  clean exit (bot.stop called)

✅ 409 Conflict x2 → recovers on 3rd attempt (existing behavior preserved)

✅ 401 Unauthorized → exits permanently (no retry)
      0ms  bot.start() threw: Unauthorized (401: Unauthorized)
      0ms  exit: permanent error

✅ ECONNRESET → 401 Unauthorized → exits after transient retry

✅ fetch failed x2 → recovers

═══ Result: ALL PASSED ✅ ═══

Fixes #963

The polling loop currently exits permanently on any error that isn't
a 409 Conflict. A single network hiccup (ECONNRESET, ETIMEDOUT, DNS
failure, Telegram 5xx) kills the bot — messages stop arriving and
the user has to restart Claude Code to recover.

This adds retry-with-backoff for transient errors while preserving
immediate exit on permanent failures (401 Unauthorized, invalid
token). The backoff counter resets after a successful connection
via onStart, so intermittent flakes don't accumulate penalty.

Fixes anthropics#963

Development

Successfully merging this pull request may close these issues.

Telegram plugin: polling loop exits permanently on network errors (no retry)
