Skip to content

chore(mcp): capture detached dashboard daemon stdio for diagnostics#40518

Closed
Skn0tt wants to merge 13 commits intomicrosoft:mainfrom
Skn0tt:debug-dashboard-daemon-stdio
Closed

chore(mcp): capture detached dashboard daemon stdio for diagnostics#40518
Skn0tt wants to merge 13 commits intomicrosoft:mainfrom
Skn0tt:debug-dashboard-daemon-stdio

Conversation

@Skn0tt
Copy link
Copy Markdown
Member

@Skn0tt Skn0tt commented Apr 30, 2026

Summary

Diagnostic-only change to surface why tests/mcp/dashboard.spec.ts fails with Cannot read properties of undefined (reading 'endpoint') on windows-latest-firefox runs.

The dashboard daemon is spawned detached with stdio: 'ignore', so chromium launch failures inside it are completely invisible — we only see the downstream connectToDashboard timeout.

This PR opts the daemon into writing stdout+stderr to a per-test log file (via PWTEST_DASHBOARD_DAEMON_LOG), bumps DEBUG=pw:browser* for the cli child env, and attaches the log on test failure.

Not for merge — open as draft so we can read the daemon log from the failing windows-firefox MCP runs.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@Skn0tt Skn0tt force-pushed the debug-dashboard-daemon-stdio branch 2 times, most recently from dfa91ad to ad8d024 Compare April 30, 2026 14:25
@github-actions

This comment has been minimized.

Skn0tt and others added 5 commits May 4, 2026 11:05
Investigating recurring `Cannot read properties of undefined (reading
'endpoint')` failures of `tests/mcp/dashboard.spec.ts` on
windows-latest-firefox runs. The dashboard daemon is currently spawned
detached with `stdio: 'ignore'`, so when its chromium launch fails or
hangs we have zero visibility — only the downstream `connectToDashboard`
timeout shows up.

This change adds an opt-in log file:

- `cli show` (and `--kill`/`--annotate`) now redirect detached daemon
  stdout+stderr to `PWTEST_DASHBOARD_DAEMON_LOG` when set; otherwise
  behavior is unchanged.
- The MCP `cli` test fixture sets that env var to a per-test file under
  the test outputDir, bumps `DEBUG=pw:browser*` so chromium launch
  errors surface, and on test failure attaches the log to the test
  result so it appears in CI artifacts.

Diagnostic only; no production behavior change.
Surfaces .err logs from cli-session browser daemons and the http MCP
server's pw:browser* output on test failure so we can diagnose the
windows-firefox cli-session/http timeouts.
…t race

browser_navigate resolves on the page's load event, which can fire
before fire-and-forget fetches in an inline script complete. Wrap the
assertions in expect(...).toPass() so the snapshot is retried until
all expected requests are visible. Observed flake on windows-msedge.
Daemon .err files live at daemon/${workspaceDirHash}/${session}.err, but the
flat readdir loop only saw the top-level entries, so they were never attached
on failure.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Skn0tt Skn0tt force-pushed the debug-dashboard-daemon-stdio branch from 867a789 to 388d119 Compare May 4, 2026 09:13
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Annotate path was spawning the dashboard daemon with stdio:'ignore',
bypassing the existing log-capture. Use detachedStdio() there too,
and add lifecycle logs around openDashboardApp / acquireSingleton /
innerOpenDashboardApp / bind() so failures on Windows reveal whether
the daemon bailed out via the singleton "already running" path.
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Skn0tt added 2 commits May 4, 2026 15:39
The previous diagnostic relied on a daemon log file existing on disk,
but on firefox-windows the file never appeared, leaving us blind to
whether the cli even reached detachedStdio. Attach a debug summary
that includes path/stat of the log file plus stdout/stderr from every
cli invocation, and write a stderr marker before opening the log file
so we can tell whether detachedStdio ran and whether openSync threw.
The dashboard cli's --annotate path runs openDashboardApp in-process
and uses stdout for annotation output (parsed as JSON in some tests).
The previous diagnostic console.log calls in dashboardApp.ts polluted
that stdout, causing many of the dashboard.spec.ts CI failures (e.g.
"SyntaxError: Unexpected token 'd', \"[dashboardA\"... is not valid JSON"
and "expect(received).toMatch(expected)" with received string
"[dashboardApp pid=... openDashboardApp start...").

Switch all diagnostic logs to console.error. Original console.log calls
for annotate output and the "Listening on <url>" port-mode message are
preserved.
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

The dashboard daemon uses PLAYWRIGHT_SOCKETS_DIR to acquire a singleton
"already running" lock. The MCP test fixture was setting this dir per
worker, while every other piece of state (registry, daemon dir,
settings file) was per-test. As a result, when the daemon spawned by
test A had not fully exited by the time test B started, test B's
daemon would fail acquireSingleton with "already running" and exit
silently without binding its bindTitle - causing tests to time out
waiting for the bindTitle to appear in the registry.

Diagnostic logging on Windows confirmed:
  [dashboardApp pid=7828] openDashboardApp start, bindTitle=...fe6c87...
  [dashboardApp pid=7828] acquireSingleton failed: Error: already running

Suffixing the sockets dir with testId gives each test its own
singleton namespace, so previous tests' daemons cannot block new ones.
@github-actions

This comment has been minimized.

When the kill RPC is sent, the daemon ACKs by ending its socket and
then runs gracefullyProcessExitDoNotHang(0), which can take up to 30s
(it has to close any open browsers first). Meanwhile the named pipe
or unix socket stays bound to the daemon process - so a caller that
relied on cli show --kill returning to mean the singleton is gone
could race against the next acquireSingleton and get
Error: already running.

Poll the socket after the kill ACK until net.connect errors (nothing
is listening), with a 30s ceiling. On timeout, throw rather than
returning silently - a fast, clear failure surfaces in the right test
and prevents the previous-test-corrupting-the-next-test pattern.

Also revert the per-test PLAYWRIGHT_SOCKETS_DIR workaround so we can
verify whether this fix alone is sufficient on Windows CI.
Skn0tt added 2 commits May 4, 2026 18:08
…quisition

The daemon kill handler did not call server.close(), so the server kept
listening on the socket throughout gracefullyProcessExitDoNotHang (which
can take up to 30s while browsers shut down). A racing acquireSingleton()
in another process would connect successfully, treat the dying daemon as
alive, and bail out with already running.

Fix:
  - Call server?.close() in the kill handler so the binding stops
    accepting new connections immediately, even though the existing kill
    RPC connection drains normally and graceful exit continues.
  - Rewrite acquireSingleton() as a bounded retry loop. Treat any listen
    error (not just EADDRINUSE) as in use and probe the holder via
    connect(); only connect-success counts as alive. This is robust to
    libuv mapping Windows pipe errors to EACCES/EBUSY/ENOENT instead of
    EADDRINUSE.
  - Drop runKillClient socket-release polling: it could not work because
    it observed connect-success while graceful shutdown was in flight.
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Test results for "MCP"

1 failed
❌ [chrome] › mcp/autowait.spec.ts:19 › racy navigation destroys context @mcp-windows-latest-chrome

6875 passed, 1015 skipped


Merge workflow run.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Test results for "tests 1"

3 flaky ⚠️ [chromium-library] › library/video.spec.ts:275 › screencast › should capture navigation `@chromium-ubuntu-22.04-node24`
⚠️ [chromium-library] › library/video.spec.ts:719 › screencast › should work with video+trace `@chromium-ubuntu-22.04-node20`
⚠️ [chromium-library] › library/video.spec.ts:719 › screencast › should work with video+trace `@chromium-ubuntu-22.04-node22`

41646 passed, 784 skipped


Merge workflow run.

@Skn0tt
Copy link
Copy Markdown
Member Author

Skn0tt commented May 5, 2026

fix is in #40626

@Skn0tt Skn0tt closed this May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant