What: Eliminate resolveServerScript() fallback chain entirely — bundle server.ts into the compiled browse binary.
Why: The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.
Context: Bun's --compile flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.
Effort: M Priority: P2 Depends on: None
What: Isolated browser instances with separate cookies/storage/history, addressable by name.
Why: Enables parallel testing of different user roles, A/B test verification, and clean auth state management.
Context: Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.
Effort: L Priority: P3
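A minimal sketch of the name-addressable registry, assuming the real context factory would be Playwright's browser.newContext() (injected here so the shape stands alone; SessionManager and its method names are hypothetical):

```typescript
// Hypothetical session registry: each name maps to one isolated context.
// In the real implementation the factory would be () => browser.newContext()
// and destroy would be (ctx) => ctx.close().
class SessionManager<Ctx> {
  private sessions = new Map<string, Ctx>();

  constructor(
    private create: () => Ctx,
    private destroy: (ctx: Ctx) => void,
  ) {}

  // Get or lazily create the session for a name ("admin", "guest", ...).
  get(name: string): Ctx {
    let ctx = this.sessions.get(name);
    if (!ctx) {
      ctx = this.create();
      this.sessions.set(name, ctx);
    }
    return ctx;
  }

  // Close one session: the clean lifecycle hook video recording needs.
  close(name: string): void {
    const ctx = this.sessions.get(name);
    if (ctx) {
      this.destroy(ctx);
      this.sessions.delete(name);
    }
  }

  names(): string[] {
    return [...this.sessions.keys()];
  }
}
```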
What: Record browser interactions as video (start/stop controls).
Why: Video evidence in QA reports and PR bodies. Currently deferred because recreateContext() destroys page state.
Context: Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.
Effort: M Priority: P3 Depends on: Sessions
What: AES-256-GCM support for future Chromium cookie DB versions (currently v10).
Why: Future Chromium versions may change encryption format. Proactive support prevents breakage.
Effort: S Priority: P3
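A hedged sketch of the decryption mechanics, assuming a blob layout of [3-byte version prefix | 12-byte nonce | ciphertext | 16-byte GCM tag]; the exact layout of future Chromium versions is an assumption, only the node:crypto AES-256-GCM calls are fixed. The encrypt helper exists only to build fixtures in the same layout:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumed blob layout: [3-byte version | 12-byte nonce | ciphertext | 16-byte tag].
function decryptCookie(blob: Buffer, key: Buffer): string {
  const nonce = blob.subarray(3, 15);
  const tag = blob.subarray(blob.length - 16);
  const ciphertext = blob.subarray(15, blob.length - 16);
  const d = createDecipheriv("aes-256-gcm", key, nonce);
  d.setAuthTag(tag);
  return Buffer.concat([d.update(ciphertext), d.final()]).toString("utf8");
}

// Fixture helper only: produce a blob in the assumed layout.
function encryptCookie(plaintext: string, key: Buffer, version = "v10"): Buffer {
  const nonce = randomBytes(12);
  const c = createCipheriv("aes-256-gcm", key, nonce);
  const ct = Buffer.concat([c.update(plaintext, "utf8"), c.final()]);
  return Buffer.concat([Buffer.from(version), nonce, ct, c.getAuthTag()]);
}
```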
What: Save/load cookies + localStorage to JSON files for reproducible test sessions.
Why: Enables "resume where I left off" for QA sessions and repeatable auth states.
Effort: M Priority: P3 Depends on: Sessions
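A sketch of the file shape, assuming it mirrors Playwright's context.storageState() output (cookies plus per-origin localStorage); the interface below is an assumption, not the shipped schema:

```typescript
import { writeFileSync, readFileSync } from "node:fs";

// Assumed shape, mirroring Playwright's storageState() output.
interface BrowserState {
  cookies: { name: string; value: string; domain: string }[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

function saveState(path: string, state: BrowserState): void {
  writeFileSync(path, JSON.stringify(state, null, 2));
}

function loadState(path: string): BrowserState {
  return JSON.parse(readFileSync(path, "utf8")) as BrowserState;
}
```

Restoring would then be a matter of passing the loaded state back into context creation.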
What: Encrypted credential storage, referenced by name. LLM never sees passwords.
Why: Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.
Effort: L Priority: P3 Depends on: Sessions, state persistence
What: frame <sel> and frame main commands for cross-frame interaction.
Why: Many web apps use iframes (embeds, payment forms, ads). Currently invisible to browse.
Effort: M Priority: P4
What: find role/label/text/placeholder/testid with attached actions.
Why: More resilient element selection than CSS selectors or ref numbers.
Effort: M Priority: P4
What: set device "iPhone 16 Pro" for mobile/tablet testing.
Why: Responsive layout testing without manual viewport resizing.
Effort: S Priority: P4
What: Intercept, block, and mock network requests.
Why: Test error states, loading states, and offline behavior.
Effort: M Priority: P4
What: Click-to-download with path control.
Why: Test file download flows end-to-end.
Effort: S Priority: P4
What: --max-output truncation, --allowed-domains filtering.
Why: Prevent context window overflow and restrict navigation to safe domains.
Effort: S Priority: P4
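A minimal sketch of the two guards; the flag names come from the item above, the function shapes are assumptions:

```typescript
// Truncate command output to --max-output chars, noting how much was cut.
function truncateOutput(text: string, maxOutput: number): string {
  if (text.length <= maxOutput) return text;
  return text.slice(0, maxOutput) + `\n[truncated ${text.length - maxOutput} chars]`;
}

// Allow a URL if its hostname is an --allowed-domains entry or a subdomain of one.
function isAllowedDomain(url: string, allowed: string[]): boolean {
  const host = new URL(url).hostname;
  return allowed.some((d) => host === d || host.endsWith("." + d));
}
```

The suffix match includes the leading dot so that evil-example.com is not mistaken for a subdomain of example.com.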
What: WebSocket-based live preview for pair browsing sessions.
Why: Enables real-time collaboration — human watches AI browse.
Effort: L Priority: P4
What: Connect to already-running Chrome/Electron apps via Chrome DevTools Protocol.
Why: Test production apps, Electron apps, and existing browser sessions without launching new instances.
Effort: M Priority: P4
What: GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.
Why: Cross-platform cookie import. Currently macOS-only (Keychain).
Effort: L Priority: P4
What: Append structured JSON entry to .gstack/ship-log.json at end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).
Why: /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.
Context: /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.
Effort: S Priority: P2 Depends on: None
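A sketch of the append step, following the read-modify-write JSON pattern already used by eval-store.ts; the field names are assumptions drawn from the list above, the real entry shape lives in the ship template:

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Assumed entry shape (see item above for the intended fields).
interface ShipLogEntry {
  version: string;
  date: string;
  branch: string;
  prUrl: string;
  reviewFindings: number;
  todosCompleted: number;
  testsPassed: boolean;
}

// Append one entry to .gstack/ship-log.json, creating the file on first run.
function appendShipLog(path: string, entry: ShipLogEntry): void {
  const log: ShipLogEntry[] = existsSync(path)
    ? JSON.parse(readFileSync(path, "utf8"))
    : [];
  log.push(entry);
  writeFileSync(path, JSON.stringify(log, null, 2));
}
```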
What: After push, browse staging/preview URL, screenshot key pages, check console for JS errors, compare staging vs prod via snapshot diff. Include verification screenshots in PR body. STOP if critical errors found.
Why: Catch deployment-time regressions (JS errors, broken layouts) before merge.
Context: Requires S3 upload infrastructure for PR screenshots. Pairs with visual PR annotations.
Effort: L Priority: P2 Depends on: /setup-gstack-upload, visual PR annotations
What: /ship Step 7.5: screenshot key pages after push, embed in PR body.
Why: Visual evidence in PRs. Reviewers see what changed without deploying locally.
Context: Part of Phase 3.6. Needs S3 upload for image hosting.
Effort: M Priority: P2 Depends on: /setup-gstack-upload
What: /ship and /review post inline review comments at specific file:line locations using gh api to create pull request review comments.
Why: Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.
Context: GitHub supports inline review comments via gh api repos/$REPO/pulls/$PR/reviews. Pairs naturally with Phase 3.6 visual annotations.
Effort: S Priority: P2 Depends on: None
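A sketch of the payload the skill would feed to `gh api repos/$REPO/pulls/$PR/reviews --input -`; the comments array with path/line/side/body is GitHub's review API shape, while the Finding type is hypothetical:

```typescript
// Hypothetical finding shape produced by /ship or /review.
interface Finding {
  path: string;
  line: number;
  message: string;
}

// Build the JSON body for GitHub's "create a review" endpoint, with one
// inline comment per finding anchored at file:line on the new side of the diff.
function buildReviewPayload(findings: Finding[]): string {
  return JSON.stringify({
    event: "COMMENT",
    body: "Automated review findings",
    comments: findings.map((f) => ({
      path: f.path,
      line: f.line,
      side: "RIGHT",
      body: f.message,
    })),
  });
}
```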
What: Aggregate greptile-history.md into machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.
Why: Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.
Context: Was a P3 Future Idea. Upgraded to P2 now that greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.
Effort: S Priority: P2 Depends on: Enough FP data accumulated (10+ entries)
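The aggregation itself is a small group-by; a sketch assuming greptile-history.md parses into entries with a rule name and a verdict (both field names are assumptions):

```typescript
// Assumed parsed shape of a greptile-history entry.
interface HistoryEntry {
  rule: string;
  verdict: "false-positive" | "valid";
}

// Count false positives per rule, yielding the exportable summary.
function summarizeFalsePositives(entries: HistoryEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of entries) {
    if (e.verdict !== "false-positive") continue;
    counts[e.rule] = (counts[e.rule] ?? 0) + 1;
  }
  return counts;
}
```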
What: /review Step 4.5: browse PR's preview deploy, annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.
Why: Visual diff catches layout regressions that code review misses.
Context: Part of Phase 3.6. Needs S3 upload for image hosting.
Effort: M Priority: P2 Depends on: /setup-gstack-upload
What: Compare baseline.json over time, detect regressions across QA runs.
Why: Spot quality trends — is the app getting better or worse?
Context: QA already writes structured reports. This adds cross-run comparison.
Effort: S Priority: P2
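A sketch of the cross-run comparison, assuming baseline.json keeps a per-page health score (the schema here is an assumption about the QA report format):

```typescript
// Assumed baseline.json shape: health score per page.
interface Baseline {
  pages: Record<string, { healthScore: number }>;
}

// Report pages whose score fell by more than the tolerance since the last run.
function detectRegressions(prev: Baseline, curr: Baseline, tolerance = 0): string[] {
  const regressions: string[] = [];
  for (const [page, { healthScore }] of Object.entries(curr.pages)) {
    const before = prev.pages[page]?.healthScore;
    if (before !== undefined && healthScore < before - tolerance) {
      regressions.push(`${page}: ${before} -> ${healthScore}`);
    }
  }
  return regressions;
}
```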
What: /qa as GitHub Action step, fail PR if health score drops.
Why: Automated quality gate in CI. Catch regressions before merge.
Effort: M Priority: P2
What: After a few runs, check index.md for the user's usual tier pick and skip the AskUserQuestion prompt.
Why: Reduces friction for repeat users.
Effort: S Priority: P2
What: --a11y flag for focused accessibility testing.
Why: Dedicated accessibility testing beyond the general QA checklist.
Effort: S Priority: P3
What: Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over retro window.
Why: Retro should include production health alongside code metrics.
Context: Requires browse integration. Screenshots + metrics fed into retro output.
Effort: L Priority: P3 Depends on: Browse sessions
What: Configure S3 bucket for image hosting. One-time setup for visual PR annotations.
Why: Prerequisite for visual PR annotations in /ship and /review.
Effort: M Priority: P2
What: browse/bin/gstack-upload — upload file to S3, return public URL.
Why: Shared utility for all skills that need to embed images in PRs.
Effort: S Priority: P2 Depends on: /setup-gstack-upload
What: ffmpeg-based WebM → GIF conversion for video evidence in PRs.
Why: GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.
Effort: S Priority: P3 Depends on: Video recording
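A sketch of the two-pass invocation (palette generation then palette use, which gives much better GIF quality than a direct convert); the ffmpeg flags are standard, the paths and the idea of running it via Bun.spawn are placeholders:

```typescript
// Pass 1: derive an optimal 256-color palette from the WebM.
function paletteArgs(webm: string, palette: string): string[] {
  return ["-y", "-i", webm, "-vf", "fps=10,scale=640:-1:flags=lanczos,palettegen", palette];
}

// Pass 2: re-encode using that palette. Run e.g. Bun.spawn(["ffmpeg", ...gifArgs(...)]).
function gifArgs(webm: string, palette: string, gif: string): string[] {
  return [
    "-y", "-i", webm, "-i", palette,
    "-filter_complex", "fps=10,scale=640:-1:flags=lanczos[x];[x][1:v]paletteuse",
    gif,
  ];
}
```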
What: Lightweight post-deploy smoke test: hit key URLs, verify 200s, screenshot critical pages, console error check, compare against baseline snapshots. Pass/fail with evidence.
Why: Fast post-deploy confidence check, separate from full QA.
Effort: M Priority: P2
What: Run eval suite in CI, upload result JSON as artifact, post summary comment on PR.
Why: CI integration catches quality regressions before merge and provides persistent eval records per PR.
Context: Requires ANTHROPIC_API_KEY in CI secrets. Cost is ~$4/run. Eval persistence system (v0.3.6) writes JSON to ~/.gstack-dev/evals/ — CI would upload as GitHub Actions artifacts and use eval:compare to post delta comment.
Effort: M Priority: P2 Depends on: Eval persistence (shipped in v0.3.6)
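A sketch of the delta comment eval:compare might post; the result schema (passRate, costUsd) is an assumption about the eval JSON in ~/.gstack-dev/evals/:

```typescript
// Assumed per-run eval result shape.
interface EvalResult {
  passRate: number; // 0..1
  costUsd: number;
}

// Format a PR comment comparing the PR's eval run against main's.
function formatEvalDelta(base: EvalResult, pr: EvalResult): string {
  const d = (pr.passRate - base.passRate) * 100;
  const sign = d >= 0 ? "+" : "";
  return [
    "## Eval summary",
    `Pass rate: ${(pr.passRate * 100).toFixed(1)}% (${sign}${d.toFixed(1)} pts vs main)`,
    `Cost: $${pr.costUsd.toFixed(2)}`,
  ].join("\n");
}
```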
What: Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
Why: Reduce E2E test cost and flakiness.
Effort: XS Priority: P2
What: bun run eval:dashboard serves local HTML with charts: cost trending, detection rate, pass/fail history.
Why: Visual charts better for spotting trends than CLI tools.
Context: Reads ~/.gstack-dev/evals/*.json. ~200 lines HTML + chart.js via Bun HTTP server.
Effort: M Priority: P3 Depends on: Eval persistence (shipped in v0.3.6)
What: Run /qa as a GitHub Action step, fail PR if health score drops below threshold.
Why: Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.
Context: Requires headless browse binary available in CI. The /qa skill already produces baseline.json with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need ANTHROPIC_API_KEY in CI secrets since /qa uses Claude.
Effort: M Priority: P2 Depends on: None
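The gate itself reduces to a score comparison; a sketch assuming baseline.json exposes a single health score per run (the threshold default is a placeholder):

```typescript
// Compare the PR's health score against main's; fail if the drop exceeds
// the allowed threshold.
function qaGate(
  mainScore: number,
  prScore: number,
  threshold = 0,
): { pass: boolean; message: string } {
  const drop = mainScore - prScore;
  const pass = drop <= threshold;
  return {
    pass,
    message: pass
      ? `health score ${prScore} (main: ${mainScore}): ok`
      : `health score dropped ${drop} points (${mainScore} -> ${prScore})`,
  };
}
```

The CI step would exit nonzero when pass is false, which is what fails the PR check.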
What: Use Chrome DevTools Protocol DOM.documentUpdated / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit snapshot call.
Why: Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.
Context: Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.
Effort: L Priority: P3 Depends on: Ref staleness Parts 1+2 (shipped)
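The core idea reduces to a generation counter: every CDP mutation event bumps it, refs remember the generation they were minted at, and a ref is stale once the counter has moved on. A sketch with hypothetical names (the shipped RefEntry metadata from Parts 1+2 may differ):

```typescript
// Hypothetical ref table with generation-based staleness.
class RefTable {
  private generation = 0;
  private next = 1;
  private refs = new Map<number, { selector: string; generation: number }>();

  // Wire this to CDP DOM.documentUpdated / the MutationObserver bridge.
  onDomMutation(): void {
    this.generation++;
  }

  // Mint a ref at snapshot time, stamped with the current generation.
  mint(selector: string): number {
    const id = this.next++;
    this.refs.set(id, { selector, generation: this.generation });
    return id;
  }

  // Stale the moment the DOM has mutated since minting: no 5-second
  // timeout needed, the check is a counter comparison.
  isStale(id: number): boolean {
    const entry = this.refs.get(id);
    return !entry || entry.generation !== this.generation;
  }
}
```

The performance tuning the item mentions would live in onDomMutation (e.g. debouncing bursts of mutations), since the staleness check itself is O(1).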
- Rename to gstack
- Restructure to monorepo layout
- Setup script for skill symlinks
- Snapshot command with ref-based element selection
- Snapshot tests Completed: v0.2.0
- Annotated screenshots, snapshot diffing, dialog handling, file upload
- Cursor-interactive elements, element state checks
- CircularBuffer, async buffer flush, health check
- Playwright error wrapping, useragent fix
- 148 integration tests Completed: v0.2.0
- /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
- Issue taxonomy, severity classification, exploration checklist
- Report template, health score rubric, framework detection
- wait/console/cookie-import commands, find-browse binary Completed: v0.3.0
- cookie-import-browser command (Chromium cookie DB decryption)
- Cookie picker web UI, /setup-browser-cookies skill
- 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge) Completed: v0.3.1
- Track cumulative API spend, warn if over threshold Completed: v0.3.6
- Config CLI (bin/gstack-config), auto-upgrade via ~/.gstack/config.yaml, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade Completed: v0.3.8