incident-management: tighten IR template structure and pipeline runbook #424
Second pass update
This second pass keeps the same philosophy as the first one:
What changed in this pass
This pass focuses on the two runbooks that still felt materially underpowered:
1) Strengthened frontend-compromise
This page now better reflects how frontend incidents actually behave in practice, especially in Web3 where a frontend compromise often becomes a user-signing or
Changes include:
The goal here was to make the page more useful during the first minutes of an actual incident, not just more complete on paper.
2) Strengthened dependency-attack
This page was still too close to a stub. It now better distinguishes between a generic vulnerable package and a dependency incident that may have affected real build
Changes include:
What I intentionally did not change
Still intentionally out of scope:
For example, I left key-compromise unchanged in this pass rather than make lower-confidence edits.
Why this is the last pass
At this point, the highest-value weak spots in the imported IR template section have been addressed without turning the PR into a broad rewrite.
This keeps the contribution focused on:
Left some comments @mattaereal
9d348ab to d8165b7
@frameworks-volunteer can you address @scode2277's comments, resolving the conversations and any potential conflicts? Take all her feedback seriously.
d8165b7 to 43f0e45
Model: Addressed all of @scode2277's feedback and resolved merge conflicts with develop. Here is what was done:
scode2277
left a comment
gtg for me @mattaereal!
frameworks-volunteer
left a comment
Model: moonshotai/kimi-k2.6 Reasoning: high Provider: openrouter
Security: clean. No secrets, injection vectors, or unsafe patterns. Content is MDX docs only.
QA: all checks pass.
- CI green (build, lint, sidebar-reminder)
- All internal link targets exist and resolve correctly
- Frontmatter valid and consistent across files
- No leftover debug content, TODOs, or placeholder text
- "Stub runbook" language properly replaced with "Example runbook"
Substantive notes:
- Build pipeline compromise runbook: major upgrade from thin stub to credible operational document. The added scope questions, differentiation from adjacent incident classes, evidence preservation, blast-radius-ordered credential rotation, and verification gate before resuming delivery are all well chosen. The three recovery options (rebuild / rollback / pause) with explicit "When" and "Impact" framing are practical.
- Dependency attack runbook: similarly upgraded with proper scope questions, differentiation (correctly cross-linking to build-pipeline-compromise and frontend-compromise), and a "Verification Before Resuming" gate. Good that it distinguishes malicious vs vulnerable packages.
- Frontend compromise runbook: tightened throughout. The user warning message template is improved ("If you have not signed new transactions, your funds in the protocol remain unaffected" adds important clarity). Moving evidence preservation before cleanup is the right call. The trust boundary failure step is a useful addition.
- Taxonomy clarifications across overview pages are concise and well-placed -- they give readers a mental model for framework guidance vs templates vs runbooks vs playbooks without being repetitive.
- Policy/staffing softening ("Monitor based on residual risk..." and "24/7 escalation path") is more realistic and less doctrinal. Good changes.
- One minor note: in templates/overview.mdx, the line "use a template" self-links to the current page. Not broken, but slightly odd for a directory page. Very low priority.
Approving -- this is a solid, well-scoped first pass that delivers exactly what it promises.
| - [ ] CI/CD configuration changed without approval | ||
| - [ ] Secrets accessed or exfiltrated | ||
| - [ ] Unauthorized workflow runs | ||
| - [ ] Unexpected workflow runs or releases |
These should probably be plain list items, since the purpose of checklists here is unclear. Will the team need to tick them off? Why? I see they were checklists before, so this probably slipped through a previous review iteration.
| - [ ] Deployments reference an unexpected commit, artifact, or builder identity | ||
| ### Confirm Compromise | ||
| ### Likely Scope Questions |
Even though these questions are meaningful, it's unclear how they align with the purpose of this document. Is this a step to follow? Who should follow it? Why do they come before Immediate Actions?
| - Did the pipeline have deploy permissions, signing authority, or production credentials? | ||
| - Were any releases, containers, frontend bundles, or packages published during the exposure window? | ||
| ### Differentiation |
This section feels excessive for a runbook; it belongs in educational material or policy instead.
| 2. [ ] Rotate all secrets and tokens | ||
| 3. [ ] Take down potentially compromised deployments | ||
| 4. [ ] Audit recent builds and deployments | ||
| ### Step 1: Freeze the pipeline |
This doesn't say anything about key revocation/rotation, even though some keys may be used to push and approve.
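To make the rotation concern concrete, here is a minimal sketch of ordering credentials for rotation by blast radius, so that anything able to both push and approve goes first. The `Credential` type, the capability names, and the weights are all hypothetical assumptions for illustration; none of them come from the runbook.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Credential:
    """Hypothetical inventory entry for a pipeline credential."""
    name: str
    capabilities: FrozenSet[str]


# Assumed, illustrative weights; tune these to your own environment.
WEIGHTS = {"deploy": 4, "sign": 4, "approve": 3, "push": 2, "read": 1}


def blast_radius(cred: Credential) -> int:
    """Score a credential; push + approve together gets an extra penalty
    because one credential could then ship a change without a second party."""
    score = sum(WEIGHTS.get(cap, 0) for cap in cred.capabilities)
    if {"push", "approve"} <= cred.capabilities:
        score += 5
    return score


def rotation_order(creds: List[Credential]) -> List[Credential]:
    """Highest blast radius first; ties broken by name for stable output."""
    return sorted(creds, key=lambda c: (-blast_radius(c), c.name))


if __name__ == "__main__":
    inventory = [
        Credential("ci-read-token", frozenset({"read"})),
        Credential("bot-pat", frozenset({"push", "approve"})),
        Credential("deploy-key", frozenset({"deploy"})),
    ]
    for cred in rotation_order(inventory):
        print(cred.name, blast_radius(cred))
```

A scoring approach like this is only a starting point; a team may prefer a fixed, hand-maintained rotation list if its credential inventory is small.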
| - [ ] Revoke or pause auto-deploy jobs | ||
| - [ ] Block manual approvals until scope is understood | ||
| ### Step 2: Preserve evidence |
This is excessive for "Immediate Actions". Most of this evidence can be collected later; gathering it during the incident itself is inefficient when we need to limit the damage as fast as we can.
| These playbooks are reference material: they help teams think through common incident types, decision points, and | ||
| response patterns. They are not drop-in internal operating procedures. | ||
| For copy-and-adapt operational documentation, see |
| thinking about incident management prior to actually experiencing an incident, you can help increase the likelihood of a | ||
| timely recovery. | ||
| This framework contains two different kinds of content: |
| - [ ] Lockfile changes you didn't make | ||
| - [ ] Malicious code found in installed dependencies or build output | ||
| - [ ] Lockfile changes you did not expect | ||
| - [ ] Frontend bundle or released artifact changed more than the source diff would explain |
Dependencies can sit in different parts of a system, not only the frontend.
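As an illustration of checking for unexpected lockfile changes across more than one component, here is a minimal sketch. It assumes each lockfile has already been parsed into a plain `name -> version` mapping; the function names, component names, and package data are hypothetical, and real lockfile formats need their own parsers.

```python
from typing import Dict, List, Set, Tuple

Lock = Dict[str, str]  # package name -> pinned version (assumed, simplified shape)


def diff_lockfile(before: Lock, after: Lock) -> Dict[str, list]:
    """Return packages added, removed, or changed between two lockfile snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed: List[Tuple[str, str, str]] = sorted(
        (name, before[name], after[name])
        for name in set(before) & set(after)
        if before[name] != after[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


def unexpected_changes(diff: Dict[str, list], expected: Set[str]) -> List[str]:
    """List touched packages that nobody declared as an intended change."""
    touched = diff["added"] + diff["removed"] + [name for name, _, _ in diff["changed"]]
    return sorted(name for name in touched if name not in expected)


if __name__ == "__main__":
    # Hypothetical snapshots for two components, not only the frontend.
    before = {
        "frontend": {"left-pad": "1.3.0", "lodash": "4.17.21"},
        "backend": {"requests": "2.31.0"},
    }
    after = {
        "frontend": {"left-pad": "1.3.0", "lodash": "4.17.20", "evil-pkg": "0.0.1"},
        "backend": {"requests": "2.31.0"},
    }
    for component in sorted(before):
        diff = diff_lockfile(before[component], after[component])
        print(component, unexpected_changes(diff, expected={"lodash"}))
```

Running the same comparison per component keeps the indicator from being frontend-specific, which is the point raised in the comment above.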
| - [ ] UI behaves differently than expected | ||
| - [ ] Wallet drainer behavior detected | ||
| - [ ] Injected scripts or unexpected external resources appear in page source | ||
| - [ ] Official domain or subdomain resolves unexpectedly |
This isn't a 1-to-1 migration: the issue could be due to an MX change or something similar, or even an NS change, which doesn't necessarily change the IP the domain resolves to right away.
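To illustrate why a check limited to resolved addresses can miss this, here is a minimal sketch that compares a known-good baseline against observed records per record type. It assumes the records have already been fetched by some other tool; the `Records` shape, function name, and sample values are hypothetical.

```python
from typing import Dict, List, Set

Records = Dict[str, Set[str]]  # record type (e.g. "A", "MX", "NS") -> set of values


def diff_dns(baseline: Records, observed: Records) -> Dict[str, Dict[str, List[str]]]:
    """Report added and removed values for every record type that differs."""
    changes: Dict[str, Dict[str, List[str]]] = {}
    for rtype in sorted(set(baseline) | set(observed)):
        old = baseline.get(rtype, set())
        new = observed.get(rtype, set())
        if old != new:
            changes[rtype] = {
                "added": sorted(new - old),
                "removed": sorted(old - new),
            }
    return changes


if __name__ == "__main__":
    # Hypothetical data: the A record is unchanged, but the NS record is not.
    baseline = {
        "A": {"203.0.113.10"},
        "MX": {"10 mail.example.net."},
        "NS": {"ns1.example.net."},
    }
    observed = {
        "A": {"203.0.113.10"},
        "MX": {"10 mail.example.net."},
        "NS": {"ns1.attacker.example."},
    }
    print(diff_dns(baseline, observed))
```

In this example an A-record-only check would report nothing, while the per-type comparison surfaces the NS change.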
| These people should be reachable 24/7 for critical incidents. Consider: | ||
| There should be a 24/7 escalation path to these people for critical incidents. Consider: |
Why was this changed to "escalation path"? I'm not sure it makes a lot of sense.
Summary
This PR is a first pass on the recently added Incident Response Template section.
The goal is not to expand the section broadly, but to make it clearer, tighter, and more operationally credible without adding filler or speculative content.
This pass focuses on three things:
What changed
1) Clarified content taxonomy
Added concise framing so readers can understand what each layer is for:
2) Tightened a few absolute statements
These changes are meant to make the guidance more realistic and less doctrinal.
3) Upgraded the build pipeline compromise runbook
incident-response-template/runbooks/build-pipeline-compromise.mdx was previously a thin stub. This PR upgrades it into a more credible example runbook by adding:
What this PR does not do
Intentionally out of scope for this first pass:
I would rather leave gaps visible than fill them with weak or speculative guidance.
Why this scope
The Incident Response Template addition is already valuable, but right now it mixes:
This first pass tries to make that structure easier to understand, while also strengthening one page that felt materially underdeveloped.
Follow-up ideas (not included here)
Possible future passes, if useful: