Solutions incident management #2481

Merged
merged 15 commits into from
Jul 7, 2025
8 changes: 8 additions & 0 deletions docs/solutions/incident-management/_category_.json
@@ -0,0 +1,8 @@
{
"label": "Incident management",
"collapsible": true,
"collapsed": true,
"customProps": {
"description": "Incident Management"
}
}
@@ -0,0 +1,33 @@
---
title: "Communicate internally & externally"
sidebar_position: 6
---

# Communicate internally & externally

Great incident response is as much about communication as it is about investigation or remediation. The best teams keep everyone aligned (engineers, stakeholders, and customers) by making communication a first-class part of their incident management process.

<img src="/img/guides/slackIncidentGuide/updatedIncidentEntity.png" border="1px" width="100%" />

## Why communication matters

In high-pressure moments, clear and timely communication prevents confusion, builds trust, and accelerates resolution. Silence or scattered updates can make a bad situation worse.

## How Port helps

- **Automate updates**: Instantly spin up Slack channels, send notifications, and update status pages.
- **Single source of truth**: All communication, context, and updates live with the incident in Port.
- **Internal and external alignment**: Keep your team and your users informed, every step of the way.

## Get started

- [Create Slack Channel for Incident](../../guides/all/create-slack-channel-for-reported-incident)
- [Create and Update Statuspage Incidents](../../guides/all/manage-statuspage-incident)
- [Update ServiceNow Incident](../../guides/all/interact-with-servicenow/)
- [Set up Incident Manager AI Agent](../../guides/all/setup-incident-manager-ai-agent)

:::tip Over-communicate in a Crisis
When in doubt, share more. Frequent, clear updates—internally and externally—reduce stress and build trust during incidents.
:::

Ready to make communication a superpower? Dive into the guides above and automate your updates with Port.
@@ -0,0 +1,61 @@
---
title: Detect & diagnose incidents
sidebar_position: 3
---

# Detect & diagnose incidents

Modern incident management is broken. Too many tools, too many silos, and not enough context. At Port, we believe incident management should be unified, contextual, and automated—so teams can focus on resolution, not wrangling alerts.

<img src="/img/solutions/incident-management/unify_alerts.png" border="1px" width="100%" />

## Why incident management needs to change

Traditional incident management is reactive and fragmented. Alerts come from everywhere, context is missing, and manual processes slow everything down. This leads to longer outages, frustrated teams, and unhappy users.

## The Port approach: unify, enrich, automate

There's a better way:

1. **Unify Alerts**: Bring all your alerts into a single, actionable stream.
2. **Enrich with Context**: Automatically add relevant metadata, ownership, and dependencies to every incident.
3. **Automate Creation**: Trigger incident workflows, notifications, and remediation steps—no manual handoffs.

## How to put this into practice

### Unify alerts

Connect all your monitoring and alerting tools to Port. Our integrations make it easy to centralize alerts from sources like Datadog, PagerDuty, and more.

- [Prometheus](../../build-your-software-catalog/custom-integration/webhook/examples/prometheus)
- [Grafana](../../build-your-software-catalog/custom-integration/webhook/examples/grafana)
- [Datadog Monitors](../../build-your-software-catalog/sync-data-to-catalog/apm-alerting/datadog/examples)
- [Dynatrace Problems](../../build-your-software-catalog/sync-data-to-catalog/apm-alerting/dynatrace)
- [New Relic Issues](../../build-your-software-catalog/sync-data-to-catalog/apm-alerting/newrelic)

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/W4kQ8O2w0WA"
title="Production Readiness Scorecards"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>


### Enrich with context

:::tip Context is everything
Teams that have rich context on incidents resolve them up to 40% faster. Make sure your catalog is up to date!
:::

Every alert in Port is automatically enriched with context from related data: who owns the service, which dependencies are involved, and what changed recently. This means faster diagnosis and fewer escalations.

[Learn how to build your software catalog](../../getting-started/overview)
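
To make the idea of enrichment concrete, here is a minimal sketch that joins an incoming alert with catalog data. The `Service` shape, the `CATALOG` dictionary and the `enrich_alert` function are hypothetical names invented for this illustration; they are not Port's API or data model.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical catalog entry (not Port's actual data model)."""
    name: str
    owner_team: str
    depends_on: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)


# Hypothetical in-memory catalog keyed by service name.
CATALOG = {
    "checkout": Service("checkout", "payments", ["orders-db"], ["deploy v1.4.2"]),
}


def enrich_alert(alert: dict, catalog: dict[str, Service]) -> dict:
    """Return a copy of the alert with owner, dependencies and recent changes attached."""
    service = catalog.get(alert.get("service", ""))
    if service is None:
        # Unknown service: flag it so the catalog gap is visible to responders.
        return {**alert, "owner_team": None, "catalog_match": False}
    return {
        **alert,
        "owner_team": service.owner_team,
        "dependencies": list(service.depends_on),
        "recent_changes": list(service.recent_changes),
        "catalog_match": True,
    }


enriched = enrich_alert({"service": "checkout", "title": "High error rate"}, CATALOG)
```

An alert for a service that is missing from the catalog comes back with `catalog_match` set to `False`, which is one reason to keep the catalog up to date.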

## Real-world benefits

- **Fewer False Positives**: Incident teams aren't exhausted by alerts "crying wolf".
- **Better Incident Assignment**: The right people are looped in for an incident, based on context, not just a hardcoded automation rule in your APM tooling.
- **Alert Deduplication**: Grouping incoming alerts by related data (service, team) avoids creating duplicate incidents.
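
Deduplication by related data can be sketched as a simple grouping step. The example below is an illustration only: the alert fields (`source`, `service`, `team`, `title`) and the `group_alerts` helper are assumptions made for this sketch, not the way Port implements deduplication.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group alerts by (service, team) so that each group maps to a single incident."""
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for alert in alerts:
        # Alerts missing a field fall into an explicit "unknown" bucket.
        key = (alert.get("service", "unknown"), alert.get("team", "unknown"))
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts arriving from different monitoring tools.
alerts = [
    {"source": "datadog", "service": "checkout", "team": "payments", "title": "5xx spike"},
    {"source": "prometheus", "service": "checkout", "team": "payments", "title": "High latency"},
    {"source": "grafana", "service": "search", "team": "discovery", "title": "Disk usage"},
]
incidents = group_alerts(alerts)
```

Here three alerts collapse into two candidate incidents, because both `checkout` alerts share the same service and team.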
@@ -0,0 +1,61 @@
---
title: "Operate & resolve incidents"
sidebar_position: 5
---

# Operate & resolve incidents

When incidents strike, the best teams don't just react—they operate with speed, clarity, and confidence. At Port, we believe that incident response should empower engineers to take decisive action, automate away friction, and keep everyone in the loop—all from a single, unified portal.

<img src="/img/solutions/incident-management/operate_incident.png" border="1px" width="100%" />

## Why real-time action matters

Every minute counts during an incident. The longer it takes to acknowledge, escalate, or resolve, the greater the impact on your users and your business. Our most successful customers give their engineers the tools and context to:

- **Acknowledge and take ownership** of incidents instantly
- **Escalate or reassign** when more help is needed
- **Grant temporary permissions** to unblock responders
- **Update stakeholders and status pages** with a click
- **Resolve and close out** incidents, all in one place

## How Port enables best-in-class incident operations

Port brings all your incident actions together, so you can:

- See every incident in context—owners, impacted services, recent changes, and more
- Take action directly from the incident view, with no tab-switching or manual handoffs
- Automate repetitive tasks and approvals, so engineers can focus on resolution

## Step-by-step: what great teams do in Port

### 1. Manage the full incident lifecycle centrally

<img src="/img/guides/pagerDutyDashboard2.png" border="1px" width="100%" />
Take ownership or hand off to the right responder, fast.
- [Acknowledge Incident](../../guides/all/acknowledge-incident)
- [Change PagerDuty Incident Owner](../../guides/all/change-pagerduty-incident-owner)

Loop in additional teams or leadership with a single action.
- [Escalate an Incident](/guides/all/escalate-an-incident)

Wrap up the incident, document what happened, and return to normal.
- [Resolve Incident](/guides/all/resolve-incident)
- [Delete a ServiceNow Incident](/guides/all/delete-servicenow-incident/)

### 2. Grant temporary permissions

Unblock engineers by granting just-in-time access to the systems and data they need to investigate the incident.
- [IAM Permissions Guide](/guides/all/iam-permissions-guide/)
- [Automatically Approve Actions](/guides/all/automatically-approve-action-using-automation/)
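
The core idea behind just-in-time access is that a grant is tied to an incident and expires on its own. The sketch below illustrates that idea under stated assumptions: `TemporaryGrant`, `grant_for_incident`, the role name and the one-hour default are all hypothetical and do not correspond to a Port or cloud IAM object.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TemporaryGrant:
    """Hypothetical record of a time-bound permission tied to an incident."""
    user: str
    role: str
    incident_id: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # The grant is valid strictly before its expiry time.
        return now < self.expires_at


def grant_for_incident(
    user: str,
    role: str,
    incident_id: str,
    now: datetime,
    ttl: timedelta = timedelta(hours=1),  # assumed default lifetime
) -> TemporaryGrant:
    """Create a grant that expires `ttl` after `now`."""
    return TemporaryGrant(user, role, incident_id, now + ttl)


now = datetime(2025, 7, 7, 12, 0, tzinfo=timezone.utc)
grant = grant_for_incident("alice", "db-read-only", "INC-123", now)
```

Because expiry is part of the grant itself, nobody has to remember to revoke access once the incident is closed.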


:::tip Act Fast, But With Context
The best responders move quickly, but never blindly. Use Port's strong RBAC, self-service actions, and dynamic permissions so your incident responders can gain appropriate access and take action quickly.
:::

## Real-world benefits

- **Lower MTTR**: Incidents are resolved faster, with less stress
- **Fewer handoffs**: Everything you need is in one place
- **More control**: Engineers are empowered to act, not wait
37 changes: 37 additions & 0 deletions docs/solutions/incident-management/overview.md
@@ -0,0 +1,37 @@
---
title: Overview
sidebar_position: 1
---

# Incident management

## Why manage incidents using a developer portal?

Incident management is much more than selecting a tool for paging and rotating on-call shifts. It is a discipline that requires investigation, communication and reflection almost in parallel: with little notice, developers are tasked with understanding a problem, identifying a root cause, keeping all internal and external stakeholders aware of what's going on, understanding the impact, resolving the issue and identifying lessons for the future. Quite a tall order. Making this harder is the spread of tools they need to engage with at each step of the process.

Imagine you're going to bed on the first night of your first on-call shift at a new company. Predictably, you receive a phone call from an unknown number, and a notoriously unfriendly, robotic-sounding text-to-speech program starts reading you an alert description, one syllable at a time.

Your palms sweat. You open your refurbished MacBook Pro and start logging into everything at once:
- PagerDuty to see and acknowledge the event
- Slack to start an incident channel and open a bridge for all those investigating
- Dynatrace to explore the telemetry
- Statuspage to be ready to notify customers of impact
- GitHub to review recent changes
- ArgoCD to review app sync states
- AWS to be ready to do further investigation around the infrastructure, or take actions to remediate
- Notion to start taking notes you'll later use in a post-mortem

Whether it's your first on-call shift or your hundredth, the story above highlights how fragmented toolchains and complex application architectures pull time away from incident triage, investigation and remediation, and toward manual tasks around communication and gathering context on the incident itself.

![Incident Management Solution Architecture](/img/solutions/incident-management/incident_management_solution_architecture.png)

## How can Port help?

With Port, you can close more incidents in less time and with less stress, thanks to some key features of the portal:
- Scorecards that help prevent incidents by improving your posture and production readiness.
- Technological and business context in the software catalog (like data on past incidents, recent changes, people with subject matter expertise and active company-wide initiatives).
- Time-saving automations and self-service actions (like automated updates of status pages, requests for privileged access or incident lifecycle management).
- Streamlined approvals and dynamic permissions (optionally gating key decisions and requests, while allowing for escalation or approval by leadership).
- No-code personalization of dashboards per role and identity (allowing for unique experiences and collaboration between different devs and SREs across different teams and on-call shifts).
- Lower MTTR, driven by time-saving automations and self-service actions, and even more by continuous improvement of standards and engineering practices.

100 changes: 100 additions & 0 deletions docs/solutions/incident-management/prevent-and-minimize-incidents.md
@@ -0,0 +1,100 @@
---
title: "Prevent & minimize incidents"
sidebar_position: 2
---

# Prevent & minimize incidents

Incidents are inevitable, but downtime and chaos don't have to be. Resilience is built, not wished for. The best teams prepare, automate, and learn, so when incidents happen, they're ready to bounce back fast.

<img src="/img/guides/productionReadinessMetricsDashboard.png" width="100%" border="1px" />

## Why resilience matters

Modern systems are complex. Outages, misconfigurations, and human error are facts of life. The difference between high-performing teams and those that lose trust with their customers? Resilience: the ability to prevent incidents, and to recover quickly when they do occur.

## Readiness = prepare + automate + learn

We think of resilience as an evergreen initiative: a target you keep working toward, not a one-time project. Our most successful customers focus on a three-part cycle, in which one:

1. **Prepares**: Map your systems, dependencies, and ownership so you know what's important, long before the first incident strikes. Build a meaningful catalog, to assist your human and AI incident responders.
2. **Automates**: Set up proactive checks, alerts, and self-healing workflows. Empower your human and AI incident responders to self-serve their way through investigation, remediation and communication around the incident, without need for tickets or approvals.
3. **Learns**: Capture incident data and feed it back into your processes for continuous improvement. The best teams generate retrospective documentation and index it, so that recurrences of the same incident are resolved quickly (greatly decreasing Mean Time to Resolution). Chase remediation tasks and improve the resilience of the system for a meaningful change to SLOs and MTBF (Mean Time Between Failures).

A focus on learning from the past and a culture of continuous improvement can switch your teams from being reactive and overburdened to proactive and confident.

<img src="/img/ai-agents/AIAgentRCAResponse3.png" border="1" width="100%" />

## How to build resilience with Port

### Prepare: map your world

A clear, up-to-date software catalog is the foundation of resilience. Know what you have, who owns it, and how everything connects.

[Start building your software catalog](/getting-started/overview)

Some of the data you should be cataloging, to prepare for the incidents to come:

- **Services** - all the internal services that R&D owns, from code to cloud. Map the source repositories,
- **Ownership** - every service should be owned, and the ownership should be clear for anyone to observe. The good news is that with our ownership inheritance, you can define ownership on a service and have it inherited across the graph of all related entities in your catalog
- **Environments** - account for every place the code can run, not just development, staging and production. Model every tenant. Make sure there is no place the code can run (or fail) that isn't modelled and governed by Port
- **Features** - model your features and feature flags, to help your human and AI responders understand the impact of incidents in terms of features, or the possibility of using feature flags to remediate issues during investigation
- **Customers** - track your customers, their access to different environments, features and entitlements
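
Ownership inheritance, mentioned in the list above, can be pictured as walking the catalog graph until an explicit owner is found. This is a conceptual sketch only: the entity dictionaries, the `parent` relation and `resolve_owner` are hypothetical and do not describe how Port implements inheritance.

```python
from typing import Optional


def resolve_owner(entity_id: str, entities: dict[str, dict]) -> Optional[str]:
    """Follow 'parent' relations upward until an entity with an explicit owner is found."""
    seen: set[str] = set()
    current: Optional[str] = entity_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in the relation graph
        entity = entities.get(current)
        if entity is None:
            return None
        if entity.get("owner"):
            return entity["owner"]
        current = entity.get("parent")
    return None


# Hypothetical catalog: only the service declares an owner explicitly.
entities = {
    "checkout": {"owner": "payments", "parent": None},
    "checkout-prod": {"owner": None, "parent": "checkout"},
    "checkout-prod-db": {"owner": None, "parent": "checkout-prod"},
    "orphan": {"owner": None, "parent": None},
}
```

In this example `checkout-prod-db` resolves to the `payments` team through two hops, while `orphan` has no owner and would need attention before an incident occurs.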

Take a look at what's possible when everything is connected. Here's a real-world example from a customer using our MCP server:

<img src="/img/solutions/incident-management/mcp_incident_impact.png" border="1" width="40%" />

### Track: monitor the metrics that matter

Many of our customers leverage Mean Time to Recovery as a golden metric for incident management.
We recommend [tracking MTTR, along with all the DORA metrics](/guides/all/create-and-track-dora-metrics-in-your-portal), but not stopping there.

Some teams become really adept at closing out incidents quickly, and engineering leadership then turns its attention to reducing the technical debt associated with the flakiest components in the system. Mean Time Between Failures (MTBF) can be a great metric for showing both the point-in-time and the historical operational health of each component.
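
As a worked example of the two metrics, the sketch below computes MTTR and MTBF from a list of incident start and resolution times. It assumes MTTR is the mean incident duration and MTBF is the mean gap between one incident's resolution and the next incident's start; other definitions exist, and the sample data is made up.

```python
from datetime import datetime, timedelta
from typing import Optional

Incident = tuple[datetime, datetime]  # (started_at, resolved_at)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: average of (resolved_at - started_at)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def mtbf(incidents: list[Incident]) -> Optional[timedelta]:
    """Mean time between failures: average gap from one resolution to the next start."""
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_resolved
        for (_, prev_resolved), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        return None  # fewer than two incidents: no gap to measure
    return sum(gaps, timedelta()) / len(gaps)


# Made-up incidents for a single component.
history = [
    (datetime(2025, 7, 1, 10, 0), datetime(2025, 7, 1, 11, 0)),   # 1h outage
    (datetime(2025, 7, 3, 10, 0), datetime(2025, 7, 3, 10, 30)),  # 30m outage
    (datetime(2025, 7, 6, 10, 0), datetime(2025, 7, 6, 12, 0)),   # 2h outage
]
```

For this sample, the durations total 3.5 hours across three incidents (an MTTR of 70 minutes), and the two gaps of 47 and 71.5 hours average to an MTBF of 59 hours 15 minutes.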

### Automate: proactive checks and self-healing

Don't wait for things to break. Use Port to automate health checks, enforce best practices, and trigger remediation workflows before users are impacted.

#### Scorecards

Scorecards allow your central governance functions to define what good looks like for all non-functional aspects of software development. AppSec teams can create Application Security scorecards, SRE teams can define Reliability scorecards and your Platform teams can create scorecards around the needs of the development platform (Ownership, Tagging, Pipeline Health, Supply Chain Health).

Our most successful customers define Production Readiness scorecards. Think of Production Readiness as a breadth-first approach that covers the most important needs from a variety of scorecards. For example, a Production Readiness scorecard:
- may not include all layers of an Application Security scorecard, but it may include a check that security tools are configured and that there are no critical vulnerabilities
- may not include all expectations of an Ownership scorecard, but may enforce an expectation that the service being deployed has an owning team configured
- may not include all rules from a Reliability scorecard, but may enforce the presence of one availability SLO for the service being deployed

Leverage a Production Readiness scorecard to make sure that your incident response teams have covered the baseline needs of what will be required to respond to an incident.
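
The breadth-first idea can be sketched as a small set of rules, each borrowed from a more specialized scorecard. The rule names, the service fields and the `evaluate` helper below are hypothetical and only illustrate the concept; Port scorecards are configured in the portal and are not defined with this code.

```python
from typing import Callable

Rule = Callable[[dict], bool]

# Hypothetical rules mirroring the examples above: one or two checks per domain.
PRODUCTION_READINESS: dict[str, Rule] = {
    "security tools configured": lambda s: bool(s.get("security_tools")),
    # Missing data counts as a failure, so the default is a non-zero count.
    "no critical vulnerabilities": lambda s: s.get("critical_vulnerabilities", 1) == 0,
    "owning team configured": lambda s: bool(s.get("owner_team")),
    "availability SLO defined": lambda s: "availability" in s.get("slos", []),
}


def evaluate(service: dict, rules: dict[str, Rule]) -> dict[str, bool]:
    """Run every rule against the service and report pass/fail per rule."""
    return {name: rule(service) for name, rule in rules.items()}


# Hypothetical service entity with no SLOs defined yet.
service = {
    "name": "checkout",
    "security_tools": ["sast-scanner"],
    "critical_vulnerabilities": 0,
    "owner_team": "payments",
    "slos": [],
}
results = evaluate(service, PRODUCTION_READINESS)
failing = [name for name, passed in results.items() if not passed]
```

The service above passes the security and ownership checks but fails the availability SLO rule, which is the kind of gap worth closing before the service is involved in an incident.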

- [Ensure Production Readiness](/guides/all/ensure-production-readiness)
- [See our scorecard guides](/promote-scorecards)

#### Track SLOs and SLIs

Beyond Production Readiness and other deeper, more specialized scorecards, consider tracking all SLOs and SLIs for your services in Port, for a better view of your reliability posture. Configure the AI Incident Manager Agent to help you explore and learn from past incidents.

- [Track SLOs and SLIs for Services](/guides/all/track-slos-and-slis-for-services)
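
For readers new to SLOs and SLIs, the following sketch shows how a request-based availability SLI relates to an SLO target and its error budget. The function names and the request counts are assumptions made for illustration; the guide linked above covers how to track these values in Port.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(slo_target: float, good_requests: int, total_requests: int) -> float:
    """Share of the error budget still unspent (negative when the SLO is breached)."""
    allowed_bad = (1 - slo_target) * total_requests
    actual_bad = total_requests - good_requests
    if allowed_bad == 0:
        # No failures are allowed: the budget is either intact or fully spent.
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


# Made-up numbers: a 99.9% target with 50 failed requests out of 100,000.
sli = availability_sli(99_950, 100_000)
remaining = error_budget_remaining(0.999, 99_950, 100_000)
```

With these numbers the SLI is 99.95%, and about half of the error budget is left, since the target allows roughly 100 failed requests and 50 have occurred.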

#### Have AI assist with building resilience

Whether it's finding out about incident resolution trends, or discovering more about current on-call assignments and active incidents, an AI Incident Manager Agent can help you.

- [Configure the AI Incident Manager Agent](/guides/all/setup-incident-manager-ai-agent)

:::caution Don't skip ownership
Resilience depends on clear ownership. Make sure every service and component in your catalog has an owner—otherwise, incidents will fall through the cracks.
:::

### Learn: close the loop

After every incident, use Port to capture what happened, analyze root causes, and update your processes. Continuous learning is the secret to long-term resilience.

- [Add RCA Context to AI Agents](/guides/all/add-rca-context-to-ai-agents)

## Real-world benefits

- **Fewer incidents**: Proactive checks and clear ownership prevent problems before they start.
- **Faster recovery**: When incidents do happen, you know exactly who to call and what to fix.
- **Continuous improvement**: Every incident makes your system stronger.
