thetechhustle · bdorlus · May 18, 2026 · May 14, 2026
diff --git a/docs/lessons/31_methodology_policy_and_politics/31.7_service_level_agreements.md b/docs/lessons/31_methodology_policy_and_politics/31.7_service_level_agreements.md
@@ -1,56 +1,179 @@
 ## 31.7 Service Level Agreements (SLAs)
 
+Service level agreements define what a service is expected to provide, how that expectation is measured, and what happens when the target is missed.
 
+For a Linux operator, an SLA is not just contract language. It shapes monitoring thresholds, escalation rules, maintenance windows, backup objectives, incident communication, and the evidence you gather when a system is unhealthy.
 
 !!! abstract "What you will learn"
-    - Explain where **31.7 Service Level Agreements (SLAs)** fits in day-to-day Linux operations.
-    - Use current Linux tooling to inspect, change, and verify the relevant system behavior.
-    - Connect the concept to a real operational scenario: an operations team trying to become predictable without becoming bureaucratic.
-
-!!! example "Field story"
-    Imagine an operations team trying to become predictable without becoming bureaucratic. Your job is not to memorize a command; it is to build a short evidence trail, choose a low-risk change, and prove whether the system improved.
+    - Explain the differences among an SLA, an SLO, an SLI, and an internal operating target.
+    - Identify Linux evidence that supports availability, latency, backup, and response commitments.
+    - Translate vague service promises into measurable operational checks.
+    - Avoid SLA promises that Linux operators cannot actually prove or control.
 
 !!! success "Operator principle"
-    Policies are useful when they make good work easier and risky work harder.
+    An SLA is only useful when it can be measured, explained, and acted on during normal operations and incidents.
 
-## Hands-on practice
+## SLA vocabulary
 
-Run these on a disposable VM, container, or lab machine unless the lesson explicitly says otherwise.
+The terms around service levels are often mixed together. Keep them separate:
 
-1. Inspect the current state with a read-only command related to this topic.
-2. Save the command and output in a short lab note.
-3. Make one reversible change or simulate the change in a sandbox.
-4. Re-run the inspection and explain what changed.
+- **Service Level Agreement (SLA)**: the formal promise between a provider and a customer or business unit. It often includes remedies, reporting expectations, and escalation paths.
+- **Service Level Objective (SLO)**: the target the team tries to meet, such as "99.9% monthly availability" or "95% of web requests complete in under 300 ms."
+- **Service Level Indicator (SLI)**: the actual measurement used to judge performance, such as HTTP success rate, request latency, backup success, or ticket response time.
+- **Operating target**: an internal threshold that helps the team protect the SLA, such as "page the on-call engineer when disk usage reaches 85%."
 
-## Check your understanding
+An SLA might promise that a service is available 99.9% of the time. The SLO makes that target precise: which endpoints count, over what window, and what a failed check looks like. The SLIs are the measurements themselves, collected from load balancers, monitoring systems, logs, or synthetic checks. Operating targets are the alerts and runbooks that keep the service inside the target.
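+
+As a minimal sketch of the operating-target idea, the script below checks the 85% disk threshold used as the example in the list above. The function names, the `THRESHOLD` variable, and the choice of `/` as the mount point are illustrative assumptions; a real team would more likely express this as a rule in its monitoring system than as a standalone script.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical operating-target check: warn when a filesystem crosses a threshold.
+THRESHOLD=85   # percent, taken from the example operating target
+
+# Print the used percentage (digits only) for a mount point.
+# df -P forces the POSIX single-line-per-filesystem output format.
+disk_used_pct() {
+    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
+}
+
+# Succeed (exit 0) when usage is at or above the threshold.
+over_threshold() {
+    [ "$1" -ge "$2" ]
+}
+
+used=$(disk_used_pct /)
+if over_threshold "$used" "$THRESHOLD"; then
+    echo "WARN: / is at ${used}% (threshold ${THRESHOLD}%)"
+else
+    echo "OK: / is at ${used}% (threshold ${THRESHOLD}%)"
+fi
+```
+
+The comparison lives in its own function so the threshold logic can be exercised without a nearly full disk.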
+
+## Common SLA areas
+
+Most infrastructure SLAs include a small set of measurable promises:
+
+- **Availability**: whether users can reach the service during the agreed measurement window.
+- **Performance**: whether the service responds within an agreed latency or throughput range.
+- **Support response**: how quickly the team acknowledges and starts work on an incident or request.
+- **Resolution or workaround**: how quickly service is restored, mitigated, or escalated.
+- **Backup and recovery**: how much data loss is acceptable and how quickly restore should happen.
+- **Maintenance notice**: how much warning users receive before planned downtime.
+- **Security and compliance reporting**: how incidents, access requests, and audit evidence are handled.
+
+Not every service needs all of these. A public customer-facing API usually needs strict availability and latency targets. A small internal reporting server might need business-hours support, backup restore expectations, and clear maintenance notice.
+
+## Linux evidence for SLA reporting
+
+Linux systems can provide useful evidence, but only if you know what question you are answering.
+
+| SLA question | Useful Linux evidence |
+| --- | --- |
+| Was the host running? | `uptime`, `last reboot`, `journalctl --list-boots`, cloud or hypervisor events |
+| Was the service process healthy? | `systemctl status SERVICE`, `systemctl show SERVICE`, process supervision history |
+| Were users receiving errors? | web server logs, application logs, load balancer status codes, synthetic monitoring |
+| Was the host resource-starved? | CPU load, memory pressure, disk latency, filesystem fullness, network errors |
+| Did a backup complete? | backup job logs, repository snapshots, restore test records |
+| Was there a recent change? | ticket history, deployment logs, package manager logs, configuration management runs |
+| Was the team notified? | alert records, paging history, incident channel timestamps, ticket comments |
+
+Host uptime alone is not a good availability metric. A server can be up while the application is down, the database is unreachable, DNS is wrong, a certificate is expired, or a firewall blocks users.
+
+!!! warning "Measure the user-visible service"
+    If the SLA is about a website, measure the website from outside the host. If the SLA is about backups, measure successful restore, not only successful backup job exit codes. Linux host evidence supports the report, but it does not replace service-level measurement.
+
+## Turning promises into checks
+
+Vague promises create conflict during incidents. Replace them with measurable statements.
+
+Weak promise:
+
+> The application will be fast and reliable.
+
+Better promise:
+
+> During business hours, the `/health` endpoint must return HTTP 200 from two external regions for 99.9% of five-minute checks per calendar month. Planned maintenance announced at least three business days in advance is excluded.
 
-- What evidence would tell you that this system is healthy?
-- What is the riskiest command in this lesson, and how would you make it safer?
-- How would you explain section 31.7 to a teammate during an incident handoff?
+That better version answers key questions:
 
+- What endpoint counts?
+- Who measures it?
+- How often is it measured?
+- What time period is used?
+- What downtime is excluded?
+- What must be communicated before planned work?
 
-Service Level Agreements, or SLAs, are pivotal in defining the professional relationship between IT providers and their clients.📝
+For Linux operators, this clarity drives practical work: monitoring probes, alert routing, maintenance planning, capacity checks, and incident notes.
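+
+To make the measurement concrete, here is a rough sketch of how an availability SLI could be computed from recorded probe results. The input format (one `<timestamp> <HTTP status>` pair per five-minute check) and the sample data are assumptions made for illustration; the real data would come from whatever external monitoring system runs the probes.
+
+```bash
+#!/usr/bin/env bash
+# Sketch: compute an availability SLI from synthetic probe results.
+# Assumed input format: "<ISO timestamp> <HTTP status>" per check, one per line.
+
+# Read probe lines on stdin and print the percentage that returned HTTP 200.
+availability_pct() {
+    awk '
+        { total++ }
+        $2 == 200 { ok++ }
+        END {
+            if (total == 0) { print "no data"; exit 1 }
+            printf "%.3f\n", (ok / total) * 100
+        }
+    '
+}
+
+# Hypothetical sample data: four checks, one of which failed.
+sample_probes() {
+    cat <<'EOF'
+2026-05-01T08:00:00Z 200
+2026-05-01T08:05:00Z 200
+2026-05-01T08:10:00Z 503
+2026-05-01T08:15:00Z 200
+EOF
+}
+
+sli=$(sample_probes | availability_pct)
+echo "availability: ${sli}%"
+```
+
+A production version would also need to drop checks that fall outside business hours or inside an announced maintenance window before counting, since the promise excludes them.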
 
-An SLA is essentially a contract that itemizes the level of service that a customer can expect from a provider. The aim of the SLA is to provide transparency, practical expectations, and build trust between parties. It keeps everyone on the same page 📘 and reduces the number of surprises or miscommunications.
+## Example: mapping an SLA to operator checks
 
-Let's dissect an SLA to understand its core components better:
+Suppose an internal wiki has these targets:
 
-1. **Introduction**: This section 📜 outlines the agreement's general terms, the parties involved, and any defining objectives.
+- Available Monday-Friday, 08:00-18:00 local time.
+- Incident acknowledgement within 30 minutes during business hours.
+- Nightly backup with no more than 24 hours of data loss.
+- Restore test completed every quarter.
 
-2. **Service Definition**: Here, services are fleshed out in detail. This could include the range of tasks, the scope of maintenance, and support that the provider agrees to offer.
+Reasonable Linux-side checks might include:
 
-3. **Performance Metrics**: Key Performance Indicators (KPIs) play a major role here. KPIs are quantitative measures used to evaluate the success of a service or activity. Metrics like latency, uptime, response time ⏰, and more can define the efficiency of a service.
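+```bash
+# Service state for the web server and database (unit names vary by distribution)
+systemctl status nginx
+systemctl status postgresql
+# Listening TCP sockets and owning processes (run as root to see every process name)
+ss -ltnp
+# Filesystem fullness
+df -h
+# Recent service logs from the journal
+journalctl -u nginx --since "1 hour ago"
+journalctl -u postgresql --since "1 hour ago"
+# Backup messages; /var/log/syslog is the Debian/Ubuntu path, other systems may use /var/log/messages or the journal
+grep -i "backup" /var/log/syslog | tail -20
+```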
 
-4. **Problem Management**: Specifies the steps to be taken in case of a service disruption. Details can vary, but it typically includes problem identification 👀, response time, and expected resolution timeline.
+The actual SLA report should still come from service monitoring, ticket timestamps, backup records, and restore-test notes. The commands above help explain what happened on the host when the SLA was at risk.
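+
+The 30-minute acknowledgement target can be checked from ticket or paging timestamps. The sketch below assumes GNU `date` (for `-d` parsing of ISO 8601 strings) and uses made-up timestamps; the function and variable names are hypothetical, not part of any ticketing tool.
+
+```bash
+#!/usr/bin/env bash
+# Sketch: minutes between an alert and its acknowledgement (GNU date assumed).
+ACK_TARGET_MIN=30   # from the example wiki targets
+
+# Print whole minutes elapsed between two ISO 8601 timestamps.
+ack_minutes() {
+    local start end
+    start=$(date -u -d "$1" +%s)
+    end=$(date -u -d "$2" +%s)
+    echo $(( (end - start) / 60 ))
+}
+
+# Hypothetical timestamps copied from a paging record and a ticket comment.
+elapsed=$(ack_minutes "2026-05-04T09:12:00Z" "2026-05-04T09:37:00Z")
+
+if [ "$elapsed" -le "$ACK_TARGET_MIN" ]; then
+    echo "acknowledged in ${elapsed} min: target met"
+else
+    echo "acknowledged in ${elapsed} min: target missed"
+fi
+```
+
+This only measures the gap between two recorded events. Whether the alert itself fired during the agreed business hours still has to be checked separately.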
 
-5. **Duties and Responsibilities**: The agreement will delineate what is expected from both the provider and the customer. This is the blueprint for professional interaction🤝 and cooperation.
+## Maintenance windows and exclusions
 
-6. **Review and Adjustment Clause**: A periodic review of the SLA is critical. It's a chance for both parties to sit down and discuss potential amendments or improvements to the agreement.
+SLAs should say what does and does not count as downtime. Common exclusions include:
 
-7. **Termination Clause**: This part details the terms of agreement termination 🚧. This could be due to violation of the contract, failure to meet requirements, or a simple decision to end services.
+- planned maintenance announced with at least the required notice
+- customer-caused outages, such as expired credentials or unsupported client behavior
+- upstream provider outages outside the team's control
+- emergency security maintenance
+- force majeure or other contract-specific exceptions
 
-Writing an SLA requires careful thinking and precise language. Similarly, reading and understanding ones needs a discerning eye 👁️. But, with consistent practice, you'll develop the necessary wisdom to utilize SLAs effectively.
+Exclusions are not excuses to be sloppy. They are boundaries that prevent every incident from turning into an argument about definitions. Operators should still record the timeline, user impact, change ticket, validation checks, and rollback decision.
 
-A well-drafted SLA leads to productive and harmonious working relationships, helps manage expectations, and ensures services render smoothly. So, while it can be dense and legalistic, it's a tool worth mastering in your IT arsenal.
+## Error budgets
+
+An error budget is the amount of unreliability a service can spend while still meeting its target.
+
+For example, 99.9% monthly availability allows roughly 43 minutes of downtime in a 30-day month (0.1% of 43,200 minutes is 43.2 minutes). If a service has already spent most of that budget, the team should slow risky changes, prioritize reliability work, and review alert quality.
+
+Error budgets help keep reliability practical. They let teams ship changes when the service is healthy and tighten controls when reliability is slipping.
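+
+The arithmetic is simple enough to script. The following is a small sketch, with hypothetical function names, that computes the downtime allowance for a target and how much of it is left after a given amount of recorded downtime.
+
+```bash
+#!/usr/bin/env bash
+# Sketch: error budget arithmetic for an availability target.
+
+# Allowed downtime in minutes for a target percentage over a number of days.
+error_budget_minutes() {
+    local target_pct="$1" days="$2"
+    awk -v t="$target_pct" -v d="$days" \
+        'BEGIN { printf "%.1f\n", d * 24 * 60 * (1 - t / 100) }'
+}
+
+# Minutes of budget left after some downtime has already been spent.
+budget_remaining() {
+    local budget="$1" spent="$2"
+    awk -v b="$budget" -v s="$spent" 'BEGIN { printf "%.1f\n", b - s }'
+}
+
+budget=$(error_budget_minutes 99.9 30)
+echo "99.9% over 30 days allows ${budget} minutes of downtime"
+echo "after a 30-minute outage, $(budget_remaining "$budget" 30) minutes remain"
+```
+
+A negative remainder means the target for the period has already been missed, which is a different conversation from a budget that is merely running low.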
+
+## Common SLA mistakes
+
+Watch for these failure patterns:
+
+- promising "100% uptime" without the architecture, staffing, and budget to support it
+- measuring only host uptime instead of user-visible service health
+- excluding planned maintenance without defining notice requirements
+- setting response targets without on-call coverage or escalation paths
+- committing to restore times that have never been tested
+- using monitoring that depends on the same failed network path as the service
+- writing reports that lack timestamps, evidence, or customer impact
+- confusing an alert threshold with a customer-facing SLA
+
+The fix is usually not more paperwork. It is sharper definitions, better evidence, and routine practice.
+
+## Incident notes for SLA review
+
+When an incident might affect an SLA, capture facts while they are fresh:
+
+```text
+Service:
+SLA/SLO at risk:
+Start time:
+Detection source:
+Customer impact:
+Current status:
+Linux evidence checked:
+Recent changes:
+Mitigation or rollback:
+Notification sent:
+Recovery time:
+Follow-up owner:
+```
+
+Keep the note factual. Avoid blame and speculation. A good SLA review explains impact, timeline, root cause, recovery, and what will change before the next incident.
+
+## Hands-on practice
+
+Use a disposable VM, container, or lab service.
+
+1. Pick a simple service, such as a local web server or SSH service.
+2. Write one measurable SLO for it. Include the measurement window and what counts as failure.
+3. Choose at least two SLIs that would prove whether the SLO was met.
+4. Run Linux commands that support the measurement, such as `systemctl status`, `ss -ltnp`, `journalctl`, or web server log checks.
+5. Write a short incident note for a simulated outage.
+6. Add one operating target that would warn the team before the SLA is missed.
+
+## Check your understanding
 
-Remember, embracing the 'boring' like SLAs could turn out to be exponentially beneficial 💰 for your professional advancement. Life in Linux isn't just about the cool code, but also about the essential paperwork.
+- Why is host uptime usually insufficient for an application availability SLA?
+- What is the difference between an SLA and an SLO?
+- Which Linux commands would help explain a missed web-service latency target?
+- What backup evidence would you want before promising a restore-time target?
+- How can planned maintenance be handled without making SLA reports misleading?