diff --git a/docs/tutorials/policy-as-code/01-your-first-policy.md b/docs/tutorials/policy-as-code/01-your-first-policy.md
index 17c03462..8ee72bbf 100644
--- a/docs/tutorials/policy-as-code/01-your-first-policy.md
+++ b/docs/tutorials/policy-as-code/01-your-first-policy.md
@@ -1,3 +1,6 @@
+<!-- Copyright (c) Microsoft Corporation. -->
+<!-- Licensed under the MIT License. -->
+
 # Chapter 1: Your First Policy
 
 In this chapter you will write a YAML policy file that blocks dangerous agent
diff --git a/docs/tutorials/policy-as-code/02-capability-scoping.md b/docs/tutorials/policy-as-code/02-capability-scoping.md
index 246dab74..241752d0 100644
--- a/docs/tutorials/policy-as-code/02-capability-scoping.md
+++ b/docs/tutorials/policy-as-code/02-capability-scoping.md
@@ -1,3 +1,6 @@
+<!-- Copyright (c) Microsoft Corporation. -->
+<!-- Licensed under the MIT License. -->
+
 # Chapter 2: Capability Scoping
 
 In Chapter 1 you created a single policy that applies to every agent. In
diff --git a/docs/tutorials/policy-as-code/03-rate-limiting.md b/docs/tutorials/policy-as-code/03-rate-limiting.md
index 66a3fa31..5da0e75e 100644
--- a/docs/tutorials/policy-as-code/03-rate-limiting.md
+++ b/docs/tutorials/policy-as-code/03-rate-limiting.md
@@ -1,3 +1,6 @@
+<!-- Copyright (c) Microsoft Corporation. -->
+<!-- Licensed under the MIT License. -->
+
 # Chapter 3: Rate Limiting
 
 An agent with the right permissions can still cause problems if it runs out of
diff --git a/docs/tutorials/policy-as-code/04-conditional-policies.md b/docs/tutorials/policy-as-code/04-conditional-policies.md
index 1e3b4097..2f4bea17 100644
--- a/docs/tutorials/policy-as-code/04-conditional-policies.md
+++ b/docs/tutorials/policy-as-code/04-conditional-policies.md
@@ -1,3 +1,6 @@
+<!-- Copyright (c) Microsoft Corporation. -->
+<!-- Licensed under the MIT License. -->
+
 # Chapter 4: Conditional Policies
 
 In Chapters 1-3 each policy stood on its own — one file, one evaluator, one
diff --git a/docs/tutorials/policy-as-code/05-approval-workflows.md b/docs/tutorials/policy-as-code/05-approval-workflows.md
index c78dea20..b3a3e843 100644
--- a/docs/tutorials/policy-as-code/05-approval-workflows.md
+++ b/docs/tutorials/policy-as-code/05-approval-workflows.md
@@ -463,5 +463,5 @@ every tool gets the right decision in every environment. That is policy
 testing.
 
 **Previous:** [Chapter 4 — Conditional Policies](04-conditional-policies.md)
-**Next:** Chapter 6 — Policy Testing (coming soon) — verify that every
+**Next:** [Chapter 6 — Policy Testing](06-policy-testing.md) — verify that every
 policy rule works correctly, automatically.
diff --git a/docs/tutorials/policy-as-code/06-policy-testing.md b/docs/tutorials/policy-as-code/06-policy-testing.md
new file mode 100644
index 00000000..d7568a1c
--- /dev/null
+++ b/docs/tutorials/policy-as-code/06-policy-testing.md
@@ -0,0 +1,604 @@
+<!-- Copyright (c) Microsoft Corporation. -->
+<!-- Licensed under the MIT License. -->
+
+# Chapter 6: Policy Testing
+
+In Chapters 1–5, you checked your policies by running a script and eyeballing
+the output. That works when you have five rules. But you now have role-based
+policies, environment-aware rules, conflict resolution, and escalation
+workflows. A single typo in a YAML file can silently change an escalation into
+a hard deny — and nobody notices until a real transfer fails in production.
+
+Manual checking does not scale. You need **automated tests** that verify every
+tool gets the right decision, for every role, every time.
+
+**What you'll learn:**
+
+| Section | Topic |
+|---------|-------|
+| [The problem](#the-problem) | Why eyeballing output is not enough |
+| [Validate the structure](#step-1-validate-the-structure) | Catch structural errors before anything runs |
+| [Write test scenarios](#step-2-write-test-scenarios) | Declare expected outcomes, run them automatically |
+| [The test matrix](#step-3-the-test-matrix) | Combine policies from Chapters 2 and 4, then test every role × environment × tool |
+| [Catch a regression](#step-4-catch-a-regression) | Find the bug that manual checking misses |
+| [Try it yourself](#try-it-yourself) | Exercises |
+
+---
+
+## The problem
+
+Manual checking breaks down fast. Once you have multiple policies, you need a
+repeatable way to say "for this context, I expect this decision" and verify the
+result automatically.
+
+---
+
+## Step 1: Validate the structure
+
+Before testing any decisions, make sure the YAML is well-formed. A misspelled
+operator or a missing field will cause confusing failures later. Catch
+structural errors first.
+
+If you are using the checked-in example files from the repo root, use the full
+paths shown in the commands below. If you created your own copies locally,
+replace them with your local filenames.
+
+### A valid policy (`06_test_policy.yaml`)
+
+This policy combines concepts from earlier chapters — allow, deny, escalation,
+and a default — into a single file designed for testing:
+
+```yaml
+version: "1.0"
+name: test-policy
+description: >
+  Combined policy for automated testing.  Covers allow, deny,
+  escalation-tagged deny, and default-allow so that test scenarios
+  can verify every decision path in one pass.
+
+rules:
+  # Tier 1: Always denied — irreversibly destructive
+  - name: block-delete-database
+    condition:
+      field: tool_name
+      operator: eq
+      value: delete_database
+    action: deny
+    priority: 100
+    message: "Destructive action: deleting databases is never allowed"
+
+  # Tier 2: Escalation — needs human review
+  - name: escalate-transfer-funds
+    condition:
+      field: tool_name
+      operator: eq
+      value: transfer_funds
+    action: deny
+    priority: 90
+    message: "Sensitive action: transfer_funds requires human approval"
+
+  - name: escalate-send-email
+    condition:
+      field: tool_name
+      operator: eq
+      value: send_email
+    action: deny
+    priority: 85
+    message: "Sensitive action: send_email requires human approval"
+
+  # Tier 3: Always allowed — safe, read-only actions
+  - name: allow-search-documents
+    condition:
+      field: tool_name
+      operator: eq
+      value: search_documents
+    action: allow
+    priority: 80
+    message: "Safe action: searching documents is always allowed"
+
+  # Tier 4: Explicit deny — not needed by this agent
+  - name: block-write-file
+    condition:
+      field: tool_name
+      operator: eq
+      value: write_file
+    action: deny
+    priority: 70
+    message: "Write access is not permitted for this agent"
+
+defaults:
+  action: allow
+  max_tool_calls: 10
+```
+
+Five rules, four decision tiers, one default. Enough to test every path.
+
+### Loading and validating
+
+```python
+from pathlib import Path
+
+from agent_os.policies.schema import PolicyDocument
+
+examples_dir = Path("docs/tutorials/policy-as-code/examples")
+
+policy = PolicyDocument.from_yaml(examples_dir / "06_test_policy.yaml")
+print(policy.name)        # "test-policy"
+print(len(policy.rules))  # 5
+```
+
+`PolicyDocument.from_yaml()` does two things: it parses the YAML and validates
+it against the schema. If the file is valid, you get a `PolicyDocument` object.
+If not, you get a `ValidationError` that tells you exactly what is wrong.
+
+### A broken policy
+
+What if someone types `equals` instead of `eq`?
+
+```python
+from pydantic import ValidationError
+
+broken = {
+    "version": "1.0",
+    "name": "broken-policy",
+    "rules": [{
+        "name": "bad-rule",
+        "condition": {
+            "field": "tool_name",
+            "operator": "equals",   # wrong — should be "eq"
+            "value": "send_email",
+        },
+        "action": "deny",
+    }],
+}
+
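+try:
+    PolicyDocument.model_validate(broken)
+except ValidationError as exc:
+    err = exc.errors()[0]  # first validation error reported by Pydantic
+    print("  🚫 Validation failed (as expected):")
+    print("     Field:   " + " -> ".join(str(part) for part in err["loc"]))
+    print("     Problem: " + err["msg"])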
+```
+
+### Example output
+
+```
+  🚫 Validation failed (as expected):
+     Field:   rules -> 0 -> condition -> operator
+     Problem: Input should be 'eq', 'ne', 'gt', 'lt', 'gte', 'lte', 'in', 'matches' or 'contains'
+```
+
+The error message tells you the exact path (`rules -> 0 -> condition ->
+operator`) and the valid values. You do not need to guess.
+
+### Using the CLI
+
+The same validation is available as a command:
+
+```bash
+python -m agent_os.policies.cli validate \
+  docs/tutorials/policy-as-code/examples/06_test_policy.yaml
+```
+
+```
+OK
+```
+
+Exit code 0 means the file is valid. Exit code 1 means validation failed
+(with the error printed to stderr). Exit code 2 means the file could not be
+found or parsed.
+
+---
+
+## Step 2: Write test scenarios
+
+Validation tells you the YAML is *structured correctly*. Test scenarios tell
+you the policy *behaves correctly* — that each tool gets the right decision.
+
+### The scenarios file (`06_test_scenarios.yaml`)
+
+```yaml
+scenarios:
+  # Always allowed
+  - name: search-documents-allowed
+    context: { tool_name: search_documents }
+    expected_action: allow
+
+  # Always denied (destructive)
+  - name: delete-database-denied
+    context: { tool_name: delete_database }
+    expected_action: deny
+
+  # Escalation-tagged (deny with "requires human approval")
+  - name: transfer-funds-denied
+    context: { tool_name: transfer_funds }
+    expected_action: deny
+
+  - name: send-email-denied
+    context: { tool_name: send_email }
+    expected_action: deny
+
+  # Explicit deny
+  - name: write-file-denied
+    context: { tool_name: write_file }
+    expected_action: deny
+
+  # Default action (tool not in any rule)
+  - name: unknown-tool-uses-default
+    context: { tool_name: read_logs }
+    expected_action: allow
+
+  # Same checks using expected_allowed (boolean)
+  - name: search-documents-is-allowed
+    context: { tool_name: search_documents }
+    expected_allowed: true
+
+  - name: delete-database-is-not-allowed
+    context: { tool_name: delete_database }
+    expected_allowed: false
+```
+
+Each scenario names a context and an expected result. You can check either the
+action string (`expected_action`) or the boolean (`expected_allowed`).
+
+### Running with the CLI
+
+```bash
+python -m agent_os.policies.cli test \
+  docs/tutorials/policy-as-code/examples/06_test_policy.yaml \
+  docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml
+```
+
+```
+8/8 scenarios passed
+```
+
+If any scenario fails, the CLI prints which one and what went wrong:
+
+```
+FAIL: transfer-funds-denied: expected deny, got allow
+7/8 scenarios passed
+```
+
+Exit code 0 means all passed. Exit code 1 means at least one failed.
+
+### Running in Python
+
+The CLI is convenient, but sometimes you want the results in Python — for
+custom formatting, integration with a CI pipeline, or testing multiple
+policies at once.
+
+```python
+from pathlib import Path
+
+import yaml
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+examples_dir = Path("docs/tutorials/policy-as-code/examples")
+
+policy = PolicyDocument.from_yaml(examples_dir / "06_test_policy.yaml")
+evaluator = PolicyEvaluator(policies=[policy])
+
+with open(examples_dir / "06_test_scenarios.yaml") as f:
+    scenarios = yaml.safe_load(f)["scenarios"]
+
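+for scenario in scenarios:
+    decision = evaluator.evaluate(scenario["context"])
+    # A scenario may check the action string, the allowed boolean, or both.
+    ok = True
+    if "expected_action" in scenario:
+        ok = ok and decision.action == scenario["expected_action"]
+    if "expected_allowed" in scenario:
+        ok = ok and decision.allowed == scenario["expected_allowed"]
+    status = "✅ pass" if ok else "❌ FAIL"
+    print(f"{scenario['name']}: {status}")
+```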
+
+### Example output
+
+```
+  Scenario                         Expected   Actual     Result
+  --------------------------------------------------------------------
+  search-documents-allowed         allow      allow      ✅ pass
+  delete-database-denied           deny       deny       ✅ pass
+  transfer-funds-denied            deny       deny       ✅ pass
+  send-email-denied                deny       deny       ✅ pass
+  write-file-denied                deny       deny       ✅ pass
+  unknown-tool-uses-default        allow      allow      ✅ pass
+  search-documents-is-allowed      true       true       ✅ pass
+  delete-database-is-not-allowed   false      false      ✅ pass
+
+  ✅ 8/8 scenarios passed
+```
+
+---
+
+## Step 3: The test matrix
+
+The scenarios in Step 2 test one policy in isolation. But in production,
+**multiple policies apply at the same time**: the reader policy from Chapter 2
+and the environment policy from Chapter 4. When both are active, their rules
+merge and interact. A rule from one policy can override a rule from another —
+and the result might not be what anyone intended.
+
+A **test matrix** crosses every role, every environment, and every tool. It
+tests the *combined system*, not individual pieces.
+
+### Building the combined system
+
+Load the role policies from Chapter 2 and the environment policy from Chapter
+4. For each role, combine its policy with the shared environment policy:
+
+```python
+from pathlib import Path
+
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+examples_dir = Path("docs/tutorials/policy-as-code/examples")
+
+reader_policy = PolicyDocument.from_yaml(examples_dir / "02_reader_policy.yaml")
+admin_policy = PolicyDocument.from_yaml(examples_dir / "02_admin_policy.yaml")
+env_policy = PolicyDocument.from_yaml(examples_dir / "04_env_policy.yaml")
+
+# Each role gets its own policy + the shared environment policy.
+# The evaluator merges all rules and sorts by priority.
+role_policies = {
+    "reader": [reader_policy, env_policy],
+    "admin":  [admin_policy, env_policy],
+}
+
+tools = ["search_documents", "write_file", "send_email",
+         "delete_database", "transfer_funds"]
+environments = ["development", "production"]
+
+for tool in tools:
+    for role, policies in role_policies.items():
+        for env in environments:
+            evaluator = PolicyEvaluator(policies=list(policies))
+            decision = evaluator.evaluate({"tool_name": tool, "environment": env})
+            # check against expected ...
+```
+
+When two policies are loaded into one evaluator, their rules are merged into a
+single list sorted by priority, highest first. The first rule that matches the
+context wins, regardless of which policy file it came from. This is where
+surprising interactions happen.
+
+### Example output
+
+```
+  Tool                   reader/dev  reader/prod  admin/dev   admin/prod
+  -----------------------------------------------------------------------
+  search_documents       ✅ allow     🚫 deny      ✅ allow    🚫 deny
+  write_file             ✅ allow ⚠️  🚫 deny      ✅ allow    🚫 deny
+  send_email             🚫 deny      🚫 deny      ✅ allow    🚫 deny
+  delete_database        🚫 deny      🚫 deny      🚫 deny    🚫 deny
+  transfer_funds         ✅ allow     🚫 deny      ✅ allow    🚫 deny
+
+  19/20 cells match expectations.  1 surprise:
+
+  ⚠️  reader + development + write_file
+     Expected: deny (reader policy blocks write_file at priority 80)
+     Actual:   allow (environment policy allows development at priority 90)
+     Reason:   Development environment: agents can act freely
+```
+
+### What just happened?
+
+The matrix found a real interaction bug. `block-write-file` is priority 80, but
+`allow-development` is priority 90, so the environment rule wins first and the
+reader is allowed to write files in development. You would not catch that by
+reading the YAML files one at a time.
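+
+The interaction can be reproduced in a few lines. This is a minimal sketch,
+assuming a simplified rule shape (flat `field`/`value` dictionaries) and a
+hypothetical `first_match` helper. It is not the evaluator's actual
+implementation, and the `allow-development` condition is an assumption based
+on the rule's name and reason text.
+
+```python
+# Simplified, hypothetical rule records; the real policies are YAML files.
+rules = [
+    {"name": "block-write-file", "field": "tool_name",
+     "value": "write_file", "action": "deny", "priority": 80},
+    {"name": "allow-development", "field": "environment",
+     "value": "development", "action": "allow", "priority": 90},
+]
+
+
+def first_match(rules, context, default="allow"):
+    """Return (rule name, action) for the highest-priority matching rule."""
+    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
+        if context.get(rule["field"]) == rule["value"]:
+            return rule["name"], rule["action"]
+    return None, default  # no rule matched; fall back to the default action
+
+
+dev = {"tool_name": "write_file", "environment": "development"}
+prod = {"tool_name": "write_file", "environment": "production"}
+
+print(first_match(rules, dev))   # ('allow-development', 'allow')
+print(first_match(rules, prod))  # ('block-write-file', 'deny')
+```
+
+Because priority 90 sorts ahead of priority 80, the environment rule is checked
+first and short-circuits the reader's block in development.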
+
+---
+
+## Step 4: Catch a regression
+
+This is the payoff. Here is a bug that would be nearly invisible to a human
+reviewer — but a test catches it instantly.
+
+### The scenario
+
+Someone edits the policy and changes the `transfer_funds` rule's message from:
+
+```
+"Sensitive action: transfer_funds requires human approval"
+```
+
+to:
+
+```
+"Sensitive action: transfer_funds is blocked"
+```
+
+The rule still says `action: deny`. Nothing else changed. A YAML diff shows
+one line modified. A human reviewer might glance at it and approve.
+
+But in the code, the escalation system uses the phrase `"requires human
+approval"` in the message to distinguish an escalation from a hard deny
+(Chapter 5). Removing that phrase silently converts an escalation — where a
+human could approve the transfer — into an unconditional block.
+
+### What the test shows
+
+```
+  Original policy:  transfer_funds → ⏳ escalate (escalate)
+  Modified policy:  transfer_funds → 🚫 deny     (deny)
+
+  ❌ Regression detected!
+     transfer_funds changed from 'escalate' to 'deny'.
+     The edit removed the escalation keyword, so the action
+     that used to pause for human review now silently blocks.
+```
+
+The test compared the *classification* of the decision, not just the raw
+action string. Both versions return `action: deny`, but only the original still
+means "escalate."
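+
+A classification check of this kind might look like the sketch below. It
+assumes the keyword convention described above. The `Decision` dataclass is a
+hypothetical stand-in for the evaluator's decision object, so the snippet runs
+without the library installed.
+
+```python
+from dataclasses import dataclass
+
+ESCALATION_KEYWORD = "requires human approval"
+
+
+@dataclass
+class Decision:
+    """Hypothetical stand-in with the two fields the check needs."""
+    allowed: bool
+    reason: str
+
+
+def classify(decision: Decision) -> str:
+    """Tag a decision as allow, escalate, or deny."""
+    if decision.allowed:
+        return "allow"
+    if ESCALATION_KEYWORD in decision.reason.lower():
+        return "escalate"
+    return "deny"
+
+
+original = Decision(False, "Sensitive action: transfer_funds requires human approval")
+modified = Decision(False, "Sensitive action: transfer_funds is blocked")
+
+before, after = classify(original), classify(modified)
+if before != after:
+    print(f"Regression: transfer_funds changed from {before!r} to {after!r}")
+```
+
+Both decisions are denials, yet the classification differs, which is exactly
+the change a comparison of raw action strings would miss.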
+
+---
+
+## Full example
+
+```bash
+python docs/tutorials/policy-as-code/examples/06_policy_testing.py
+```
+
+```
+============================================================
+  Chapter 6: Policy Testing
+============================================================
+
+--- Part 1: Validate the structure ---
+
+  ✅ 'test-policy' loaded successfully
+     5 rules, default action: allow
+
+  🚫 Validation failed (as expected):
+     Field:   rules -> 0 -> condition -> operator
+     Problem: Input should be 'eq', 'ne', 'gt', 'lt', 'gte', 'lte', 'in', 'matches' or 'contains'
+
+  PolicyDocument.from_yaml() catches structural errors
+  before any rule is evaluated. A typo like 'equals'
+  instead of 'eq' is caught immediately.
+
+--- Part 2: Run test scenarios ---
+
+  Scenario                         Expected   Actual     Result
+  --------------------------------------------------------------------
+  search-documents-allowed         allow      allow      ✅ pass
+  delete-database-denied           deny       deny       ✅ pass
+  transfer-funds-denied            deny       deny       ✅ pass
+  send-email-denied                deny       deny       ✅ pass
+  write-file-denied                deny       deny       ✅ pass
+  unknown-tool-uses-default        allow      allow      ✅ pass
+  search-documents-is-allowed      true       true       ✅ pass
+  delete-database-is-not-allowed   false      false      ✅ pass
+
+  ✅ 8/8 scenarios passed
+
+  Each scenario is one line in a YAML file. The test runner
+  evaluates the policy and compares the actual result to the
+  expected result. No manual checking required.
+
+--- Part 3: The test matrix ---
+
+  Loading policies from chapters 2 and 4...
+
+  Tool                   reader/dev  reader/prod  admin/dev   admin/prod
+  -----------------------------------------------------------------------
+  search_documents       ✅ allow     🚫 deny      ✅ allow    🚫 deny
+  write_file             ✅ allow ⚠️  🚫 deny      ✅ allow    🚫 deny
+  send_email             🚫 deny      🚫 deny      ✅ allow    🚫 deny
+  delete_database        🚫 deny      🚫 deny      🚫 deny    🚫 deny
+  transfer_funds         ✅ allow     🚫 deny      ✅ allow    🚫 deny
+
+  19/20 cells match expectations.  1 surprise(s):
+
+  ⚠️  reader + development + write_file
+     Expected: deny
+     Actual:   allow (from rule: allow-development)
+     Reason:   Development environment: agents can act freely
+
+  The reader policy blocks write_file at priority 80.
+  But the environment policy allows development at priority 90.
+  Priority 90 beats 80 — the environment rule fires first.
+  Without the test matrix, this interaction is invisible.
+
+--- Part 4: Catch a regression ---
+
+  Scenario: someone edits the policy and removes the phrase
+  "requires human approval" from the transfer_funds rule.
+  The tool silently flips from escalate to hard deny.
+
+  Original policy:  transfer_funds → ⏳ escalate (escalate)
+  Modified policy:  transfer_funds → 🚫 deny     (deny)
+
+  ❌ Regression detected!
+     transfer_funds changed from 'escalate' to 'deny'.
+     The edit removed the escalation keyword, so the action
+     that used to pause for human review now silently blocks.
+
+  A human scanning the YAML diff might miss this. But a test
+  scenario that checks for the escalation keyword catches it
+  instantly. That is the value of automated policy testing:
+  changes that look harmless cannot silently break behavior.
+
+============================================================
+  Policies are code. Test them like code.
+  Validate the structure, write expected outcomes,
+  run them automatically, and catch regressions
+  before they reach production.
+============================================================
+```
+
+---
+
+## How does it work?
+
+```
+  Role policy     Environment policy
+  (ch2)           (ch4)
+      │                │
+      └────────┬───────┘
+               ▼
+  ┌─────────────────────────────────┐
+  │  1. Validate each file          │
+  │     PolicyDocument.from_yaml()  │
+  └──────────┬──────────────────────┘
+             ▼
+  ┌─────────────────────────────────┐
+  │  2. Test each policy alone      │
+  │     CLI: policy test            │
+  └──────────┬──────────────────────┘
+             ▼
+  ┌─────────────────────────────────┐
+  │  3. Test the combined system    │
+  │     Python: multi-policy eval   │
+  └──────────┬──────────────────────┘
+             │
+      ┌──────┴──────┐
+      ▼             ▼
+  All pass     Surprises found
+  ✅ Deploy    ❌ Fix and re-run
+```
+
+| Tool | What it does |
+|------|-------------|
+| `PolicyDocument.from_yaml(path)` | Load YAML and validate it against the Pydantic schema |
+| `PolicyDocument.model_validate(dict)` | Validate a Python dict without loading a file |
+| `PolicyEvaluator(policies=[...])` | Merge rules from multiple policies |
+| `evaluator.evaluate(context)` | Return a `PolicyDecision` with `allowed`, `action`, `reason` |
+| `policy validate <file>` | CLI: validate structure, print OK or FAIL |
+| `policy test <policy> <scenarios>` | CLI: run scenarios, print pass count |
+
+---
+
+## Try it yourself
+
+1. **Fix the surprise.** The test matrix found that `reader + development +
+   write_file` is unexpectedly allowed. Edit `02_reader_policy.yaml` and
+   raise `block-write-file`'s priority to 95 (above the environment policy's
+   90). Re-run the script — the ⚠️ should disappear.
+
+2. **Add a staging environment.** The environment policy has rules for
+   development and production, but not staging. Add `staging` to the
+   environments list in the test matrix. What happens? Does the default deny
+   or allow? Add a scenario to verify.
+
+3. **Extend the matrix.** Create a third policy file for an "operator" role
+   that can search documents and send emails but cannot write files or delete
+   databases. Add it to the Python test matrix and verify the results across
+   all environments.
+
+---
+
+## What's missing?
+
+Policies change over time. Legal tells you that `send_email` must now be a
+hard block instead of an escalation. The policy needs to be updated
+from version 1.0 to version 2.0. But how do you make that change without
+accidentally breaking something that was already working?
+
+You need a way to **compare two versions** side by side — see exactly what
+changed, run the test suite against *both* versions, and find regressions
+before the new version goes live. That is policy versioning.
+
+**Previous:** [Chapter 5 — Approval Workflows](05-approval-workflows.md)
+**Next:** [Chapter 7 — Policy Versioning](07-policy-versioning.md) — compare
+v1 vs v2 behavior, catch regressions before deploying.
diff --git a/docs/tutorials/policy-as-code/07-policy-versioning.md b/docs/tutorials/policy-as-code/07-policy-versioning.md
new file mode 100644
index 00000000..5f63bd19
--- /dev/null
+++ b/docs/tutorials/policy-as-code/07-policy-versioning.md
@@ -0,0 +1,317 @@
+<!-- Copyright (c) Microsoft Corporation. -->
+<!-- Licensed under the MIT License. -->
+
+# Chapter 7: Policy Versioning
+
+Chapter 6 proved that your policies work *right now*. But policies change.
+Legal tells you that `send_email` should be a hard block, not an escalation.
+Someone fixes that — and accidentally breaks `transfer_funds` in the same
+edit. You need a way to compare two versions, test both, and catch the
+regression before the new version goes live.
+
+**What you'll learn:**
+
+| Section | Topic |
+|---------|-------|
+| [Two versions side by side](#step-1-two-versions-side-by-side) | What changed between v1 and v2 |
+| [Diff with the CLI](#step-2-diff-with-the-cli) | See every structural change in one command |
+| [Test both versions](#step-3-test-both-versions) | Run the same contexts against v1 and v2 |
+| [Catch the regression](#step-4-catch-the-regression) | Separate expected changes from accidents |
+
+---
+
+## Step 1: Two versions side by side
+
+Version 1.0 is the production baseline — the same combined policy from
+Chapter 6 with five rules covering all decision tiers.
+
+Version 2.0 has three changes:
+
+| # | Change | Intentional? |
+|---|--------|-------------|
+| 1 | `block-write-file` priority raised from 70 to 95 | Yes — fixes the Chapter 6 surprise where the environment policy overrode the block |
+| 2 | `escalate-send-email` message no longer says "requires human approval" | Yes — legal decided send_email should be fully blocked |
+| 3 | `escalate-transfer-funds` message no longer says "requires human approval" | No — accidental edit, breaks the escalation |
+
+Changes 1 and 2 are intentional. Change 3 happened because someone edited
+both escalation rules instead of just one. The YAML diff looks like a
+routine cleanup. The damage is invisible without a behavioral test.
+
+---
+
+## Step 2: Diff with the CLI
+
+The built-in `diff` command compares two policy files structurally:
+
+```bash
+python -m agent_os.policies.cli diff \
+  docs/tutorials/policy-as-code/examples/07_policy_v1.yaml \
+  docs/tutorials/policy-as-code/examples/07_policy_v2.yaml
+```
+
+```
+rule escalate-transfer-funds: message: Sensitive action: transfer_funds requires human approval -> Sensitive action: transfer_funds is blocked
+rule escalate-send-email: message: Sensitive action: send_email requires human approval -> Communication: send_email is blocked by policy
+rule block-write-file: priority: 70 -> 95
+version: 1.0 -> 2.0
+```
+
+Every structural change is listed: two messages changed, one priority
+raised, and the version bumped. But the diff does not tell you which change
+breaks behavior. For that, you need to run both versions through the same
+tests.
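+
+To make "structural" concrete, here is a rough sketch of a field-by-field
+comparison. The `diff_rules` function and the dictionary rule shape are
+illustrative assumptions, not the CLI's implementation. The sketch only
+covers rules that exist in both versions.
+
+```python
+def diff_rules(old, new):
+    """Yield one line per changed field for rules present in both lists."""
+    new_by_name = {rule["name"]: rule for rule in new}
+    for rule in old:
+        other = new_by_name.get(rule["name"])
+        if other is None:
+            continue  # removed rules are out of scope for this sketch
+        for key, value in rule.items():
+            if value != other.get(key):
+                yield f"rule {rule['name']}: {key}: {value} -> {other.get(key)}"
+
+
+v1_rules = [{"name": "block-write-file", "action": "deny", "priority": 70}]
+v2_rules = [{"name": "block-write-file", "action": "deny", "priority": 95}]
+
+for line in diff_rules(v1_rules, v2_rules):
+    print(line)  # rule block-write-file: priority: 70 -> 95
+```
+
+A comparison like this reports what changed in the file, but it has no way to
+know that a message edit also changes how a decision is classified.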
+
+---
+
+## Step 3: Test both versions
+
+Load v1 and v2 into separate evaluators and run the same five tools through
+both. Use the `classify()` helper from Chapter 6 to tag each result as
+allow, escalate, or deny:
+
+```python
+from pathlib import Path
+
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+examples_dir = Path("docs/tutorials/policy-as-code/examples")
+
+v1 = PolicyDocument.from_yaml(examples_dir / "07_policy_v1.yaml")
+v2 = PolicyDocument.from_yaml(examples_dir / "07_policy_v2.yaml")
+
+eval_v1 = PolicyEvaluator(policies=[v1])
+eval_v2 = PolicyEvaluator(policies=[v2])
+
+ESCALATION_KEYWORD = "requires human approval"
+
+def classify(decision):
+    if decision.allowed:
+        return "allow"
+    if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
+        return "escalate"
+    return "deny"
+
+tools = ["search_documents", "write_file", "send_email",
+         "delete_database", "transfer_funds"]
+
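+results = []  # (tool, v1 tier, v2 tier, changed) tuples, reused in Step 4
+for tool in tools:
+    ctx = {"tool_name": tool}
+    t1 = classify(eval_v1.evaluate(ctx))
+    t2 = classify(eval_v2.evaluate(ctx))
+    changed = t1 != t2
+    results.append((tool, t1, t2, changed))
+    flag = "⚠️  yes" if changed else ""
+    print(f"{tool:<22s} {t1:<12s} {t2:<12s} {flag}")
+```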
+
+### Example output
+
+```
+  Tool                   v1             v2             Changed?
+  ----------------------------------------------------------
+  search_documents       ✅ allow        ✅ allow
+  write_file             🚫 deny         🚫 deny
+  send_email             ⏳ escalate     🚫 deny         ⚠️  yes
+  delete_database        🚫 deny         🚫 deny
+  transfer_funds         ⏳ escalate     🚫 deny         ⚠️  yes
+
+  2 tool(s) changed behavior between versions.
+```
+
+Two tools changed: `send_email` and `transfer_funds`. Both went from
+escalate to deny. The structural diff listed three rule changes, but only
+two of them alter behavior when v2 is evaluated on its own. The `write_file`
+priority change has no effect in single-policy evaluation. It matters only
+when v2 is combined with the environment policy, which is the interaction
+the Chapter 6 test matrix checks.
+
+---
+
+## Step 4: Catch the regression
+
+The team planned one behavioral change: `send_email` should become a hard
+deny. Anything else that changed is a regression.
+
+```python
+expected_changes = {"send_email"}
+
+for tool, tier1, tier2, changed in results:
+    if not changed:
+        continue
+    if tool in expected_changes:
+        print(f"✅ {tool}: {tier1} → {tier2} (expected)")
+    else:
+        print(f"❌ {tool}: {tier1} → {tier2} (REGRESSION)")
+```
+
+```
+  ✅ send_email: escalate → deny (expected — legal decision)
+  ❌ transfer_funds: escalate → deny (REGRESSION)
+
+  ❌ Regression: transfer_funds
+     Was 'escalate' in v1, now 'deny' in v2.
+     The v2 edit removed the escalation keyword from the
+     message, so the action that used to pause for human
+     review now silently blocks.
+
+  Fix the regression in v2, then re-run this comparison.
+  Do not deploy until all changes are expected.
+```
+
+The regression is the same type Chapter 6 caught in Part 4 — removing
+`"requires human approval"` silently converts an escalation into a hard
+deny. But this time, the test compares *two versions* instead of checking
+one version in isolation. That is what makes it a versioning check: you can
+see exactly when the behavior changed and which edit caused it.
+
+---
+
+## Full example
+
+```bash
+python docs/tutorials/policy-as-code/examples/07_policy_versioning.py
+```
+
+```
+============================================================
+  Chapter 7: Policy Versioning
+============================================================
+
+--- Part 1: Load both versions ---
+
+  v1: 'production-policy' version 1.0  (5 rules)
+  v2: 'production-policy' version 2.0  (5 rules)
+
+--- Part 2: Diff the two versions ---
+
+  version: 1.0 → 2.0
+  rule escalate-transfer-funds: message changed
+    was: "Sensitive action: transfer_funds requires human approval"
+    now: "Sensitive action: transfer_funds is blocked"
+  rule escalate-send-email: message changed
+    was: "Sensitive action: send_email requires human approval"
+    now: "Communication: send_email is blocked by policy"
+  rule block-write-file: priority 70 → 95
+
+  The diff lists every structural change. But a diff cannot
+  tell you whether a change is safe. You need to test both
+  versions and compare the results.
+
+--- Part 3: Test both versions ---
+
+  Tool                   v1             v2             Changed?
+  ----------------------------------------------------------
+  search_documents       ✅ allow        ✅ allow
+  write_file             🚫 deny         🚫 deny
+  send_email             ⏳ escalate     🚫 deny         ⚠️  yes
+  delete_database        🚫 deny         🚫 deny
+  transfer_funds         ⏳ escalate     🚫 deny         ⚠️  yes
+
+  2 tool(s) changed behavior between versions.
+
+--- Part 4: Detect regressions ---
+
+  ✅ send_email: escalate → deny (expected — legal decision)
+  ❌ transfer_funds: escalate → deny (REGRESSION)
+
+  ❌ Regression: transfer_funds
+     Was 'escalate' in v1, now 'deny' in v2.
+     The v2 edit removed the escalation keyword from the
+     message, so the action that used to pause for human
+     review now silently blocks.
+
+  Fix the regression in v2, then re-run this comparison.
+  Do not deploy until all changes are expected.
+
+============================================================
+  Policy versioning closes the loop.
+  Tag a version, diff it, test both, catch regressions.
+  No policy update ships without passing this check.
+============================================================
+```
+
+---
+
+## How does it work?
+
+```
+  v1.yaml          v2.yaml
+     │                │
+     └────────┬───────┘
+              ▼
+  ┌───────────────────────────┐
+  │  1. Diff                  │
+  │     CLI: policy diff      │
+  │     List structural diffs │
+  └──────────┬────────────────┘
+             ▼
+  ┌───────────────────────────┐
+  │  2. Test both             │
+  │     Same contexts, same   │
+  │     classify() function   │
+  └──────────┬────────────────┘
+             │
+      ┌──────┴──────┐
+      ▼             ▼
+  No changes    Changes found
+  ✅ Safe to     ↓
+  deploy      ┌──────────────┐
+              │ 3. Classify  │
+              │ Expected vs  │
+              │ Regression   │
+              └──────┬───────┘
+                     │
+              ┌──────┴──────┐
+              ▼             ▼
+          Expected      Regression
+          ✅ Deploy     ❌ Fix first
+```
+
+| Tool | What it does |
+|------|-------------|
+| `policy diff v1.yaml v2.yaml` | CLI: structural diff between two policy files |
+| `PolicyDocument.from_yaml(path)` | Load and validate a policy file |
+| `PolicyEvaluator(policies=[doc])` | Create an evaluator from a PolicyDocument |
+| `evaluator.evaluate(context)` | Return a `PolicyDecision` with `allowed`, `action`, `reason` |
+| `classify(decision)` | Tag a decision as allow, escalate, or deny (from Chapter 6) |
+
+---
+
+## Try it yourself
+
+1. **Add a new rule in v2.** In v2 only, create a rule `block-execute-code`
+   that denies `execute_code`. Re-run the diff; it should report
+   `rule added: block-execute-code`. Add `execute_code` to the `tools` list,
+   test both versions to confirm the rule affects only v2, and add the
+   change to `expected_changes` so it is not flagged as a regression.
+
+2. **Bridge conversion.** Import `governance_to_document` from
+   `agent_os.policies.bridge` and convert a `GovernancePolicy` object
+   into a `PolicyDocument`. Diff the result against v1 to see how the
+   legacy format maps to the declarative format.
+
+3. **Automate the gate.** Write a function `is_safe_to_deploy(v1_path,
+   v2_path, expected)` that loads both files, diffs them, tests both,
+   and returns `True` only if every behavioral change is in the
+   `expected` set. This is a deploy gate — run it in CI before any policy
+   update ships.
+
+---
+
+## What you've built
+
+Over seven chapters, you built a complete policy governance system:
+
+| Chapter | Layer |
+|---------|-------|
+| 1 | Block dangerous tools |
+| 2 | Scope permissions by role |
+| 3 | Rate-limit actions |
+| 4 | Resolve conflicts between policies |
+| 5 | Escalate sensitive actions to humans |
+| 6 | Test policies automatically |
+| 7 | Update policies safely with regression detection |
+
+Each layer added one concept. Together, they form a system that can
+govern AI agents in production: who can do what, how often, who approves,
+how you test it, and how you update it without breaking what already works.
+
+**Previous:** [Chapter 6 — Policy Testing](06-policy-testing.md)
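
Review note on exercise 3 ("Automate the gate") in the chapter above: a minimal sketch of the comparison at the heart of such a gate. It assumes each version's results have already been reduced to a tool-to-tier mapping (the tier strings that `classify()` returns). The names `find_regressions`, `V1_TIERS`, and `V2_TIERS` are illustrative and not part of `agent_os`, and the gate here takes those mappings rather than the file paths the exercise asks for. The tier values are copied from the Part 3 table.

```python
from __future__ import annotations


def find_regressions(
    v1_tiers: dict[str, str],
    v2_tiers: dict[str, str],
    expected: set[str],
) -> list[str]:
    """Return tools whose tier changed between versions and is not in `expected`."""
    regressions = []
    # Walk the union so a tool present in only one version is also reported.
    for tool in sorted(v1_tiers.keys() | v2_tiers.keys()):
        if v1_tiers.get(tool) != v2_tiers.get(tool) and tool not in expected:
            regressions.append(tool)
    return regressions


def is_safe_to_deploy(
    v1_tiers: dict[str, str],
    v2_tiers: dict[str, str],
    expected: set[str],
) -> bool:
    """Deploy gate: True only if every behavioral change was expected."""
    return not find_regressions(v1_tiers, v2_tiers, expected)


# Tiers taken from the Part 3 table of the chapter output.
V1_TIERS = {
    "search_documents": "allow",
    "write_file": "deny",
    "send_email": "escalate",
    "delete_database": "deny",
    "transfer_funds": "escalate",
}
V2_TIERS = {**V1_TIERS, "send_email": "deny", "transfer_funds": "deny"}

print(find_regressions(V1_TIERS, V2_TIERS, expected={"send_email"}))
print(is_safe_to_deploy(V1_TIERS, V2_TIERS, expected={"send_email"}))
```

In CI, a wrapper could load both YAML files, build the two mappings with `PolicyEvaluator` and `classify()`, and exit non-zero when `is_safe_to_deploy` returns `False`.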
diff --git a/docs/tutorials/policy-as-code/README.md b/docs/tutorials/policy-as-code/README.md
index 68fef7c3..9f34579a 100644
--- a/docs/tutorials/policy-as-code/README.md
+++ b/docs/tutorials/policy-as-code/README.md
@@ -20,10 +20,8 @@ pip install agent-os-kernel[full]
 | [03 — Rate Limiting](03-rate-limiting.md) | Preventing runaway agents | Set limits on how many actions an agent can take |
 | [04 — Conditional Policies](04-conditional-policies.md) | Policy composition and conflict resolution | Layer base + environment policies with conflict strategies |
 | [05 — Approval Workflows](05-approval-workflows.md) | Human-in-the-loop for sensitive actions | Route dangerous actions to a human before execution |
-| 06 — Policy Testing | Systematic validation with test matrices | Test every role + action + environment combination |
-| 07 — Policy Versioning | Safe rollout of policy changes | Compare v1 vs v2 behavior, catch regressions before deploying |
-
-> Chapters 06–07 are coming soon.
+| [06 — Policy Testing](06-policy-testing.md) | Systematic validation with test matrices | Test every role + action + environment combination |
+| [07 — Policy Versioning](07-policy-versioning.md) | Safe rollout of policy changes | Compare v1 vs v2 behavior, catch regressions before deploying |
 
 ## Running Examples
 
diff --git a/docs/tutorials/policy-as-code/examples/01_first_policy.yaml b/docs/tutorials/policy-as-code/examples/01_first_policy.yaml
index a970afa4..70a9e86b 100644
--- a/docs/tutorials/policy-as-code/examples/01_first_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/01_first_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: my-first-policy
 description: A simple policy that blocks dangerous agent actions
diff --git a/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml b/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml
index eb2598af..6a53a1c3 100644
--- a/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: admin-policy
 description: Permissive policy for admin agents — only the most dangerous actions are blocked
diff --git a/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml b/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml
index 3b742e91..92f4c648 100644
--- a/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: reader-policy
 description: Restrictive policy for read-only agents
diff --git a/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml b/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml
index 71555815..f8d30b09 100644
--- a/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: rate-limit-policy
 description: Policy that limits how many tool calls an agent can make
diff --git a/docs/tutorials/policy-as-code/examples/04_env_policy.yaml b/docs/tutorials/policy-as-code/examples/04_env_policy.yaml
index d3c41812..4430c633 100644
--- a/docs/tutorials/policy-as-code/examples/04_env_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/04_env_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: environment-policy
 description: Rules that change based on the deployment environment
diff --git a/docs/tutorials/policy-as-code/examples/04_global_policy.yaml b/docs/tutorials/policy-as-code/examples/04_global_policy.yaml
index f638dde2..137d5c17 100644
--- a/docs/tutorials/policy-as-code/examples/04_global_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/04_global_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: global-security-policy
 description: Company-wide rules set by the security team
diff --git a/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml b/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml
index 10c3bb22..07aa0fba 100644
--- a/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml
+++ b/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml
@@ -1,3 +1,6 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
 version: "1.0"
 name: support-team-policy
 description: Rules for the customer support team's agent
diff --git a/docs/tutorials/policy-as-code/examples/06_policy_testing.py b/docs/tutorials/policy-as-code/examples/06_policy_testing.py
new file mode 100644
index 00000000..d4091bee
--- /dev/null
+++ b/docs/tutorials/policy-as-code/examples/06_policy_testing.py
@@ -0,0 +1,308 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+"""
+Chapter 6: Policy Testing — Automated Validation and Test Scenarios
+
+Shows how to validate policy structure, run declarative test scenarios,
+build a role-by-tool test matrix, and catch regressions automatically.
+
+Run from the repo root:
+    pip install agent-os-kernel[full]
+    python docs/tutorials/policy-as-code/examples/06_policy_testing.py
+"""
+
+from __future__ import annotations
+
+import copy
+import sys
+from pathlib import Path
+
+import yaml
+
+# Allow running from the repo root without installing the packages.
+_REPO_ROOT = Path(__file__).resolve().parents[4]
+sys.path.insert(0, str(_REPO_ROOT / "packages" / "agent-os" / "src"))
+
+from pydantic import ValidationError
+
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+EXAMPLES_DIR = Path(__file__).parent
+
+ESCALATION_KEYWORD = "requires human approval"
+
+
+def classify(decision):
+    """Classify a PolicyDecision into allow / escalate / deny."""
+    if decision.allowed:
+        return ("allow", "\u2705 allow   ")
+    if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
+        return ("escalate", "\u23f3 escalate")
+    return ("deny", "\U0001f6ab deny    ")
+
+
+# ── Part 1: Validate the structure ────────────────────────────────────
+
+print("=" * 60)
+print("  Chapter 6: Policy Testing")
+print("=" * 60)
+
+print("\n--- Part 1: Validate the structure ---\n")
+
+# 1a — Load a valid policy
+policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "06_test_policy.yaml")
+print(f"  \u2705 '{policy.name}' loaded successfully")
+print(f"     {len(policy.rules)} rules, default action: {policy.defaults.action.value}")
+
+# 1b — Try to validate a broken policy
+print()
+broken_data = {
+    "version": "1.0",
+    "name": "broken-policy",
+    "rules": [
+        {
+            "name": "bad-rule",
+            "condition": {
+                "field": "tool_name",
+                "operator": "equals",  # wrong — should be "eq"
+                "value": "send_email",
+            },
+            "action": "deny",
+        }
+    ],
+}
+
+try:
+    PolicyDocument.model_validate(broken_data)
+    print("  Unexpected: broken policy passed validation")
+except ValidationError as exc:
+    # Show only the first error for readability
+    first_error = exc.errors()[0]
+    print("  \U0001f6ab Validation failed (as expected):")
+    print(f"     Field:   {' -> '.join(str(p) for p in first_error['loc'])}")
+    print(f"     Problem: {first_error['msg']}")
+
+print()
+print("  PolicyDocument.from_yaml() catches structural errors")
+print("  before any rule is evaluated. A typo like 'equals'")
+print("  instead of 'eq' is caught immediately.")
+
+# ── Part 2: Run test scenarios ────────────────────────────────────────
+
+print("\n--- Part 2: Run test scenarios ---\n")
+
+# Load the scenarios file
+scenarios_path = EXAMPLES_DIR / "06_test_scenarios.yaml"
+with open(scenarios_path, encoding="utf-8") as f:
+    scenarios_data = yaml.safe_load(f)
+
+scenarios = scenarios_data["scenarios"]
+evaluator = PolicyEvaluator(policies=[policy])
+
+passed = 0
+failed = 0
+
+print(f"  {'Scenario':<32s} {'Expected':<10s} {'Actual':<10s} Result")
+print(f"  {'-' * 68}")
+
+for scenario in scenarios:
+    name = scenario["name"]
+    context = scenario.get("context", {})
+    expected_action = scenario.get("expected_action")
+    expected_allowed = scenario.get("expected_allowed")
+
+    decision = evaluator.evaluate(context)
+    actual_action = decision.action
+    actual_allowed = decision.allowed
+
+    ok = True
+    if expected_action is not None and actual_action != expected_action:
+        ok = False
+    if expected_allowed is not None and actual_allowed != expected_allowed:
+        ok = False
+
+    # For display, show whichever field the scenario tested
+    expected_display = expected_action if expected_action is not None else str(expected_allowed).lower()
+    actual_display = actual_action if expected_action is not None else str(actual_allowed).lower()
+
+    status = "\u2705 pass" if ok else "\u274c FAIL"
+    print(f"  {name:<32s} {expected_display:<10s} {actual_display:<10s} {status}")
+
+    if ok:
+        passed += 1
+    else:
+        failed += 1
+
+total = passed + failed
+print()
+if failed == 0:
+    print(f"  \u2705 {passed}/{total} scenarios passed")
+else:
+    print(f"  \u274c {passed}/{total} scenarios passed, {failed} failed")
+
+print()
+print("  Each scenario is one line in a YAML file. The test runner")
+print("  evaluates the policy and compares the actual result to the")
+print("  expected result. No manual checking required.")
+
+# ── Part 3: The test matrix ──────────────────────────────────────────
+
+print("\n--- Part 3: The test matrix ---\n")
+
+print("  Loading policies from chapters 2 and 4...")
+
+# Role policies from Chapter 2
+reader_policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "02_reader_policy.yaml")
+admin_policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "02_admin_policy.yaml")
+
+# Environment policy from Chapter 4
+env_policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "04_env_policy.yaml")
+
+# Combine: each role gets its own policy + the shared environment policy.
+# The evaluator merges all rules and sorts by priority — the first
+# matching rule wins.  This is where surprising interactions happen.
+role_policies = {
+    "reader": [reader_policy, env_policy],
+    "admin":  [admin_policy, env_policy],
+}
+
+environments = ["development", "production"]
+tools = [
+    "search_documents",
+    "write_file",
+    "send_email",
+    "delete_database",
+    "transfer_funds",
+]
+
+# What the team intends — the "answer key":
+#   Reader:   cannot write_file, send_email, delete_database (from ch2)
+#   Admin:    cannot delete_database (from ch2)
+#   Production: everything blocked (from ch4)
+#   Development: role-based rules apply
+intended = {
+    ("reader", "development", "search_documents"): True,
+    ("reader", "development", "write_file"):       False,  # ch2 blocks it
+    ("reader", "development", "send_email"):       False,
+    ("reader", "development", "delete_database"):  False,
+    ("reader", "development", "transfer_funds"):   True,
+    ("reader", "production",  "search_documents"): False,
+    ("reader", "production",  "write_file"):       False,
+    ("reader", "production",  "send_email"):       False,
+    ("reader", "production",  "delete_database"):  False,
+    ("reader", "production",  "transfer_funds"):   False,
+    ("admin",  "development", "search_documents"): True,
+    ("admin",  "development", "write_file"):       True,
+    ("admin",  "development", "send_email"):       True,
+    ("admin",  "development", "delete_database"):  False,
+    ("admin",  "development", "transfer_funds"):   True,
+    ("admin",  "production",  "search_documents"): False,
+    ("admin",  "production",  "write_file"):       False,
+    ("admin",  "production",  "send_email"):       False,
+    ("admin",  "production",  "delete_database"):  False,
+    ("admin",  "production",  "transfer_funds"):   False,
+}
+
+# Print the matrix header
+print()
+print(f"  {'Tool':<22s}", end="")
+for role in role_policies:
+    for env in environments:
+        short_env = "dev" if env == "development" else "prod"
+        label = f"{role}/{short_env}"
+        print(f" {label:<13s}", end="")
+print()
+print(f"  {'-' * 74}")
+
+matrix_pass = 0
+matrix_total = 0
+surprises = []
+
+for tool in tools:
+    print(f"  {tool:<22s}", end="")
+    for role, policies in role_policies.items():
+        for env in environments:
+            evaluator = PolicyEvaluator(policies=list(policies))
+            decision = evaluator.evaluate({"tool_name": tool, "environment": env})
+            icon = "\u2705 allow " if decision.allowed else "\U0001f6ab deny  "
+
+            exp = intended.get((role, env, tool))
+            matrix_total += 1
+            if exp is not None and decision.allowed == exp:
+                matrix_pass += 1
+                print(f" {icon}     ", end="")
+            else:
+                surprises.append((role, env, tool, exp, decision))
+                print(f" {icon} \u26a0\ufe0f ", end="")
+    print()
+
+print()
+if surprises:
+    print(f"  {matrix_pass}/{matrix_total} cells match expectations.  "
+          f"{len(surprises)} surprise(s):\n")
+    for role, env, tool, exp, decision in surprises:
+        exp_label = "unspecified" if exp is None else ("allow" if exp else "deny")
+        act_label = "allow" if decision.allowed else "deny"
+        print(f"  \u26a0\ufe0f  {role} + {env} + {tool}")
+        print(f"     Expected: {exp_label}")
+        print(f"     Actual:   {act_label} (from rule: {decision.matched_rule or 'default'})")
+        print(f"     Reason:   {decision.reason}")
+        print()
+    print("  The reader policy blocks write_file at priority 80.")
+    print("  But the environment policy allows development at priority 90.")
+    print("  Priority 90 beats 80 \u2014 the environment rule fires first.")
+    print("  Without the test matrix, this interaction is invisible.")
+else:
+    print(f"  \u2705 {matrix_pass}/{matrix_total} cells match expectations")
+
+# ── Part 4: Catch a regression ────────────────────────────────────────
+
+print("\n--- Part 4: Catch a regression ---\n")
+
+print("  Scenario: someone edits the policy and removes the phrase")
+print('  "requires human approval" from the transfer_funds rule.')
+print("  The tool silently flips from escalate to hard deny.")
+print()
+
+# Deep-copy the policy and modify the message
+modified_policy = copy.deepcopy(policy)
+for rule in modified_policy.rules:
+    if rule.name == "escalate-transfer-funds":
+        rule.message = "Sensitive action: transfer_funds is blocked"
+        break
+
+# Evaluate transfer_funds with the original and modified policies
+original_eval = PolicyEvaluator(policies=[policy])
+modified_eval = PolicyEvaluator(policies=[modified_policy])
+
+orig_decision = original_eval.evaluate({"tool_name": "transfer_funds"})
+mod_decision = modified_eval.evaluate({"tool_name": "transfer_funds"})
+
+orig_tier, orig_icon = classify(orig_decision)
+mod_tier, mod_icon = classify(mod_decision)
+
+print(f"  Original policy:  transfer_funds \u2192 {orig_icon} ({orig_tier})")
+print(f"  Modified policy:  transfer_funds \u2192 {mod_icon} ({mod_tier})")
+print()
+
+if orig_tier != mod_tier:
+    print("  \u274c Regression detected!")
+    print(f"     transfer_funds changed from '{orig_tier}' to '{mod_tier}'.")
+    print("     The edit removed the escalation keyword, so the action")
+    print("     that used to pause for human review now silently blocks.")
+else:
+    print("  No regression (tiers match).")
+
+print()
+print("  A human scanning the YAML diff might miss this. But a test")
+print("  scenario that checks for the escalation keyword catches it")
+print("  instantly. That is the value of automated policy testing:")
+print("  changes that look harmless cannot silently break behavior.")
+
+print("\n" + "=" * 60)
+print("  Policies are code. Test them like code.")
+print("  Validate the structure, write expected outcomes,")
+print("  run them automatically, and catch regressions")
+print("  before they reach production.")
+print("=" * 60)
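
Review note on the Part 3 matrix above: the surprise comes from first-match-wins evaluation over rules sorted by priority. The toy model below sketches that behavior only; it is not the `PolicyEvaluator` implementation. The `Rule` class, the `evaluate` helper, the rule name `allow-development`, and the default of `allow` are assumptions made for illustration. The priorities 80 and 90 come from the script's own explanation, and 95 is the value v2 uses in Chapter 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    field: str      # context key the rule inspects
    value: str      # value that key must equal for the rule to match
    action: str     # "allow" or "deny"
    priority: int   # higher priority is checked first


def evaluate(rules, context, default="allow"):
    """Return (action, rule_name) for the highest-priority matching rule."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if context.get(rule.field) == rule.value:
            return rule.action, rule.name
    return default, "default"


context = {"tool_name": "write_file", "environment": "development"}

# Hypothetical stand-ins for the reader and environment rules.
reader_block = Rule("block-write-file", "tool_name", "write_file", "deny", 80)
env_allow = Rule("allow-development", "environment", "development", "allow", 90)

# Priority 90 is checked before 80, so the environment rule wins.
surprise = evaluate([reader_block, env_allow], context)

# Raising the block above 90 restores the intended deny.
raised_block = Rule("block-write-file", "tool_name", "write_file", "deny", 95)
fixed = evaluate([raised_block, env_allow], context)

print(surprise)
print(fixed)
```

How the real evaluator breaks ties between equal priorities is not shown in this diff, so the sketch leaves that case to Python's stable sort.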
diff --git a/docs/tutorials/policy-as-code/examples/06_test_policy.yaml b/docs/tutorials/policy-as-code/examples/06_test_policy.yaml
new file mode 100644
index 00000000..f419340f
--- /dev/null
+++ b/docs/tutorials/policy-as-code/examples/06_test_policy.yaml
@@ -0,0 +1,73 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+#
+# Chapter 6: Policy Testing — a combined policy for automated testing.
+#
+# This policy merges the governance concepts from Chapters 1-5 into a
+# single file so that test scenarios can verify every decision tier:
+#   - Always allowed  (search_documents)
+#   - Always denied   (delete_database)
+#   - Escalation      (transfer_funds, send_email)
+#   - Explicit deny   (write_file)
+#   - Default allow   (anything not listed)
+
+version: "1.0"
+name: test-policy
+description: >
+  Combined policy for automated testing.  Covers allow, deny,
+  escalation-tagged deny, and default-allow so that test scenarios
+  can verify every decision path in one pass.
+
+rules:
+  # Tier 1: Always denied — irreversibly destructive
+  - name: block-delete-database
+    condition:
+      field: tool_name
+      operator: eq
+      value: delete_database
+    action: deny
+    priority: 100
+    message: "Destructive action: deleting databases is never allowed"
+
+  # Tier 2: Escalation — needs human review
+  - name: escalate-transfer-funds
+    condition:
+      field: tool_name
+      operator: eq
+      value: transfer_funds
+    action: deny
+    priority: 90
+    message: "Sensitive action: transfer_funds requires human approval"
+
+  - name: escalate-send-email
+    condition:
+      field: tool_name
+      operator: eq
+      value: send_email
+    action: deny
+    priority: 85
+    message: "Sensitive action: send_email requires human approval"
+
+  # Tier 3: Always allowed — safe, read-only actions
+  - name: allow-search-documents
+    condition:
+      field: tool_name
+      operator: eq
+      value: search_documents
+    action: allow
+    priority: 80
+    message: "Safe action: searching documents is always allowed"
+
+  # Tier 4: Explicit deny — not needed by this agent
+  - name: block-write-file
+    condition:
+      field: tool_name
+      operator: eq
+      value: write_file
+    action: deny
+    priority: 70
+    message: "Write access is not permitted for this agent"
+
+defaults:
+  action: allow
+  max_tool_calls: 10
diff --git a/docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml b/docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml
new file mode 100644
index 00000000..940c8d45
--- /dev/null
+++ b/docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml
@@ -0,0 +1,47 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+#
+# Chapter 6: Policy Testing — declarative test scenarios.
+#
+# Each scenario describes one tool call and the expected outcome.
+# Run with:
+#   python -m agent_os.policies.cli test 06_test_policy.yaml 06_test_scenarios.yaml
+
+scenarios:
+  # ── Always allowed ──────────────────────────────────────────────────
+  - name: search-documents-allowed
+    context: { tool_name: search_documents }
+    expected_action: allow
+
+  # ── Always denied (destructive) ─────────────────────────────────────
+  - name: delete-database-denied
+    context: { tool_name: delete_database }
+    expected_action: deny
+
+  # ── Escalation-tagged (deny with "requires human approval") ─────────
+  - name: transfer-funds-denied
+    context: { tool_name: transfer_funds }
+    expected_action: deny
+
+  - name: send-email-denied
+    context: { tool_name: send_email }
+    expected_action: deny
+
+  # ── Explicit deny (not needed by this agent) ────────────────────────
+  - name: write-file-denied
+    context: { tool_name: write_file }
+    expected_action: deny
+
+  # ── Default action (tool not mentioned in any rule) ─────────────────
+  - name: unknown-tool-uses-default
+    context: { tool_name: read_logs }
+    expected_action: allow
+
+  # ── Same checks using expected_allowed (boolean) ────────────────────
+  - name: search-documents-is-allowed
+    context: { tool_name: search_documents }
+    expected_allowed: true
+
+  - name: delete-database-is-not-allowed
+    context: { tool_name: delete_database }
+    expected_allowed: false
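
Review note on the scenarios above: `transfer-funds-denied` asserts only `expected_action: deny`, so it would still pass after the escalation phrase is removed from the rule message. One way to cover that case is a reason check, sketched below. The key `expected_reason_contains`, the `Decision` stand-in class, and the `check_scenario` helper are hypothetical; nothing in this diff shows the CLI test runner supporting such a key. The sketch also assumes, as `classify()` does, that the rule message surfaces in `decision.reason`.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Stand-in for PolicyDecision with only the fields used below."""
    allowed: bool
    action: str
    reason: str


def check_scenario(scenario: dict, decision: Decision) -> bool:
    """Return True if the decision satisfies every expectation in the scenario."""
    expected_action = scenario.get("expected_action")
    if expected_action is not None and decision.action != expected_action:
        return False
    expected_allowed = scenario.get("expected_allowed")
    if expected_allowed is not None and decision.allowed != expected_allowed:
        return False
    # Hypothetical extra key: a case-insensitive substring match on the reason.
    needle = scenario.get("expected_reason_contains")
    if needle is not None and needle.lower() not in (decision.reason or "").lower():
        return False
    return True


scenario = {
    "name": "transfer-funds-escalates",
    "context": {"tool_name": "transfer_funds"},
    "expected_action": "deny",
    "expected_reason_contains": "requires human approval",
}

# Messages copied from the v1 and v2 policy files in Chapter 7.
v1_decision = Decision(False, "deny", "Sensitive action: transfer_funds requires human approval")
v2_decision = Decision(False, "deny", "Sensitive action: transfer_funds is blocked")

print(check_scenario(scenario, v1_decision))
print(check_scenario(scenario, v2_decision))
```

With a check like this, the v2 message edit fails the scenario even though the action is still `deny`.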
diff --git a/docs/tutorials/policy-as-code/examples/07_policy_v1.yaml b/docs/tutorials/policy-as-code/examples/07_policy_v1.yaml
new file mode 100644
index 00000000..84a0e56b
--- /dev/null
+++ b/docs/tutorials/policy-as-code/examples/07_policy_v1.yaml
@@ -0,0 +1,67 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+#
+# Chapter 7: Policy Versioning — version 1.0 baseline.
+#
+# This is the same combined policy from Chapter 6, representing
+# the current production state before any updates.
+
+version: "1.0"
+name: production-policy
+description: >
+  Company-wide production policy — version 1.0 baseline.
+  Covers allow, deny, and escalation tiers for all five tools.
+
+rules:
+  # Tier 1: Always denied — irreversibly destructive
+  - name: block-delete-database
+    condition:
+      field: tool_name
+      operator: eq
+      value: delete_database
+    action: deny
+    priority: 100
+    message: "Destructive action: deleting databases is never allowed"
+
+  # Tier 2: Escalation — needs human review
+  - name: escalate-transfer-funds
+    condition:
+      field: tool_name
+      operator: eq
+      value: transfer_funds
+    action: deny
+    priority: 90
+    message: "Sensitive action: transfer_funds requires human approval"
+
+  - name: escalate-send-email
+    condition:
+      field: tool_name
+      operator: eq
+      value: send_email
+    action: deny
+    priority: 85
+    message: "Sensitive action: send_email requires human approval"
+
+  # Tier 3: Always allowed — safe, read-only actions
+  - name: allow-search-documents
+    condition:
+      field: tool_name
+      operator: eq
+      value: search_documents
+    action: allow
+    priority: 80
+    message: "Safe action: searching documents is always allowed"
+
+  # Tier 4: Explicit deny — restricted access
+  - name: block-write-file
+    condition:
+      field: tool_name
+      operator: eq
+      value: write_file
+    action: deny
+    priority: 70
+    message: "Write access is not permitted for this agent"
+
+defaults:
+  action: allow
+  max_tool_calls: 10
diff --git a/docs/tutorials/policy-as-code/examples/07_policy_v2.yaml b/docs/tutorials/policy-as-code/examples/07_policy_v2.yaml
new file mode 100644
index 00000000..00ace215
--- /dev/null
+++ b/docs/tutorials/policy-as-code/examples/07_policy_v2.yaml
@@ -0,0 +1,75 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+#
+# Chapter 7: Policy Versioning — version 2.0 update.
+#
+# Changes from v1:
+#   1. Version bumped from 1.0 to 2.0
+#   2. block-write-file priority raised from 70 to 95
+#      (fixes the Chapter 6 surprise where environment policy
+#       at priority 90 overrode the block at priority 70)
+#   3. escalate-send-email converted to hard deny — legal decided
+#      send_email should be fully blocked, not escalated
+#   4. escalate-transfer-funds message accidentally edited
+#      (removes "requires human approval" — introduces a regression)
+
+version: "2.0"
+name: production-policy
+description: >
+  Company-wide production policy — version 2.0 update.
+  Blocks send_email outright, fixes write_file priority.
+
+rules:
+  # Tier 1: Always denied — irreversibly destructive
+  - name: block-delete-database
+    condition:
+      field: tool_name
+      operator: eq
+      value: delete_database
+    action: deny
+    priority: 100
+    message: "Destructive action: deleting databases is never allowed"
+
+  # Tier 2: Escalation — needs human review
+  - name: escalate-transfer-funds
+    condition:
+      field: tool_name
+      operator: eq
+      value: transfer_funds
+    action: deny
+    priority: 90
+    message: "Sensitive action: transfer_funds is blocked"
+
+  # Changed: send_email is now a hard deny, no longer escalated
+  - name: escalate-send-email
+    condition:
+      field: tool_name
+      operator: eq
+      value: send_email
+    action: deny
+    priority: 85
+    message: "Communication: send_email is blocked by policy"
+
+  # Tier 3: Always allowed — safe, read-only actions
+  - name: allow-search-documents
+    condition:
+      field: tool_name
+      operator: eq
+      value: search_documents
+    action: allow
+    priority: 80
+    message: "Safe action: searching documents is always allowed"
+
+  # Tier 4: Explicit deny — priority raised to beat environment rules
+  - name: block-write-file
+    condition:
+      field: tool_name
+      operator: eq
+      value: write_file
+    action: deny
+    priority: 95
+    message: "Write access is not permitted for this agent"
+
+defaults:
+  action: allow
+  max_tool_calls: 10
diff --git a/docs/tutorials/policy-as-code/examples/07_policy_versioning.py b/docs/tutorials/policy-as-code/examples/07_policy_versioning.py
new file mode 100644
index 00000000..8526de89
--- /dev/null
+++ b/docs/tutorials/policy-as-code/examples/07_policy_versioning.py
@@ -0,0 +1,196 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+"""
+Chapter 7: Policy Versioning — Compare, Test, and Catch Regressions
+
+Shows how to diff two policy versions, test both with the same contexts,
+and detect regressions before deploying the new version.
+
+Run from the repo root:
+    pip install agent-os-kernel[full]
+    python docs/tutorials/policy-as-code/examples/07_policy_versioning.py
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+# Allow running from the repo root without installing the packages.
+_REPO_ROOT = Path(__file__).resolve().parents[4]
+sys.path.insert(0, str(_REPO_ROOT / "packages" / "agent-os" / "src"))
+
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+EXAMPLES_DIR = Path(__file__).parent
+
+ESCALATION_KEYWORD = "requires human approval"
+
+
+def classify(decision):
+    """Classify a PolicyDecision into allow / escalate / deny."""
+    if decision.allowed:
+        return ("allow", "\u2705 allow   ")
+    if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
+        return ("escalate", "\u23f3 escalate")
+    return ("deny", "\U0001f6ab deny    ")
+
+
+def diff_rules(v1_doc, v2_doc):
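+    """Compare two PolicyDocuments rule by rule and return a list of change strings.
+
+    Covers the version, added and removed rules, priority, message, action,
+    and defaults. Rule conditions are not compared.
+    """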
+    diffs = []
+
+    # Top-level fields
+    if v1_doc.version != v2_doc.version:
+        diffs.append(f"version: {v1_doc.version} \u2192 {v2_doc.version}")
+
+    # Index rules by name
+    v1_rules = {r.name: r for r in v1_doc.rules}
+    v2_rules = {r.name: r for r in v2_doc.rules}
+
+    for name in v2_rules:
+        if name not in v1_rules:
+            diffs.append(f"rule added: {name}")
+
+    for name in v1_rules:
+        if name not in v2_rules:
+            diffs.append(f"rule removed: {name}")
+
+    for name in v1_rules:
+        if name not in v2_rules:
+            continue
+        r1 = v1_rules[name]
+        r2 = v2_rules[name]
+        if r1.priority != r2.priority:
+            diffs.append(f"rule {name}: priority {r1.priority} \u2192 {r2.priority}")
+        if r1.message != r2.message:
+            diffs.append(f"rule {name}: message changed")
+            diffs.append(f"  was: \"{r1.message}\"")
+            diffs.append(f"  now: \"{r2.message}\"")
+        if r1.action != r2.action:
+            diffs.append(f"rule {name}: action {r1.action.value} \u2192 {r2.action.value}")
+
+    # Defaults
+    if v1_doc.defaults.action != v2_doc.defaults.action:
+        diffs.append(f"defaults: action {v1_doc.defaults.action.value} \u2192 {v2_doc.defaults.action.value}")
+    if v1_doc.defaults.max_tool_calls != v2_doc.defaults.max_tool_calls:
+        diffs.append(f"defaults: max_tool_calls {v1_doc.defaults.max_tool_calls} \u2192 {v2_doc.defaults.max_tool_calls}")
+
+    return diffs
+
+
+# ── Part 1: Load both versions ────────────────────────────────────────
+
+print("=" * 60)
+print("  Chapter 7: Policy Versioning")
+print("=" * 60)
+
+print("\n--- Part 1: Load both versions ---\n")
+
+v1 = PolicyDocument.from_yaml(EXAMPLES_DIR / "07_policy_v1.yaml")
+v2 = PolicyDocument.from_yaml(EXAMPLES_DIR / "07_policy_v2.yaml")
+
+print(f"  v1: '{v1.name}' version {v1.version}  ({len(v1.rules)} rules)")
+print(f"  v2: '{v2.name}' version {v2.version}  ({len(v2.rules)} rules)")
+
+# ── Part 2: Diff ──────────────────────────────────────────────────────
+
+print("\n--- Part 2: Diff the two versions ---\n")
+
+changes = diff_rules(v1, v2)
+
+if not changes:
+    print("  No differences found.")
+else:
+    for change in changes:
+        print(f"  {change}")
+
+print()
+print("  The diff lists every structural change. But a diff cannot")
+print("  tell you whether a change is safe. You need to test both")
+print("  versions and compare the results.")
+
+# ── Part 3: Test both versions ────────────────────────────────────────
+
+print("\n--- Part 3: Test both versions ---\n")
+
+eval_v1 = PolicyEvaluator(policies=[v1])
+eval_v2 = PolicyEvaluator(policies=[v2])
+
+tools = [
+    "search_documents",
+    "write_file",
+    "send_email",
+    "delete_database",
+    "transfer_funds",
+]
+
+print(f"  {'Tool':<22s} {'v1':<14s} {'v2':<14s} Changed?")
+print(f"  {'-' * 58}")
+
+results = []
+
+for tool in tools:
+    context = {"tool_name": tool}
+
+    d1 = eval_v1.evaluate(context)
+    d2 = eval_v2.evaluate(context)
+
+    tier1, icon1 = classify(d1)
+    tier2, icon2 = classify(d2)
+
+    changed = tier1 != tier2
+    flag = "\u26a0\ufe0f  yes" if changed else ""
+
+    print(f"  {tool:<22s} {icon1:<14s} {icon2:<14s} {flag}")
+    results.append((tool, tier1, tier2, changed))
+
+changed_count = sum(1 for _, _, _, c in results if c)
+print()
+if changed_count == 0:
+    print("  \u2705 No behavioral changes between v1 and v2.")
+else:
+    print(f"  {changed_count} tool(s) changed behavior between versions.")
+
+# ── Part 4: Detect regressions ────────────────────────────────────────
+
+print("\n--- Part 4: Detect regressions ---\n")
+
+# The team planned two changes in v2:
+#   - block-write-file priority raised (structural, no behavioral change here)
+#   - send_email converted from escalation to hard deny (legal decision)
+# Any other change, including an unplanned tier for send_email, is a regression.
+expected_changes = {"send_email": ("escalate", "deny")}
+
+regressions = []
+
+for tool, tier1, tier2, changed in results:
+    if not changed:
+        continue
+    if expected_changes.get(tool) == (tier1, tier2):
+        print(f"  \u2705 {tool}: {tier1} \u2192 {tier2} (expected \u2014 legal decision)")
+    else:
+        print(f"  \u274c {tool}: {tier1} \u2192 {tier2} (REGRESSION)")
+        regressions.append((tool, tier1, tier2))
+
+if not regressions:
+    print()
+    print("  \u2705 All changes are expected. Safe to deploy v2.")
+else:
+    print()
+    for tool, old, new in regressions:
+        print(f"  \u274c Regression: {tool}")
+        print(f"     Was '{old}' in v1, now '{new}' in v2.")
+        print("     The v2 edit removed the escalation keyword from the")
+        print("     message, so the action that used to pause for human")
+        print("     review now silently blocks.")
+    print()
+    print("  Fix the regression in v2, then re-run this comparison.")
+    print("  Do not deploy until all changes are expected.")
+
+print("\n" + "=" * 60)
+print("  Policy versioning closes the loop.")
+print("  Tag a version, diff it, test both, catch regressions.")
+print("  No policy update ships without passing this check.")
+print("=" * 60)