diff --git a/docs/tutorials/policy-as-code/01-your-first-policy.md b/docs/tutorials/policy-as-code/01-your-first-policy.md index 17c03462..8ee72bbf 100644 --- a/docs/tutorials/policy-as-code/01-your-first-policy.md +++ b/docs/tutorials/policy-as-code/01-your-first-policy.md @@ -1,3 +1,6 @@ + + + # Chapter 1: Your First Policy In this chapter you will write a YAML policy file that blocks dangerous agent diff --git a/docs/tutorials/policy-as-code/02-capability-scoping.md b/docs/tutorials/policy-as-code/02-capability-scoping.md index 246dab74..241752d0 100644 --- a/docs/tutorials/policy-as-code/02-capability-scoping.md +++ b/docs/tutorials/policy-as-code/02-capability-scoping.md @@ -1,3 +1,6 @@ + + + # Chapter 2: Capability Scoping In Chapter 1 you created a single policy that applies to every agent. In diff --git a/docs/tutorials/policy-as-code/03-rate-limiting.md b/docs/tutorials/policy-as-code/03-rate-limiting.md index 66a3fa31..5da0e75e 100644 --- a/docs/tutorials/policy-as-code/03-rate-limiting.md +++ b/docs/tutorials/policy-as-code/03-rate-limiting.md @@ -1,3 +1,6 @@ + + + # Chapter 3: Rate Limiting An agent with the right permissions can still cause problems if it runs out of diff --git a/docs/tutorials/policy-as-code/04-conditional-policies.md b/docs/tutorials/policy-as-code/04-conditional-policies.md index 1e3b4097..2f4bea17 100644 --- a/docs/tutorials/policy-as-code/04-conditional-policies.md +++ b/docs/tutorials/policy-as-code/04-conditional-policies.md @@ -1,3 +1,6 @@ + + + # Chapter 4: Conditional Policies In Chapters 1-3 each policy stood on its own — one file, one evaluator, one diff --git a/docs/tutorials/policy-as-code/05-approval-workflows.md b/docs/tutorials/policy-as-code/05-approval-workflows.md index c78dea20..b3a3e843 100644 --- a/docs/tutorials/policy-as-code/05-approval-workflows.md +++ b/docs/tutorials/policy-as-code/05-approval-workflows.md @@ -463,5 +463,5 @@ every tool gets the right decision in every environment. That is policy testing. 
**Previous:** [Chapter 4 — Conditional Policies](04-conditional-policies.md) -**Next:** Chapter 6 — Policy Testing (coming soon) — verify that every +**Next:** [Chapter 6 — Policy Testing](06-policy-testing.md) — verify that every policy rule works correctly, automatically. diff --git a/docs/tutorials/policy-as-code/06-policy-testing.md b/docs/tutorials/policy-as-code/06-policy-testing.md new file mode 100644 index 00000000..d7568a1c --- /dev/null +++ b/docs/tutorials/policy-as-code/06-policy-testing.md @@ -0,0 +1,604 @@ + + + +# Chapter 6: Policy Testing + +In Chapters 1–5, you checked your policies by running a script and eyeballing +the output. That works when you have five rules. But you now have role-based +policies, environment-aware rules, conflict resolution, and escalation +workflows. A single typo in a YAML file can silently change an escalation into +a hard deny — and nobody notices until a real transfer fails in production. + +Manual checking does not scale. You need **automated tests** that verify every +tool gets the right decision, for every role, every time. + +**What you'll learn:** + +| Section | Topic | +|---------|-------| +| [The problem](#the-problem) | Why eyeballing output is not enough | +| [Validate the structure](#step-1-validate-the-structure) | Catch structural errors before anything runs | +| [Write test scenarios](#step-2-write-test-scenarios) | Declare expected outcomes, run them automatically | +| [The test matrix](#step-3-the-test-matrix) | Combine policies from chapters 2 + 4, test every role × environment × tool | +| [Catch a regression](#step-4-catch-a-regression) | Find the bug that manual checking misses | +| [Try it yourself](#try-it-yourself) | Exercises | + +--- + +## The problem + +Manual checking breaks down fast. Once you have multiple policies, you need a +repeatable way to say "for this context, I expect this decision" and verify the +result automatically. 
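The idea of "for this context, I expect this decision" can be sketched as plain data plus a small checker. This is a minimal illustration, not the tutorial's test runner: `decide` is a hypothetical stand-in for a real policy evaluator, and `check_expectations` is an assumed helper name.

```python
# Hypothetical stand-in for a policy evaluator: maps a context to an action.
def decide(context: dict) -> str:
    blocked = {"delete_database", "transfer_funds"}
    return "deny" if context.get("tool_name") in blocked else "allow"


# Each expectation pairs a context with the decision we expect for it.
EXPECTATIONS = [
    ({"tool_name": "search_documents"}, "allow"),
    ({"tool_name": "delete_database"}, "deny"),
]


def check_expectations(decide_fn, expectations):
    """Return (context, expected, actual) for every mismatch."""
    failures = []
    for context, expected in expectations:
        actual = decide_fn(context)
        if actual != expected:
            failures.append((context, expected, actual))
    return failures


failures = check_expectations(decide, EXPECTATIONS)
print(f"{len(EXPECTATIONS) - len(failures)}/{len(EXPECTATIONS)} expectations met")
```

Because the expectations are data, adding a new check is one more entry in the list rather than one more thing to eyeball.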
+ +--- + +## Step 1: Validate the structure + +Before testing any decisions, make sure the YAML is well-formed. A misspelled +operator or a missing field will cause confusing failures later. Catch +structural errors first. + +If you are using the checked-in example files from the repo root, use the full +paths shown in the commands below. If you created your own copies locally, +replace them with your local filenames. + +### A valid policy (`06_test_policy.yaml`) + +This policy combines concepts from earlier chapters — allow, deny, escalation, +and a default — into a single file designed for testing: + +```yaml +version: "1.0" +name: test-policy +description: > + Combined policy for automated testing. Covers allow, deny, + escalation-tagged deny, and default-allow so that test scenarios + can verify every decision path in one pass. + +rules: + # Tier 1: Always denied — irreversibly destructive + - name: block-delete-database + condition: + field: tool_name + operator: eq + value: delete_database + action: deny + priority: 100 + message: "Destructive action: deleting databases is never allowed" + + # Tier 2: Escalation — needs human review + - name: escalate-transfer-funds + condition: + field: tool_name + operator: eq + value: transfer_funds + action: deny + priority: 90 + message: "Sensitive action: transfer_funds requires human approval" + + - name: escalate-send-email + condition: + field: tool_name + operator: eq + value: send_email + action: deny + priority: 85 + message: "Sensitive action: send_email requires human approval" + + # Tier 3: Always allowed — safe, read-only actions + - name: allow-search-documents + condition: + field: tool_name + operator: eq + value: search_documents + action: allow + priority: 80 + message: "Safe action: searching documents is always allowed" + + # Tier 4: Explicit deny — not needed by this agent + - name: block-write-file + condition: + field: tool_name + operator: eq + value: write_file + action: deny + priority: 70 + 
message: "Write access is not permitted for this agent" + +defaults: + action: allow + max_tool_calls: 10 +``` + +Five rules, four decision tiers, one default. Enough to test every path. + +### Loading and validating + +```python +from pathlib import Path + +from agent_os.policies.schema import PolicyDocument + +examples_dir = Path("docs/tutorials/policy-as-code/examples") + +policy = PolicyDocument.from_yaml(examples_dir / "06_test_policy.yaml") +print(policy.name) # "test-policy" +print(len(policy.rules)) # 5 +``` + +`PolicyDocument.from_yaml()` does two things: it parses the YAML and validates +it against the schema. If the file is valid, you get a `PolicyDocument` object. +If not, you get a `ValidationError` that tells you exactly what is wrong. + +### A broken policy + +What if someone types `equals` instead of `eq`? + +```python +from pydantic import ValidationError + +broken = { + "version": "1.0", + "name": "broken-policy", + "rules": [{ + "name": "bad-rule", + "condition": { + "field": "tool_name", + "operator": "equals", # wrong — should be "eq" + "value": "send_email", + }, + "action": "deny", + }], +} + +try: + PolicyDocument.model_validate(broken) +except ValidationError as exc: + print(exc.errors()[0]["msg"]) +``` + +### Example output + +``` + 🚫 Validation failed (as expected): + Field: rules -> 0 -> condition -> operator + Problem: Input should be 'eq', 'ne', 'gt', 'lt', 'gte', 'lte', 'in', 'matches' or 'contains' +``` + +The error message tells you the exact path (`rules -> 0 -> condition -> +operator`) and the valid values. You do not need to guess. + +### Using the CLI + +The same validation is available as a command: + +```bash +python -m agent_os.policies.cli validate \ + docs/tutorials/policy-as-code/examples/06_test_policy.yaml +``` + +``` +OK +``` + +Exit code 0 means the file is valid. Exit code 1 means validation failed +(with the error printed to stderr). Exit code 2 means the file could not be +found or parsed. 
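In a CI job it can help to turn those exit codes into readable statuses. The sketch below assumes only the exit-code meanings listed above and the `validate` command shown in this section; `describe_validate_exit` and `run_validate` are hypothetical helper names, not part of the library.

```python
import subprocess
import sys

# Exit-code meanings as described for the validate command in this chapter.
VALIDATE_EXIT_MEANINGS = {
    0: "valid",
    1: "validation failed",
    2: "file not found or not parseable",
}


def describe_validate_exit(code: int) -> str:
    """Translate a validate exit code into a short status string."""
    return VALIDATE_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_validate(policy_path: str) -> str:
    """Run the validate command from this chapter and summarize the result."""
    proc = subprocess.run(
        [sys.executable, "-m", "agent_os.policies.cli", "validate", policy_path],
        capture_output=True,
        text=True,
    )
    return describe_validate_exit(proc.returncode)


print(describe_validate_exit(0))
print(describe_validate_exit(2))
```

Calling `run_validate("docs/tutorials/policy-as-code/examples/06_test_policy.yaml")` requires the package to be installed; the mapping function works on its own.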
+ +--- + +## Step 2: Write test scenarios + +Validation tells you the YAML is *structured correctly*. Test scenarios tell +you the policy *behaves correctly* — that each tool gets the right decision. + +### The scenarios file (`06_test_scenarios.yaml`) + +```yaml +scenarios: + # Always allowed + - name: search-documents-allowed + context: { tool_name: search_documents } + expected_action: allow + + # Always denied (destructive) + - name: delete-database-denied + context: { tool_name: delete_database } + expected_action: deny + + # Escalation-tagged (deny with "requires human approval") + - name: transfer-funds-denied + context: { tool_name: transfer_funds } + expected_action: deny + + - name: send-email-denied + context: { tool_name: send_email } + expected_action: deny + + # Explicit deny + - name: write-file-denied + context: { tool_name: write_file } + expected_action: deny + + # Default action (tool not in any rule) + - name: unknown-tool-uses-default + context: { tool_name: read_logs } + expected_action: allow + + # Same checks using expected_allowed (boolean) + - name: search-documents-is-allowed + context: { tool_name: search_documents } + expected_allowed: true + + - name: delete-database-is-not-allowed + context: { tool_name: delete_database } + expected_allowed: false +``` + +Each scenario names a context and an expected result. You can check either the +action string (`expected_action`) or the boolean (`expected_allowed`). + +### Running with the CLI + +```bash +python -m agent_os.policies.cli test \ + docs/tutorials/policy-as-code/examples/06_test_policy.yaml \ + docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml +``` + +``` +8/8 scenarios passed +``` + +If any scenario fails, the CLI prints which one and what went wrong: + +``` +FAIL: transfer-funds-denied: expected deny, got allow +7/8 scenarios passed +``` + +Exit code 0 means all passed. Exit code 1 means at least one failed. 
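A scenario that declares neither `expected_action` nor `expected_allowed` has nothing to compare against, so it is worth checking the scenarios themselves before running them. The following is a minimal sketch under that assumption; `lint_scenarios` is a hypothetical helper, and whether the real CLI already rejects such scenarios is not stated here.

```python
def lint_scenarios(scenarios):
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    seen_names = set()
    for index, scenario in enumerate(scenarios):
        name = scenario.get("name")
        label = name or f"scenario #{index}"
        if not name:
            problems.append(f"{label}: missing 'name'")
        elif name in seen_names:
            problems.append(f"{label}: duplicate name")
        else:
            seen_names.add(name)
        if not isinstance(scenario.get("context"), dict):
            problems.append(f"{label}: 'context' must be a mapping")
        if "expected_action" not in scenario and "expected_allowed" not in scenario:
            problems.append(f"{label}: no expected_action or expected_allowed")
    return problems


scenarios = [
    {"name": "search-documents-allowed",
     "context": {"tool_name": "search_documents"},
     "expected_action": "allow"},
    {"name": "delete-database-is-not-allowed",
     "context": {"tool_name": "delete_database"},
     "expected_allowed": False},
    # Deliberately incomplete: no expectation declared.
    {"name": "forgot-expectation", "context": {"tool_name": "read_logs"}},
]

for problem in lint_scenarios(scenarios):
    print(problem)
```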
+
+### Running in Python
+
+The CLI is convenient, but sometimes you want the results in Python, for
+example for custom formatting, integration with a CI pipeline, or testing
+multiple policies at once.
+
+```python
+from pathlib import Path
+
+import yaml
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+examples_dir = Path("docs/tutorials/policy-as-code/examples")
+
+policy = PolicyDocument.from_yaml(examples_dir / "06_test_policy.yaml")
+evaluator = PolicyEvaluator(policies=[policy])
+
+with open(examples_dir / "06_test_scenarios.yaml", encoding="utf-8") as f:
+    scenarios = yaml.safe_load(f)["scenarios"]
+
+for scenario in scenarios:
+    decision = evaluator.evaluate(scenario["context"])
+    ok = True
+    # Compare the action string when the scenario declares one.
+    if "expected_action" in scenario:
+        ok = ok and decision.action == scenario["expected_action"]
+    # Compare the boolean when the scenario declares one, so that
+    # expected_allowed scenarios do not pass without being checked.
+    if "expected_allowed" in scenario:
+        ok = ok and decision.allowed == scenario["expected_allowed"]
+    status = "✅ pass" if ok else "❌ FAIL"
+    print(f"{scenario['name']}: {status}")
+```
+
+### Example output
+
+The full example script prints the same results as a table:
+
+```
+  Scenario                            Expected   Actual     Result
+  --------------------------------------------------------------------
+  search-documents-allowed            allow      allow      ✅ pass
+  delete-database-denied              deny       deny       ✅ pass
+  transfer-funds-denied               deny       deny       ✅ pass
+  send-email-denied                   deny       deny       ✅ pass
+  write-file-denied                   deny       deny       ✅ pass
+  unknown-tool-uses-default           allow      allow      ✅ pass
+  search-documents-is-allowed         true       true       ✅ pass
+  delete-database-is-not-allowed      false      false      ✅ pass
+
+  ✅ 8/8 scenarios passed
+```
+
+---
+
+## Step 3: The test matrix
+
+The scenarios in Step 2 test one policy in isolation. But in production,
+**multiple policies apply at the same time**: a role policy from Chapter 2
+and the environment policy from Chapter 4. When both are active, their rules
+merge and interact. A rule from one policy can override a rule from another,
+and the result might not be what anyone intended.
+
+A **test matrix** crosses every role, every environment, and every tool. It
+tests the *combined system*, not individual pieces.
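The "crossing" can be sketched with `itertools.product`. The roles, environments, and tools below are the ones this chapter uses; representing each cell as a `(tool, role, environment)` tuple is an assumption made for illustration.

```python
from itertools import product

roles = ["reader", "admin"]
environments = ["development", "production"]
tools = ["search_documents", "write_file", "send_email",
         "delete_database", "transfer_funds"]

# One cell per (tool, role, environment) combination.
cells = list(product(tools, roles, environments))

print(len(cells))  # 5 tools x 2 roles x 2 environments = 20
print(cells[0])    # ('search_documents', 'reader', 'development')
```

Adding a role or an environment multiplies the number of cells, which is why the matrix is generated rather than written by hand.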
+ +### Building the combined system + +Load the role policies from Chapter 2 and the environment policy from Chapter +4. For each role, combine its policy with the shared environment policy: + +```python +from pathlib import Path + +from agent_os.policies import PolicyEvaluator +from agent_os.policies.schema import PolicyDocument + +examples_dir = Path("docs/tutorials/policy-as-code/examples") + +reader_policy = PolicyDocument.from_yaml(examples_dir / "02_reader_policy.yaml") +admin_policy = PolicyDocument.from_yaml(examples_dir / "02_admin_policy.yaml") +env_policy = PolicyDocument.from_yaml(examples_dir / "04_env_policy.yaml") + +# Each role gets its own policy + the shared environment policy. +# The evaluator merges all rules and sorts by priority. +role_policies = { + "reader": [reader_policy, env_policy], + "admin": [admin_policy, env_policy], +} + +tools = ["search_documents", "write_file", "send_email", + "delete_database", "transfer_funds"] +environments = ["development", "production"] + +for tool in tools: + for role, policies in role_policies.items(): + for env in environments: + evaluator = PolicyEvaluator(policies=list(policies)) + decision = evaluator.evaluate({"tool_name": tool, "environment": env}) + # check against expected ... +``` + +When two policies are loaded into one evaluator, their rules are merged into a +single list sorted by priority. The first rule that matches the context wins. +This is where surprising interactions happen. + +### Example output + +``` + Tool reader/dev reader/prod admin/dev admin/prod + ----------------------------------------------------------------------- + search_documents ✅ allow 🚫 deny ✅ allow 🚫 deny + write_file ✅ allow ⚠️ 🚫 deny ✅ allow 🚫 deny + send_email 🚫 deny 🚫 deny ✅ allow 🚫 deny + delete_database 🚫 deny 🚫 deny 🚫 deny 🚫 deny + transfer_funds ✅ allow 🚫 deny ✅ allow 🚫 deny + + 19/20 cells match expectations. 
1 surprise: + + ⚠️ reader + development + write_file + Expected: deny (reader policy blocks write_file at priority 80) + Actual: allow (environment policy allows development at priority 90) + Reason: Development environment: agents can act freely +``` + +### What just happened? + +The matrix found a real interaction bug. `block-write-file` is priority 80, but +`allow-development` is priority 90, so the environment rule wins first and the +reader is allowed to write files in development. You would not catch that by +reading the YAML files one at a time. + +--- + +## Step 4: Catch a regression + +This is the payoff. Here is a bug that would be nearly invisible to a human +reviewer — but a test catches it instantly. + +### The scenario + +Someone edits the policy and changes the `transfer_funds` rule's message from: + +``` +"Sensitive action: transfer_funds requires human approval" +``` + +to: + +``` +"Sensitive action: transfer_funds is blocked" +``` + +The rule still says `action: deny`. Nothing else changed. A YAML diff shows +one line modified. A human reviewer might glance at it and approve. + +But in the code, the escalation system uses the phrase `"requires human +approval"` in the message to distinguish an escalation from a hard deny +(Chapter 5). Removing that phrase silently converts an escalation — where a +human could approve the transfer — into an unconditional block. + +### What the test shows + +``` + Original policy: transfer_funds → ⏳ escalate (escalate) + Modified policy: transfer_funds → 🚫 deny (deny) + + ❌ Regression detected! + transfer_funds changed from 'escalate' to 'deny'. + The edit removed the escalation keyword, so the action + that used to pause for human review now silently blocks. +``` + +The test compared the *classification* of the decision, not just the raw +action string. Both versions return `action: deny`, but only the original still +means "escalate." 
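The classification step can be sketched as follows. `Decision` is a hypothetical stand-in for the evaluator's decision object, carrying only the `allowed`, `action`, and `reason` fields this tutorial mentions; the keyword check mirrors the convention described above and is not a definitive implementation.

```python
from dataclasses import dataclass

ESCALATION_KEYWORD = "requires human approval"


@dataclass
class Decision:
    """Stand-in for the evaluator's decision object (illustration only)."""
    allowed: bool
    action: str
    reason: str


def classify(decision: Decision) -> str:
    """Tag a decision as allow, escalate, or deny."""
    if decision.allowed:
        return "allow"
    # A deny whose message carries the keyword is treated as an escalation.
    if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
        return "escalate"
    return "deny"


original = Decision(False, "deny",
                    "Sensitive action: transfer_funds requires human approval")
modified = Decision(False, "deny",
                    "Sensitive action: transfer_funds is blocked")

# Both decisions carry action "deny", yet they classify differently.
print(classify(original), "->", classify(modified))  # escalate -> deny
```

Comparing `classify()` results instead of raw `action` strings is what lets the regression show up.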
+ +--- + +## Full example + +```bash +python docs/tutorials/policy-as-code/examples/06_policy_testing.py +``` + +``` +============================================================ + Chapter 6: Policy Testing +============================================================ + +--- Part 1: Validate the structure --- + + ✅ 'test-policy' loaded successfully + 5 rules, default action: allow + + 🚫 Validation failed (as expected): + Field: rules -> 0 -> condition -> operator + Problem: Input should be 'eq', 'ne', 'gt', 'lt', 'gte', 'lte', 'in', 'matches' or 'contains' + + PolicyDocument.from_yaml() catches structural errors + before any rule is evaluated. A typo like 'equals' + instead of 'eq' is caught immediately. + +--- Part 2: Run test scenarios --- + + Scenario Expected Actual Result + -------------------------------------------------------------------- + search-documents-allowed allow allow ✅ pass + delete-database-denied deny deny ✅ pass + transfer-funds-denied deny deny ✅ pass + send-email-denied deny deny ✅ pass + write-file-denied deny deny ✅ pass + unknown-tool-uses-default allow allow ✅ pass + search-documents-is-allowed true true ✅ pass + delete-database-is-not-allowed false false ✅ pass + + ✅ 8/8 scenarios passed + + Each scenario is one line in a YAML file. The test runner + evaluates the policy and compares the actual result to the + expected result. No manual checking required. + +--- Part 3: The test matrix --- + + Loading policies from chapters 2 and 4... + + Tool reader/dev reader/prod admin/dev admin/prod + ----------------------------------------------------------------------- + search_documents ✅ allow 🚫 deny ✅ allow 🚫 deny + write_file ✅ allow ⚠️ 🚫 deny ✅ allow 🚫 deny + send_email 🚫 deny 🚫 deny ✅ allow 🚫 deny + delete_database 🚫 deny 🚫 deny 🚫 deny 🚫 deny + transfer_funds ✅ allow 🚫 deny ✅ allow 🚫 deny + + 19/20 cells match expectations. 
1 surprise(s): + + ⚠️ reader + development + write_file + Expected: deny + Actual: allow (from rule: allow-development) + Reason: Development environment: agents can act freely + + The reader policy blocks write_file at priority 80. + But the environment policy allows development at priority 90. + Priority 90 beats 80 — the environment rule fires first. + Without the test matrix, this interaction is invisible. + +--- Part 4: Catch a regression --- + + Scenario: someone edits the policy and removes the phrase + "requires human approval" from the transfer_funds rule. + The tool silently flips from escalate to hard deny. + + Original policy: transfer_funds → ⏳ escalate (escalate) + Modified policy: transfer_funds → 🚫 deny (deny) + + ❌ Regression detected! + transfer_funds changed from 'escalate' to 'deny'. + The edit removed the escalation keyword, so the action + that used to pause for human review now silently blocks. + + A human scanning the YAML diff might miss this. But a test + scenario that checks for the escalation keyword catches it + instantly. That is the value of automated policy testing: + changes that look harmless cannot silently break behavior. + +============================================================ + Policies are code. Test them like code. + Validate the structure, write expected outcomes, + run them automatically, and catch regressions + before they reach production. +============================================================ +``` + +--- + +## How does it work? + +``` + Role policy Environment policy + (ch2) (ch4) + │ │ + └────────┬───────┘ + ▼ + ┌─────────────────────────────────┐ + │ 1. Validate each file │ + │ PolicyDocument.from_yaml() │ + └──────────┬──────────────────────┘ + ▼ + ┌─────────────────────────────────┐ + │ 2. Test each policy alone │ + │ CLI: policy test │ + └──────────┬──────────────────────┘ + ▼ + ┌─────────────────────────────────┐ + │ 3. 
Test the combined system │ + │ Python: multi-policy eval │ + └──────────┬──────────────────────┘ + │ + ┌──────┴──────┐ + ▼ ▼ + All pass Surprises found + ✅ Deploy ❌ Fix and re-run +``` + +| Tool | What it does | +|------|-------------| +| `PolicyDocument.from_yaml(path)` | Load YAML and validate against Pydantic schema | +| `PolicyDocument.model_validate(dict)` | Validate a Python dict without loading a file | +| `PolicyEvaluator(policies=[...])` | Merge rules from multiple policies | +| `evaluator.evaluate(context)` | Return a `PolicyDecision` with `allowed`, `action`, `reason` | +| `policy validate ` | CLI: validate structure, print OK or FAIL | +| `policy test ` | CLI: run scenarios, print pass count | + +--- + +## Try it yourself + +1. **Fix the surprise.** The test matrix found that `reader + development + + write_file` is unexpectedly allowed. Edit `02_reader_policy.yaml` and + raise `block-write-file`'s priority to 95 (above the environment policy's + 90). Re-run the script — the ⚠️ should disappear. + +2. **Add a staging environment.** The environment policy has rules for + development and production, but not staging. Add `staging` to the + environments list in the test matrix. What happens? Does the default deny + or allow? Add a scenario to verify. + +3. **Extend the matrix.** Create a third policy file for an "operator" role + that can search documents and send emails but cannot write files or delete + databases. Add it to the Python test matrix and verify the results across + all environments. + +--- + +## What's missing? + +Policies change over time. Legal tells you that `write_file` must now be +blocked in production, not just for readers. The policy needs to be updated +from version 1.0 to version 2.0. But how do you make that change without +accidentally breaking something that was already working? 
+ +You need a way to **compare two versions** side by side — see exactly what +changed, run the test suite against *both* versions, and find regressions +before the new version goes live. That is policy versioning. + +**Previous:** [Chapter 5 — Approval Workflows](05-approval-workflows.md) +**Next:** [Chapter 7 — Policy Versioning](07-policy-versioning.md) — compare +v1 vs v2 behavior, catch regressions before deploying. diff --git a/docs/tutorials/policy-as-code/07-policy-versioning.md b/docs/tutorials/policy-as-code/07-policy-versioning.md new file mode 100644 index 00000000..5f63bd19 --- /dev/null +++ b/docs/tutorials/policy-as-code/07-policy-versioning.md @@ -0,0 +1,317 @@ + + + +# Chapter 7: Policy Versioning + +Chapter 6 proved that your policies work *right now*. But policies change. +Legal tells you that `send_email` should be a hard block, not an escalation. +Someone fixes that — and accidentally breaks `transfer_funds` in the same +edit. You need a way to compare two versions, test both, and catch the +regression before the new version goes live. + +**What you'll learn:** + +| Section | Topic | +|---------|-------| +| [Two versions side by side](#step-1-two-versions-side-by-side) | What changed between v1 and v2 | +| [Diff with the CLI](#step-2-diff-with-the-cli) | See every structural change in one command | +| [Test both versions](#step-3-test-both-versions) | Run the same contexts against v1 and v2 | +| [Catch the regression](#step-4-catch-the-regression) | Separate expected changes from accidents | + +--- + +## Step 1: Two versions side by side + +Version 1.0 is the production baseline — the same combined policy from +Chapter 6 with five rules covering all decision tiers. + +Version 2.0 has three changes: + +| # | Change | Intentional? 
| +|---|--------|-------------| +| 1 | `block-write-file` priority raised from 70 to 95 | Yes — fixes the Chapter 6 surprise where the environment policy overrode the block | +| 2 | `escalate-send-email` message no longer says "requires human approval" | Yes — legal decided send_email should be fully blocked | +| 3 | `escalate-transfer-funds` message no longer says "requires human approval" | No — accidental edit, breaks the escalation | + +Changes 1 and 2 are intentional. Change 3 happened because someone edited +both escalation rules instead of just one. The YAML diff looks like a +routine cleanup. The damage is invisible without a behavioral test. + +--- + +## Step 2: Diff with the CLI + +The built-in `diff` command compares two policy files structurally: + +```bash +python -m agent_os.policies.cli diff \ + examples/07_policy_v1.yaml \ + examples/07_policy_v2.yaml +``` + +``` +rule escalate-transfer-funds: message: Sensitive action: transfer_funds requires human approval -> Sensitive action: transfer_funds is blocked +rule escalate-send-email: message: Sensitive action: send_email requires human approval -> Communication: send_email is blocked by policy +rule block-write-file: priority: 70 -> 95 +version: 1.0 -> 2.0 +``` + +Every structural change is listed: two messages changed, one priority +raised, and the version bumped. But the diff does not tell you which change +breaks behavior. For that, you need to run both versions through the same +tests. + +--- + +## Step 3: Test both versions + +Load v1 and v2 into separate evaluators and run the same five tools through +both. 
Use the `classify()` helper from Chapter 6 to tag each result as
+allow, escalate, or deny:
+
+```python
+from pathlib import Path
+
+from agent_os.policies import PolicyEvaluator
+from agent_os.policies.schema import PolicyDocument
+
+examples_dir = Path("docs/tutorials/policy-as-code/examples")
+
+v1 = PolicyDocument.from_yaml(examples_dir / "07_policy_v1.yaml")
+v2 = PolicyDocument.from_yaml(examples_dir / "07_policy_v2.yaml")
+
+eval_v1 = PolicyEvaluator(policies=[v1])
+eval_v2 = PolicyEvaluator(policies=[v2])
+
+ESCALATION_KEYWORD = "requires human approval"
+
+def classify(decision):
+    if decision.allowed:
+        return "allow"
+    if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
+        return "escalate"
+    return "deny"
+
+tools = ["search_documents", "write_file", "send_email",
+         "delete_database", "transfer_funds"]
+
+results = []
+for tool in tools:
+    ctx = {"tool_name": tool}
+    t1 = classify(eval_v1.evaluate(ctx))
+    t2 = classify(eval_v2.evaluate(ctx))
+    changed = t1 != t2
+    # Keep each row so Step 4 can separate expected changes from regressions.
+    results.append((tool, t1, t2, changed))
+    marker = "⚠️" if changed else ""
+    print(f"{tool:<22s} {t1:<12s} {t2:<12s} {marker}")
+```
+
+### Example output
+
+```
+  Tool                   v1            v2            Changed?
+  ----------------------------------------------------------
+  search_documents       ✅ allow      ✅ allow
+  write_file             🚫 deny       🚫 deny
+  send_email             ⏳ escalate   🚫 deny       ⚠️ yes
+  delete_database        🚫 deny       🚫 deny
+  transfer_funds         ⏳ escalate   🚫 deny       ⚠️ yes
+
+  2 tool(s) changed behavior between versions.
+```
+
+Two tools changed: `send_email` and `transfer_funds`. Both went from
+escalate to deny. The structural diff showed three rule changes, but the
+behavioral test shows that only two of them alter behavior here. The
+`write_file` priority change does not affect single-policy evaluation; it
+matters only when the policy is combined with the environment policy, which
+is what the Chapter 6 test matrix would catch.
+
+---
+
+## Step 4: Catch the regression
+
+The team planned one behavioral change: `send_email` should become a hard
+deny. Anything else that changed is a regression.
+ +```python +expected_changes = {"send_email"} + +for tool, tier1, tier2, changed in results: + if not changed: + continue + if tool in expected_changes: + print(f"✅ {tool}: {tier1} → {tier2} (expected)") + else: + print(f"❌ {tool}: {tier1} → {tier2} (REGRESSION)") +``` + +``` + ✅ send_email: escalate → deny (expected — legal decision) + ❌ transfer_funds: escalate → deny (REGRESSION) + + ❌ Regression: transfer_funds + Was 'escalate' in v1, now 'deny' in v2. + The v2 edit removed the escalation keyword from the + message, so the action that used to pause for human + review now silently blocks. + + Fix the regression in v2, then re-run this comparison. + Do not deploy until all changes are expected. +``` + +The regression is the same type Chapter 6 caught in Part 4 — removing +`"requires human approval"` silently converts an escalation into a hard +deny. But this time, the test compares *two versions* instead of checking +one version in isolation. That is what makes it a versioning check: you can +see exactly when the behavior changed and which edit caused it. + +--- + +## Full example + +```bash +python docs/tutorials/policy-as-code/examples/07_policy_versioning.py +``` + +``` +============================================================ + Chapter 7: Policy Versioning +============================================================ + +--- Part 1: Load both versions --- + + v1: 'production-policy' version 1.0 (5 rules) + v2: 'production-policy' version 2.0 (5 rules) + +--- Part 2: Diff the two versions --- + + version: 1.0 → 2.0 + rule escalate-transfer-funds: message changed + was: "Sensitive action: transfer_funds requires human approval" + now: "Sensitive action: transfer_funds is blocked" + rule escalate-send-email: message changed + was: "Sensitive action: send_email requires human approval" + now: "Communication: send_email is blocked by policy" + rule block-write-file: priority 70 → 95 + + The diff lists every structural change. 
But a diff cannot + tell you whether a change is safe. You need to test both + versions and compare the results. + +--- Part 3: Test both versions --- + + Tool v1 v2 Changed? + ---------------------------------------------------------- + search_documents ✅ allow ✅ allow + write_file 🚫 deny 🚫 deny + send_email ⏳ escalate 🚫 deny ⚠️ yes + delete_database 🚫 deny 🚫 deny + transfer_funds ⏳ escalate 🚫 deny ⚠️ yes + + 2 tool(s) changed behavior between versions. + +--- Part 4: Detect regressions --- + + ✅ send_email: escalate → deny (expected — legal decision) + ❌ transfer_funds: escalate → deny (REGRESSION) + + ❌ Regression: transfer_funds + Was 'escalate' in v1, now 'deny' in v2. + The v2 edit removed the escalation keyword from the + message, so the action that used to pause for human + review now silently blocks. + + Fix the regression in v2, then re-run this comparison. + Do not deploy until all changes are expected. + +============================================================ + Policy versioning closes the loop. + Tag a version, diff it, test both, catch regressions. + No policy update ships without passing this check. +============================================================ +``` + +--- + +## How does it work? + +``` + v1.yaml v2.yaml + │ │ + └────────┬───────┘ + ▼ + ┌───────────────────────────┐ + │ 1. Diff │ + │ CLI: policy diff │ + │ List structural diffs │ + └──────────┬────────────────┘ + ▼ + ┌───────────────────────────┐ + │ 2. Test both │ + │ Same contexts, same │ + │ classify() function │ + └──────────┬────────────────┘ + │ + ┌──────┴──────┐ + ▼ ▼ + No changes Changes found + ✅ Safe to ↓ + deploy ┌──────────────┐ + │ 3. 
Classify │ + │ Expected vs │ + │ Regression │ + └──────┬───────┘ + │ + ┌──────┴──────┐ + ▼ ▼ + Expected Regression + ✅ Deploy ❌ Fix first +``` + +| Tool | What it does | +|------|-------------| +| `policy diff v1.yaml v2.yaml` | CLI: structural diff between two policy files | +| `PolicyDocument.from_yaml(path)` | Load and validate a policy file | +| `PolicyEvaluator(policies=[doc])` | Create an evaluator from a PolicyDocument | +| `evaluator.evaluate(context)` | Return a `PolicyDecision` with `allowed`, `action`, `reason` | +| `classify(decision)` | Tag a decision as allow, escalate, or deny (from Chapter 6) | + +--- + +## Try it yourself + +1. **Add a new rule in v2.** Create a rule `block-execute-code` that denies + `execute_code` in v2 only. Re-run the diff — it should show "rule + added." Test both versions to confirm the new rule only affects v2, and + add it to `expected_changes` so it does not flag as a regression. + +2. **Bridge conversion.** Import `governance_to_document` from + `agent_os.policies.bridge` and convert a `GovernancePolicy` object + into a `PolicyDocument`. Diff the result against v1 to see how the + legacy format maps to the declarative format. + +3. **Automate the gate.** Write a function `is_safe_to_deploy(v1_path, + v2_path, expected)` that loads both files, diffs them, tests both, + and returns `True` only if every behavioral change is in the + `expected` set. This is a deploy gate — run it in CI before any policy + update ships. + +--- + +## What you've built + +Over seven chapters, you built a complete policy governance system: + +| Chapter | Layer | +|---------|-------| +| 1 | Block dangerous tools | +| 2 | Scope permissions by role | +| 3 | Rate-limit actions | +| 4 | Resolve conflicts between policies | +| 5 | Escalate sensitive actions to humans | +| 6 | Test policies automatically | +| 7 | Update policies safely with regression detection | + +Each layer added one concept. 
Together, they form a system that can +govern AI agents in production: who can do what, how often, who approves, +how you test it, and how you update it without breaking what already works. + +**Previous:** [Chapter 6 — Policy Testing](06-policy-testing.md) diff --git a/docs/tutorials/policy-as-code/README.md b/docs/tutorials/policy-as-code/README.md index 68fef7c3..9f34579a 100644 --- a/docs/tutorials/policy-as-code/README.md +++ b/docs/tutorials/policy-as-code/README.md @@ -20,10 +20,8 @@ pip install agent-os-kernel[full] | [03 — Rate Limiting](03-rate-limiting.md) | Preventing runaway agents | Set limits on how many actions an agent can take | | [04 — Conditional Policies](04-conditional-policies.md) | Policy composition and conflict resolution | Layer base + environment policies with conflict strategies | | [05 — Approval Workflows](05-approval-workflows.md) | Human-in-the-loop for sensitive actions | Route dangerous actions to a human before execution | -| 06 — Policy Testing | Systematic validation with test matrices | Test every role + action + environment combination | -| 07 — Policy Versioning | Safe rollout of policy changes | Compare v1 vs v2 behavior, catch regressions before deploying | - -> Chapters 06–07 are coming soon. +| [06 — Policy Testing](06-policy-testing.md) | Systematic validation with test matrices | Test every role + action + environment combination | +| [07 — Policy Versioning](07-policy-versioning.md) | Safe rollout of policy changes | Compare v1 vs v2 behavior, catch regressions before deploying | ## Running Examples diff --git a/docs/tutorials/policy-as-code/examples/01_first_policy.yaml b/docs/tutorials/policy-as-code/examples/01_first_policy.yaml index a970afa4..70a9e86b 100644 --- a/docs/tutorials/policy-as-code/examples/01_first_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/01_first_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
+ version: "1.0" name: my-first-policy description: A simple policy that blocks dangerous agent actions diff --git a/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml b/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml index eb2598af..6a53a1c3 100644 --- a/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/02_admin_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + version: "1.0" name: admin-policy description: Permissive policy for admin agents — only the most dangerous actions are blocked diff --git a/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml b/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml index 3b742e91..92f4c648 100644 --- a/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/02_reader_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + version: "1.0" name: reader-policy description: Restrictive policy for read-only agents diff --git a/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml b/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml index 71555815..f8d30b09 100644 --- a/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/03_rate_limit_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + version: "1.0" name: rate-limit-policy description: Policy that limits how many tool calls an agent can make diff --git a/docs/tutorials/policy-as-code/examples/04_env_policy.yaml b/docs/tutorials/policy-as-code/examples/04_env_policy.yaml index d3c41812..4430c633 100644 --- a/docs/tutorials/policy-as-code/examples/04_env_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/04_env_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. 
+# Licensed under the MIT License. + version: "1.0" name: environment-policy description: Rules that change based on the deployment environment diff --git a/docs/tutorials/policy-as-code/examples/04_global_policy.yaml b/docs/tutorials/policy-as-code/examples/04_global_policy.yaml index f638dde2..137d5c17 100644 --- a/docs/tutorials/policy-as-code/examples/04_global_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/04_global_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + version: "1.0" name: global-security-policy description: Company-wide rules set by the security team diff --git a/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml b/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml index 10c3bb22..07aa0fba 100644 --- a/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml +++ b/docs/tutorials/policy-as-code/examples/04_support_team_policy.yaml @@ -1,3 +1,6 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + version: "1.0" name: support-team-policy description: Rules for the customer support team's agent diff --git a/docs/tutorials/policy-as-code/examples/06_policy_testing.py b/docs/tutorials/policy-as-code/examples/06_policy_testing.py new file mode 100644 index 00000000..d4091bee --- /dev/null +++ b/docs/tutorials/policy-as-code/examples/06_policy_testing.py @@ -0,0 +1,308 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +""" +Chapter 6: Policy Testing — Automated Validation and Test Scenarios + +Shows how to validate policy structure, run declarative test scenarios, +build a role-by-tool test matrix, and catch regressions automatically. 
+ +Run from the repo root: + pip install agent-os-kernel[full] + python docs/tutorials/policy-as-code/examples/06_policy_testing.py +""" + +from __future__ import annotations + +import copy +import sys +from pathlib import Path + +import yaml + +# Allow running from the repo root without installing the packages. +_REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent.parent +sys.path.insert(0, str(_REPO_ROOT / "packages" / "agent-os" / "src")) + +from pydantic import ValidationError + +from agent_os.policies import PolicyEvaluator +from agent_os.policies.schema import PolicyDocument + +EXAMPLES_DIR = Path(__file__).parent + +ESCALATION_KEYWORD = "requires human approval" + + +def classify(decision): + """Classify a PolicyDecision into allow / escalate / deny.""" + if decision.allowed: + return ("allow", "\u2705 allow ") + if decision.reason and ESCALATION_KEYWORD in decision.reason.lower(): + return ("escalate", "\u23f3 escalate") + return ("deny", "\U0001f6ab deny ") + + +# ── Part 1: Validate the structure ──────────────────────────────────── + +print("=" * 60) +print(" Chapter 6: Policy Testing") +print("=" * 60) + +print("\n--- Part 1: Validate the structure ---\n") + +# 1a — Load a valid policy +policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "06_test_policy.yaml") +print(f" \u2705 '{policy.name}' loaded successfully") +print(f" {len(policy.rules)} rules, default action: {policy.defaults.action.value}") + +# 1b — Try to validate a broken policy +print() +broken_data = { + "version": "1.0", + "name": "broken-policy", + "rules": [ + { + "name": "bad-rule", + "condition": { + "field": "tool_name", + "operator": "equals", # wrong — should be "eq" + "value": "send_email", + }, + "action": "deny", + } + ], +} + +try: + PolicyDocument.model_validate(broken_data) + print(" Unexpected: broken policy passed validation") +except ValidationError as exc: + # Show only the first error for readability + first_error = exc.errors()[0] + print(f" \U0001f6ab 
Validation failed (as expected):") + print(f" Field: {' -> '.join(str(p) for p in first_error['loc'])}") + print(f" Problem: {first_error['msg']}") + +print() +print(" PolicyDocument.from_yaml() catches structural errors") +print(" before any rule is evaluated. A typo like 'equals'") +print(" instead of 'eq' is caught immediately.") + +# ── Part 2: Run test scenarios ──────────────────────────────────────── + +print("\n--- Part 2: Run test scenarios ---\n") + +# Load the scenarios file +scenarios_path = EXAMPLES_DIR / "06_test_scenarios.yaml" +with open(scenarios_path) as f: + scenarios_data = yaml.safe_load(f) + +scenarios = scenarios_data["scenarios"] +evaluator = PolicyEvaluator(policies=[policy]) + +passed = 0 +failed = 0 + +print(f" {'Scenario':<32s} {'Expected':<10s} {'Actual':<10s} Result") +print(f" {'-' * 68}") + +for scenario in scenarios: + name = scenario["name"] + context = scenario.get("context", {}) + expected_action = scenario.get("expected_action") + expected_allowed = scenario.get("expected_allowed") + + decision = evaluator.evaluate(context) + actual_action = decision.action + actual_allowed = decision.allowed + + ok = True + if expected_action is not None and actual_action != expected_action: + ok = False + if expected_allowed is not None and actual_allowed != expected_allowed: + ok = False + + # For display, show whichever field the scenario tested + expected_display = expected_action if expected_action is not None else str(expected_allowed).lower() + actual_display = actual_action if expected_action is not None else str(actual_allowed).lower() + + status = "\u2705 pass" if ok else "\u274c FAIL" + print(f" {name:<32s} {expected_display:<10s} {actual_display:<10s} {status}") + + if ok: + passed += 1 + else: + failed += 1 + +total = passed + failed +print() +if failed == 0: + print(f" \u2705 {passed}/{total} scenarios passed") +else: + print(f" \u274c {passed}/{total} scenarios passed, {failed} failed") + +print() +print(" Each scenario is one 
line in a YAML file. The test runner") +print(" evaluates the policy and compares the actual result to the") +print(" expected result. No manual checking required.") + +# ── Part 3: The test matrix ────────────────────────────────────────── + +print("\n--- Part 3: The test matrix ---\n") + +print(" Loading policies from chapters 2 and 4...") + +# Role policies from Chapter 2 +reader_policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "02_reader_policy.yaml") +admin_policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "02_admin_policy.yaml") + +# Environment policy from Chapter 4 +env_policy = PolicyDocument.from_yaml(EXAMPLES_DIR / "04_env_policy.yaml") + +# Combine: each role gets its own policy + the shared environment policy. +# The evaluator merges all rules and sorts by priority — the first +# matching rule wins. This is where surprising interactions happen. +role_policies = { + "reader": [reader_policy, env_policy], + "admin": [admin_policy, env_policy], +} + +environments = ["development", "production"] +tools = [ + "search_documents", + "write_file", + "send_email", + "delete_database", + "transfer_funds", +] + +# What the team intends — the "answer key": +# Reader: cannot write_file, send_email, delete_database (from ch2) +# Admin: cannot delete_database (from ch2) +# Production: everything blocked (from ch4) +# Development: role-based rules apply +intended = { + ("reader", "development", "search_documents"): True, + ("reader", "development", "write_file"): False, # ch2 blocks it + ("reader", "development", "send_email"): False, + ("reader", "development", "delete_database"): False, + ("reader", "development", "transfer_funds"): True, + ("reader", "production", "search_documents"): False, + ("reader", "production", "write_file"): False, + ("reader", "production", "send_email"): False, + ("reader", "production", "delete_database"): False, + ("reader", "production", "transfer_funds"): False, + ("admin", "development", "search_documents"): True, + ("admin", 
"development", "write_file"): True, + ("admin", "development", "send_email"): True, + ("admin", "development", "delete_database"): False, + ("admin", "development", "transfer_funds"): True, + ("admin", "production", "search_documents"): False, + ("admin", "production", "write_file"): False, + ("admin", "production", "send_email"): False, + ("admin", "production", "delete_database"): False, + ("admin", "production", "transfer_funds"): False, +} + +# Print the matrix header +print() +print(f" {'Tool':<22s}", end="") +for role in role_policies: + for env in environments: + short_env = "dev" if env == "development" else "prod" + label = f"{role}/{short_env}" + print(f" {label:<13s}", end="") +print() +print(f" {'-' * 74}") + +matrix_pass = 0 +matrix_total = 0 +surprises = [] + +for tool in tools: + print(f" {tool:<22s}", end="") + for role, policies in role_policies.items(): + for env in environments: + evaluator = PolicyEvaluator(policies=list(policies)) + decision = evaluator.evaluate({"tool_name": tool, "environment": env}) + icon = "\u2705 allow " if decision.allowed else "\U0001f6ab deny " + + exp = intended.get((role, env, tool)) + matrix_total += 1 + if exp is not None and decision.allowed == exp: + matrix_pass += 1 + print(f" {icon} ", end="") + else: + surprises.append((role, env, tool, exp, decision)) + print(f" {icon} \u26a0\ufe0f ", end="") + print() + +print() +if surprises: + print(f" {matrix_pass}/{matrix_total} cells match expectations. 
" + f"{len(surprises)} surprise(s):\n") + for role, env, tool, exp, decision in surprises: + exp_label = "deny" if not exp else "allow" + act_label = "allow" if decision.allowed else "deny" + print(f" \u26a0\ufe0f {role} + {env} + {tool}") + print(f" Expected: {exp_label}") + print(f" Actual: {act_label} (from rule: {decision.matched_rule or 'default'})") + print(f" Reason: {decision.reason}") + print() + print(" The reader policy blocks write_file at priority 80.") + print(" But the environment policy allows development at priority 90.") + print(" Priority 90 beats 80 \u2014 the environment rule fires first.") + print(" Without the test matrix, this interaction is invisible.") +else: + print(f" \u2705 {matrix_pass}/{matrix_total} cells match expectations") + +# ── Part 4: Catch a regression ──────────────────────────────────────── + +print("\n--- Part 4: Catch a regression ---\n") + +print(" Scenario: someone edits the policy and removes the phrase") +print(' "requires human approval" from the transfer_funds rule.') +print(" The tool silently flips from escalate to hard deny.") +print() + +# Deep-copy the policy and modify the message +modified_policy = copy.deepcopy(policy) +for rule in modified_policy.rules: + if rule.name == "escalate-transfer-funds": + rule.message = "Sensitive action: transfer_funds is blocked" + break + +# Evaluate transfer_funds with the original and modified policies +original_eval = PolicyEvaluator(policies=[policy]) +modified_eval = PolicyEvaluator(policies=[modified_policy]) + +orig_decision = original_eval.evaluate({"tool_name": "transfer_funds"}) +mod_decision = modified_eval.evaluate({"tool_name": "transfer_funds"}) + +orig_tier, orig_icon = classify(orig_decision) +mod_tier, mod_icon = classify(mod_decision) + +print(f" Original policy: transfer_funds \u2192 {orig_icon} ({orig_tier})") +print(f" Modified policy: transfer_funds \u2192 {mod_icon} ({mod_tier})") +print() + +if orig_tier != mod_tier: + print(f" \u274c Regression 
detected!") + print(f" transfer_funds changed from '{orig_tier}' to '{mod_tier}'.") + print(f" The edit removed the escalation keyword, so the action") + print(f" that used to pause for human review now silently blocks.") +else: + print(" No regression (tiers match).") + +print() +print(" A human scanning the YAML diff might miss this. But a test") +print(" scenario that checks for the escalation keyword catches it") +print(" instantly. That is the value of automated policy testing:") +print(" changes that look harmless cannot silently break behavior.") + +print("\n" + "=" * 60) +print(" Policies are code. Test them like code.") +print(" Validate the structure, write expected outcomes,") +print(" run them automatically, and catch regressions") +print(" before they reach production.") +print("=" * 60) diff --git a/docs/tutorials/policy-as-code/examples/06_test_policy.yaml b/docs/tutorials/policy-as-code/examples/06_test_policy.yaml new file mode 100644 index 00000000..f419340f --- /dev/null +++ b/docs/tutorials/policy-as-code/examples/06_test_policy.yaml @@ -0,0 +1,73 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# +# Chapter 6: Policy Testing — a combined policy for automated testing. +# +# This policy merges the governance concepts from Chapters 1-5 into a +# single file so that test scenarios can verify every decision tier: +# - Always allowed (search_documents) +# - Always denied (delete_database) +# - Escalation (transfer_funds, send_email) +# - Explicit deny (write_file) +# - Default allow (anything not listed) + +version: "1.0" +name: test-policy +description: > + Combined policy for automated testing. Covers allow, deny, + escalation-tagged deny, and default-allow so that test scenarios + can verify every decision path in one pass. 
+ +rules: + # Tier 1: Always denied — irreversibly destructive + - name: block-delete-database + condition: + field: tool_name + operator: eq + value: delete_database + action: deny + priority: 100 + message: "Destructive action: deleting databases is never allowed" + + # Tier 2: Escalation — needs human review + - name: escalate-transfer-funds + condition: + field: tool_name + operator: eq + value: transfer_funds + action: deny + priority: 90 + message: "Sensitive action: transfer_funds requires human approval" + + - name: escalate-send-email + condition: + field: tool_name + operator: eq + value: send_email + action: deny + priority: 85 + message: "Sensitive action: send_email requires human approval" + + # Tier 3: Always allowed — safe, read-only actions + - name: allow-search-documents + condition: + field: tool_name + operator: eq + value: search_documents + action: allow + priority: 80 + message: "Safe action: searching documents is always allowed" + + # Tier 4: Explicit deny — not needed by this agent + - name: block-write-file + condition: + field: tool_name + operator: eq + value: write_file + action: deny + priority: 70 + message: "Write access is not permitted for this agent" + +defaults: + action: allow + max_tool_calls: 10 diff --git a/docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml b/docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml new file mode 100644 index 00000000..940c8d45 --- /dev/null +++ b/docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml @@ -0,0 +1,47 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# +# Chapter 6: Policy Testing — declarative test scenarios. +# +# Each scenario describes one tool call and the expected outcome. 
+# Run with: +# python -m agent_os.policies.cli test 06_test_policy.yaml 06_test_scenarios.yaml + +scenarios: + # ── Always allowed ────────────────────────────────────────────────── + - name: search-documents-allowed + context: { tool_name: search_documents } + expected_action: allow + + # ── Always denied (destructive) ───────────────────────────────────── + - name: delete-database-denied + context: { tool_name: delete_database } + expected_action: deny + + # ── Escalation-tagged (deny with "requires human approval") ───────── + - name: transfer-funds-denied + context: { tool_name: transfer_funds } + expected_action: deny + + - name: send-email-denied + context: { tool_name: send_email } + expected_action: deny + + # ── Explicit deny (not needed by this agent) ──────────────────────── + - name: write-file-denied + context: { tool_name: write_file } + expected_action: deny + + # ── Default action (tool not mentioned in any rule) ───────────────── + - name: unknown-tool-uses-default + context: { tool_name: read_logs } + expected_action: allow + + # ── Same checks using expected_allowed (boolean) ──────────────────── + - name: search-documents-is-allowed + context: { tool_name: search_documents } + expected_allowed: true + + - name: delete-database-is-not-allowed + context: { tool_name: delete_database } + expected_allowed: false diff --git a/docs/tutorials/policy-as-code/examples/07_policy_v1.yaml b/docs/tutorials/policy-as-code/examples/07_policy_v1.yaml new file mode 100644 index 00000000..84a0e56b --- /dev/null +++ b/docs/tutorials/policy-as-code/examples/07_policy_v1.yaml @@ -0,0 +1,67 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# +# Chapter 7: Policy Versioning — version 1.0 baseline. +# +# This is the same combined policy from Chapter 6, representing +# the current production state before any updates. + +version: "1.0" +name: production-policy +description: > + Company-wide production policy — version 1.0 baseline. 
+ Covers allow, deny, and escalation tiers for all five tools. + +rules: + # Tier 1: Always denied — irreversibly destructive + - name: block-delete-database + condition: + field: tool_name + operator: eq + value: delete_database + action: deny + priority: 100 + message: "Destructive action: deleting databases is never allowed" + + # Tier 2: Escalation — needs human review + - name: escalate-transfer-funds + condition: + field: tool_name + operator: eq + value: transfer_funds + action: deny + priority: 90 + message: "Sensitive action: transfer_funds requires human approval" + + - name: escalate-send-email + condition: + field: tool_name + operator: eq + value: send_email + action: deny + priority: 85 + message: "Sensitive action: send_email requires human approval" + + # Tier 3: Always allowed — safe, read-only actions + - name: allow-search-documents + condition: + field: tool_name + operator: eq + value: search_documents + action: allow + priority: 80 + message: "Safe action: searching documents is always allowed" + + # Tier 4: Explicit deny — restricted access + - name: block-write-file + condition: + field: tool_name + operator: eq + value: write_file + action: deny + priority: 70 + message: "Write access is not permitted for this agent" + +defaults: + action: allow + max_tool_calls: 10 diff --git a/docs/tutorials/policy-as-code/examples/07_policy_v2.yaml b/docs/tutorials/policy-as-code/examples/07_policy_v2.yaml new file mode 100644 index 00000000..00ace215 --- /dev/null +++ b/docs/tutorials/policy-as-code/examples/07_policy_v2.yaml @@ -0,0 +1,75 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# +# Chapter 7: Policy Versioning — version 2.0 update. +# +# Changes from v1: +# 1. Version bumped from 1.0 to 2.0 +# 2. block-write-file priority raised from 70 to 95 +# (fixes the Chapter 6 surprise where environment policy +# at priority 90 overrode the block at priority 70) +# 3. 
escalate-send-email converted to hard deny — legal decided +# send_email should be fully blocked, not escalated +# 4. escalate-transfer-funds message accidentally edited +# (removes "requires human approval" — introduces a regression) + +version: "2.0" +name: production-policy +description: > + Company-wide production policy — version 2.0 update. + Blocks send_email outright, fixes write_file priority. + +rules: + # Tier 1: Always denied — irreversibly destructive + - name: block-delete-database + condition: + field: tool_name + operator: eq + value: delete_database + action: deny + priority: 100 + message: "Destructive action: deleting databases is never allowed" + + # Tier 2: Escalation — needs human review + - name: escalate-transfer-funds + condition: + field: tool_name + operator: eq + value: transfer_funds + action: deny + priority: 90 + message: "Sensitive action: transfer_funds is blocked" + + # Changed: send_email is now a hard deny, no longer escalated + - name: escalate-send-email + condition: + field: tool_name + operator: eq + value: send_email + action: deny + priority: 85 + message: "Communication: send_email is blocked by policy" + + # Tier 3: Always allowed — safe, read-only actions + - name: allow-search-documents + condition: + field: tool_name + operator: eq + value: search_documents + action: allow + priority: 80 + message: "Safe action: searching documents is always allowed" + + # Tier 4: Explicit deny — priority raised to beat environment rules + - name: block-write-file + condition: + field: tool_name + operator: eq + value: write_file + action: deny + priority: 95 + message: "Write access is not permitted for this agent" + +defaults: + action: allow + max_tool_calls: 10 diff --git a/docs/tutorials/policy-as-code/examples/07_policy_versioning.py b/docs/tutorials/policy-as-code/examples/07_policy_versioning.py new file mode 100644 index 00000000..8526de89 --- /dev/null +++ b/docs/tutorials/policy-as-code/examples/07_policy_versioning.py @@ 
-0,0 +1,196 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +""" +Chapter 7: Policy Versioning — Compare, Test, and Catch Regressions + +Shows how to diff two policy versions, test both with the same contexts, +and detect regressions before deploying the new version. + +Run from the repo root: + pip install agent-os-kernel[full] + python docs/tutorials/policy-as-code/examples/07_policy_versioning.py +""" + +from __future__ import annotations + +import sys +from pathlib import Path + +# Allow running from the repo root without installing the packages. +_REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent.parent +sys.path.insert(0, str(_REPO_ROOT / "packages" / "agent-os" / "src")) + +from agent_os.policies import PolicyEvaluator +from agent_os.policies.schema import PolicyDocument + +EXAMPLES_DIR = Path(__file__).parent + +ESCALATION_KEYWORD = "requires human approval" + + +def classify(decision): + """Classify a PolicyDecision into allow / escalate / deny.""" + if decision.allowed: + return ("allow", "\u2705 allow ") + if decision.reason and ESCALATION_KEYWORD in decision.reason.lower(): + return ("escalate", "\u23f3 escalate") + return ("deny", "\U0001f6ab deny ") + + +def diff_rules(v1_doc, v2_doc): + """Compare two PolicyDocuments rule-by-rule. 
Return a list of change strings.""" + diffs = [] + + # Top-level fields + if v1_doc.version != v2_doc.version: + diffs.append(f"version: {v1_doc.version} \u2192 {v2_doc.version}") + + # Index rules by name + v1_rules = {r.name: r for r in v1_doc.rules} + v2_rules = {r.name: r for r in v2_doc.rules} + + for name in v2_rules: + if name not in v1_rules: + diffs.append(f"rule added: {name}") + + for name in v1_rules: + if name not in v2_rules: + diffs.append(f"rule removed: {name}") + + for name in v1_rules: + if name not in v2_rules: + continue + r1 = v1_rules[name] + r2 = v2_rules[name] + if r1.priority != r2.priority: + diffs.append(f"rule {name}: priority {r1.priority} \u2192 {r2.priority}") + if r1.message != r2.message: + diffs.append(f"rule {name}: message changed") + diffs.append(f" was: \"{r1.message}\"") + diffs.append(f" now: \"{r2.message}\"") + if r1.action != r2.action: + diffs.append(f"rule {name}: action {r1.action.value} \u2192 {r2.action.value}") + + # Defaults + if v1_doc.defaults.action != v2_doc.defaults.action: + diffs.append(f"defaults: action {v1_doc.defaults.action.value} \u2192 {v2_doc.defaults.action.value}") + if v1_doc.defaults.max_tool_calls != v2_doc.defaults.max_tool_calls: + diffs.append(f"defaults: max_tool_calls {v1_doc.defaults.max_tool_calls} \u2192 {v2_doc.defaults.max_tool_calls}") + + return diffs + + +# ── Part 1: Load both versions ──────────────────────────────────────── + +print("=" * 60) +print(" Chapter 7: Policy Versioning") +print("=" * 60) + +print("\n--- Part 1: Load both versions ---\n") + +v1 = PolicyDocument.from_yaml(EXAMPLES_DIR / "07_policy_v1.yaml") +v2 = PolicyDocument.from_yaml(EXAMPLES_DIR / "07_policy_v2.yaml") + +print(f" v1: '{v1.name}' version {v1.version} ({len(v1.rules)} rules)") +print(f" v2: '{v2.name}' version {v2.version} ({len(v2.rules)} rules)") + +# ── Part 2: Diff ────────────────────────────────────────────────────── + +print("\n--- Part 2: Diff the two versions ---\n") + +changes = 
diff_rules(v1, v2) + +if not changes: + print(" No differences found.") +else: + for change in changes: + print(f" {change}") + +print() +print(" The diff lists every structural change. But a diff cannot") +print(" tell you whether a change is safe. You need to test both") +print(" versions and compare the results.") + +# ── Part 3: Test both versions ──────────────────────────────────────── + +print("\n--- Part 3: Test both versions ---\n") + +eval_v1 = PolicyEvaluator(policies=[v1]) +eval_v2 = PolicyEvaluator(policies=[v2]) + +tools = [ + "search_documents", + "write_file", + "send_email", + "delete_database", + "transfer_funds", +] + +print(f" {'Tool':<22s} {'v1':<14s} {'v2':<14s} Changed?") +print(f" {'-' * 58}") + +results = [] + +for tool in tools: + context = {"tool_name": tool} + + d1 = eval_v1.evaluate(context) + d2 = eval_v2.evaluate(context) + + tier1, icon1 = classify(d1) + tier2, icon2 = classify(d2) + + changed = tier1 != tier2 + flag = "\u26a0\ufe0f yes" if changed else "" + + print(f" {tool:<22s} {icon1:<14s} {icon2:<14s} {flag}") + results.append((tool, tier1, tier2, changed)) + +changed_count = sum(1 for _, _, _, c in results if c) +print() +if changed_count == 0: + print(" \u2705 No behavioral changes between v1 and v2.") +else: + print(f" {changed_count} tool(s) changed behavior between versions.") + +# ── Part 4: Detect regressions ──────────────────────────────────────── + +print("\n--- Part 4: Detect regressions ---\n") + +# The team planned two changes in v2: +# - block-write-file priority raised (structural, no behavioral change here) +# - send_email converted from escalation to hard deny (legal decision) +# Anything else that changed is a regression. 
+expected_changes = {"send_email"} + +regressions = [] + +for tool, tier1, tier2, changed in results: + if not changed: + continue + if tool in expected_changes: + print(f" \u2705 {tool}: {tier1} \u2192 {tier2} (expected \u2014 legal decision)") + else: + print(f" \u274c {tool}: {tier1} \u2192 {tier2} (REGRESSION)") + regressions.append((tool, tier1, tier2)) + +if not regressions: + print() + print(" \u2705 All changes are expected. Safe to deploy v2.") +else: + print() + for tool, old, new in regressions: + print(f" \u274c Regression: {tool}") + print(f" Was '{old}' in v1, now '{new}' in v2.") + print(" The v2 edit removed the escalation keyword from the") + print(" message, so the action that used to pause for human") + print(" review now silently blocks.") + print() + print(" Fix the regression in v2, then re-run this comparison.") + print(" Do not deploy until all changes are expected.") + +print("\n" + "=" * 60) +print(" Policy versioning closes the loop.") +print(" Tag a version, diff it, test both, catch regressions.") +print(" No policy update ships without passing this check.") +print("=" * 60)
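A note on exercise 3 in Chapter 7 ("Automate the gate"): the deploy gate can be sketched without the library by comparing decision tiers across two versions. The block below is an illustration under stated assumptions, not the tutorial's implementation. `Decision`, `make_evaluator`, and `find_regressions` are hypothetical stand-ins, and `is_safe_to_deploy` here takes evaluator callables instead of the file paths named in the exercise, so that the sketch runs without `agent_os` installed. Only the `classify()` logic and the deny messages are taken from the example files in this diff.

```python
"""Sketch of a policy deploy gate. Stand-in types only; agent_os is not imported."""
from dataclasses import dataclass

ESCALATION_KEYWORD = "requires human approval"  # same keyword the tutorial scripts use


@dataclass
class Decision:
    """Hypothetical stand-in exposing the two fields classify() reads."""
    allowed: bool
    reason: str = ""


def classify(decision):
    """Map a decision to a tier name, mirroring classify() in the examples."""
    if decision.allowed:
        return "allow"
    if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
        return "escalate"
    return "deny"


def make_evaluator(deny_messages):
    """Hypothetical evaluator: deny tools listed in deny_messages, allow the rest."""
    def evaluate(context):
        message = deny_messages.get(context["tool_name"])
        if message is None:
            return Decision(allowed=True)
        return Decision(allowed=False, reason=message)
    return evaluate


def find_regressions(evaluate_v1, evaluate_v2, tools, expected_changes):
    """Return (tool, old_tier, new_tier) for each tier change not in expected_changes."""
    regressions = []
    for tool in tools:
        context = {"tool_name": tool}
        old_tier = classify(evaluate_v1(context))
        new_tier = classify(evaluate_v2(context))
        if old_tier != new_tier and tool not in expected_changes:
            regressions.append((tool, old_tier, new_tier))
    return regressions


def is_safe_to_deploy(evaluate_v1, evaluate_v2, tools, expected_changes):
    """Deploy gate: True only when every behavioral change was expected."""
    return not find_regressions(evaluate_v1, evaluate_v2, tools, expected_changes)


# Deny messages copied from 07_policy_v1.yaml and 07_policy_v2.yaml.
V1_DENY = {
    "delete_database": "Destructive action: deleting databases is never allowed",
    "transfer_funds": "Sensitive action: transfer_funds requires human approval",
    "send_email": "Sensitive action: send_email requires human approval",
    "write_file": "Write access is not permitted for this agent",
}
V2_DENY = {
    "delete_database": "Destructive action: deleting databases is never allowed",
    "transfer_funds": "Sensitive action: transfer_funds is blocked",
    "send_email": "Communication: send_email is blocked by policy",
    "write_file": "Write access is not permitted for this agent",
}
TOOLS = ["search_documents", "write_file", "send_email", "delete_database", "transfer_funds"]

v1_eval = make_evaluator(V1_DENY)
v2_eval = make_evaluator(V2_DENY)
regressions = find_regressions(v1_eval, v2_eval, TOOLS, expected_changes={"send_email"})
for tool, old, new in regressions:
    print(f"REGRESSION {tool}: {old} -> {new}")
```

Run as written, this reports `transfer_funds` as the only regression, matching Part 4 of the script. With the real library, the two callables would presumably be the `evaluate` methods of a `PolicyEvaluator(policies=[doc])` built for each version, and a CI step could fail the build whenever `is_safe_to_deploy` returns `False`.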