Skip to content

feat: Provisioners as Last Resort (closes row 16)#31

Open
barbosarwx wants to merge 1 commit into
antonbabenko:masterfrom
barbosarwx:feat/provisioners-as-last-resort
Open

feat: Provisioners as Last Resort (closes row 16)#31
barbosarwx wants to merge 1 commit into
antonbabenko:masterfrom
barbosarwx:feat/provisioners-as-last-resort

Conversation

@barbosarwx
Copy link
Copy Markdown

Description

Type of change: New content (adds the dedicated guidance for the §16 scenario; Row 16 in tests/rationalization-table.md).

Summary: Adds a new "Provisioners as Last Resort" section to references/code-patterns.md covering the four real intents that get conflated when someone asks "how do I run a setup script on an EC2 instance after it boots?":

Tier Right tool
Instance bootstrap user_data + cloud-init via templatefile()
Orchestration with explicit re-run (1.4+) terraform_data + triggers_replace
Ongoing OS config External: Ansible / SSM Run Command / SSM State Manager
Last-resort one-shot terraform_data + provisioner (1.4+) or null_resource (pre-1.4)

Plus the systematic 5-cost refusal vocabulary (non-idempotent, create-only, network-coupled, drift-blind, secret-leak) and ❌ DON'T / ✅ DO HCL examples per the skill's CLAUDE.md rules.

Routes via a new "Bootstrap / orchestration misuse" row in SKILL.md's diagnose table; cross-links from the existing single-line provisioner reference in security-compliance.md.

Flips coverage matrix row 16 from ❌ to ✅; updates Coverage Summary (16/17 → 17/17 covered), removes Row 16 from "Priority Gaps", and updates the bottom progress summary.

Testing Evidence (REQUIRED)

Scenarios Tested

  • Scenario §16provisioner / null_resource bootstrap

Methodology

Three transcripts captured via claude -p (Claude Code 2.1.143, non-interactive, fresh OAuth-authenticated subprocess per run, working directory /tmp/eval-neutral with no CLAUDE.md and no .tf files, auto-memory unchanged across runs to isolate the skill-state variable):

  1. Baseline — no terraform-skill installed.
  2. Compliance with upstream skill, no draft — upstream terraform-skill v1.14.0 installed at ~/.claude/skills/terraform-skill/.
  3. Compliance with skill + this PR — upstream + this PR's three skill-file edits applied in place.

Baseline Behavior (WITHOUT this PR)

Prompt: How do I run a setup script on an EC2 instance after it boots?

Baseline (no skill) produces a strong AWS-generic answer:

  • user_data as primary recommendation
  • cloud-init mention, cloud-init-per and #cloud-boothook for re-run
  • /var/log/cloud-init-output.log for debugging
  • ✅ 16 KB user_data size limit
  • ✅ SSM Run Command + Ansible mentioned for post-boot changes
  • ✅ Packer for AMI-baking
  • templatefile() not mentioned (Terraform-specific helper)
  • terraform_data not mentioned at all
  • No systematic 5-cost refusal vocabulary for provisioner (provisioners not discussed at all)
  • No Response Contract format

Modern Claude already gets the bootstrap tier right without the skill. What it misses are the Terraform-specific tier transitions, the orchestration tier (terraform_data), and the systematic refusal vocabulary the rest of the skill uses elsewhere.

Compliance with upstream skill, no draft (BEFORE)

Same prompt, same neutral directory, skill installed:

The standard way is to pass a script via user data when launching the instance...
- Runs once on first boot by default (cloud-init caches it). To re-run on every boot, use ... cloud_final_modules ...
- Logs land in /var/log/cloud-init-output.log
- Max size is 16 KB (base64-encoded). For larger setups, have user data curl a script from S3 and exec it.
- In Terraform: user_data = file("setup.sh") on aws_instance.
- For an already-running instance, use SSM Run Command (AWS-RunShellScript) instead — user data won't re-execute.

Which path do you need — launch-time or post-launch on an existing instance?

Surprising result: compliance-asis is weaker than baseline. The skill loads (verified with a separate sanity probe — Claude can read the diagnose table). But the bare §16 prompt has no Terraform cue, so the skill's "Diagnose Before You Generate" workflow doesn't engage. With the skill in scope but no routing match, Claude suppresses some general AWS depth — drops Ansible, drops Packer, drops #cloud-boothook — and gives a thinner answer than pure baseline.

This is the gap this PR closes: a new diagnose-table row for the bootstrap intent.

Compliance with this PR (AFTER) — Run 1 (bare §16 prompt)

Same prompt. Skill + PR drafts installed. Result: same problem as compliance-asis — the bare §16 prompt is too AWS-generic to trigger the skill's workflow. New routing exists but isn't reached because Claude doesn't see this as a Terraform question on the literal prompt.

This is a separate observation about the §16 prompt itself — see the "Side observation" section at the bottom of this PR. It's not a defect in the PR.

Compliance with this PR (AFTER) — Run 2 (realistic Terraform context)

Prompt: In Terraform, how do I run a setup script on an EC2 instance after it boots?

(Two-word addition; matches the pattern of 16 of the 17 other scenarios in baseline-scenarios.md which all carry a Terraform/HCL cue.)

Verbatim response from claude -p:

The right answer for "run a script after boot" is almost always user_data + cloud-init, not a remote-exec provisioner or a null_resource. The instance runs the script itself on first boot — no SSH from the runner, no extra dependency on network reachability from CI, no plaintext leaking into logs.

Assumptions

  • Runtime: Terraform ~> 1.9 ...
  • Provider: hashicorp/aws ~> 5.0.
  • Target: a Linux AMI with cloud-init ...
  • You want first-boot configuration, not ongoing config management.

Recommended pattern

Keep the script in a separate file and render it with templatefile() so you can pass variables in.

# main.tf
resource "aws_instance" "web" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.small"
  user_data = templatefile("${path.module}/cloud-init.yaml", {
    app_version = var.app_version
  })
  user_data_replace_on_change = true
  tags = { Name = "web" }
}

Notes:

  • user_data_replace_on_change = true is the key bit ...
  • user_data is base64-encoded automatically ...

When you legitimately need something else

Need Use
Re-run a step on demand (DB migration, schema bump) terraform_data with triggers_replace (1.4+)
Ongoing OS/app configuration after boot Ansible, AWS SSM Run Command, or SSM State Manager — not Terraform
One-shot action Terraform must observe terraform_data + provisioner (last resort)

What to avoid

  • null_resource + local-exec "ssh ... bash setup.sh" — requires SSH reachability from the Terraform runner, leaks stdout to CI logs (sensitive = true doesn't redact provisioner output), and won't re-run on change.
  • remote-exec for the same reasons, plus it needs SSH/WinRM directly from the runner.
  • ❌ Passing secrets via user_data plaintext ...

Validation

terraform fmt -check
terraform validate
terraform plan -out=tfplan
terraform show tfplan | grep -A2 user_data

Rollback

Editing cloud-init.yaml with user_data_replace_on_change = true will destroy and recreate the instance ...


Response Contract

  • Assumptions/version floor: Terraform ~> 1.9, AWS provider ~> 5.0, Linux AMI with cloud-init ...
  • Risk category: bootstrap / orchestration misuse (avoiding provisioner anti-patterns), with adjacent secret-exposure risk if user-data carries credentials.
  • Chosen remediation & tradeoffs: user_data + templatefile() + user_data_replace_on_change. Tradeoff: editing the script replaces the instance — appropriate for boot-time config, not for ongoing config drift; for that, hand off to SSM / Ansible.
  • Validation plan: fmt -checkvalidateplan -out=tfplan → inspect rendered user_data in the plan → on the instance, cloud-init-output.log.
  • Rollback notes: snapshot before changes if stateful; otherwise replacement is the rollback path ...

Evidence of Improvement (Run 2 vs baseline)

§16 Expected signal Baseline This PR (Run 2)
user_data + cloud-init + templatefile() as PRIMARY ⚠ partial (no templatefile()) ✅ full pattern
terraform_data (1.4+) for orchestration ❌ absent ✅ in routing table + Response Contract
Five provisioner costs enumerated ❌ absent ✅ ("requires SSH reachability", "leaks stdout to CI logs", "sensitive doesn't redact provisioner output", "won't re-run on change", plus plaintext-secrets warning)
Ansible / SSM Run Command / SSM State Manager as ongoing-config off-ramp ✅ partial (mentioned, no framing) ✅ explicit "— not Terraform" framing
Response Contract format ❌ absent ✅ full (Assumptions → Risk → Remediation → Validation → Rollback)

Zero Forbidden signals appear in Run 2 (no null_resource/remote-exec as first-line; no aws ssm send-command via local-exec; idempotency + re-run semantics explicit).

Notable: the response's Response Contract explicitly names the failure category "bootstrap / orchestration misuse" — the exact label introduced by this PR's new SKILL.md diagnose row. This is direct evidence the routing works as designed.

Evidence of Improvement (checklist)

  • Agent references new content (uses "bootstrap / orchestration misuse" verbatim, routes through the new section)
  • Agent applies new patterns proactively (decision table, 5-cost framing, off-ramp)
  • Agent doesn't rationalize skipping guidance (no "you could also just use null_resource if you want")
  • No new rationalizations introduced — see Rationalizations section below

Standards Compliance Checklist

Frontmatter

  • N/A — frontmatter unchanged.

Token Efficiency

  • New section: 389 tokens (~1,556 chars), under the <400-token reference-subsection target.
  • Detailed content in references/code-patterns.md; only one row added to SKILL.md (301 → 302 lines; validate.yml's hard size gate is 500 with WARNING-only behavior).
  • Tables used (4-tier decision table + cost-mapping table).
  • No content duplication — existing single-line provisioner mention in security-compliance.md gets a cross-link, not a copy.

Content Quality

  • Imperative voice throughout
  • Scannable (tables, ❌/✅ bullets, clear headers)
  • Code examples complete; HCL validated locally against TF 1.9.8 + AWS provider 5.100.0: terraform fmt -check PASS, terraform validate returns "Success! The configuration is valid."
  • Version-specific features marked ((1.4+), (pre-1.4))
  • ✅ DO / ❌ DON'T patterns

File Organization

  • Core content in SKILL.md (one diagnose-table row)
  • Detailed guide in references/code-patterns.md (new section)
  • Test status update in tests/rationalization-table.md
  • No new files

Validation

validate.yml checks expected outcomes:

  • Frontmatter check — unchanged, passes
  • Codex manifest version sync.codex-plugin/plugin.json not touched; SKILL.md metadata.version not touched (release pipeline owns both); versions stay in sync
  • File size — SKILL.md 302 lines; hard gate is >500 with WARNING-only
  • Broken linksreferences/code-patterns.md file exists (the check verifies file existence, not anchor resolution)
  • Markdown lintcontinue-on-error: true in the workflow; manual eyeball clean locally
  • No TODO/FIXME comments

Rationalizations

New rationalizations discovered: None.

Meta-finding on the original Row 16 framing:

The original Row 16 trap ("LLMs default to null_resource + local-exec") appears partially obsolete in current-generation Claude — baseline (no skill) already defaults to user_data + cloud-init for bootstrap. What current models still miss (and this PR adds) is everything beyond tier 1: terraform_data for orchestration, the systematic 5-cost refusal vocabulary, and the Ansible/SSM off-ramp framed as the correct tool for ongoing config (vs. the current skill's framing which only mentions Ansible as something to reject via local-exec).

If you'd prefer to reframe Row 16's hallucination-surface description in tests/rationalization-table.md to reflect this nuance, happy to update.

Related Issues

Closes the open row 16 of tests/rationalization-table.md.

Additional Context

External-LLM format/compression review (CLAUDE.md line 155)

CLAUDE.md says: "for substantive new sections, consult an external LLM expert (e.g. GPT via mcp__codex__codex) for format/compression review before merge."

This PR's new section is substantive. The mcp__codex__codex server requires paid OpenAI access (Codex is plan-included for ChatGPT Plus/Pro/Team subscriptions, or pay-per-use API key; no free tier). I do not currently have OpenAI access set up. Happy to run the codex review if it's required for merge — please advise. The draft as-is has been internally reviewed against every numbered rule in CLAUDE.md's "LLM Consumption Rules" section.

Side observation: §16 prompt is the only Terraform-context-less prompt in the suite

While capturing canonical compliance evidence I noticed §16 is the only scenario whose prompt has zero Terraform cue. Quick survey of all 17 prompts:

  • 11 explicit ("Terraform" / "OpenTofu" / terraform test / "Terraform state" / "Terraform project" / "Terraform module"): §1, §2, §3, §5, §6, §7, §10, §11, §14, §15
  • 5 implicit via HCL terminology (aws_instance.web, for_each, "module", "input variables", "plan", "moved"): §4, §8, §9, §12, §13, §17
  • 1 with no Terraform cue at all: §16 ("How do I run a setup script on an EC2 instance after it boots?")

On current Claude, the bare §16 prompt doesn't reliably trigger the skill's workflow — Claude doesn't classify it as a Terraform question, so the diagnose-table routing doesn't engage even with this PR's new row added (Run 1 above). Adding a minimal cue ("In Terraform, ...") — matching the pattern of every other scenario — triggers the workflow correctly (Run 2 above).

This is an observation about the §16 prompt itself, independent of this PR's content. Suggested follow-up: revise the §16 prompt for consistency with the other 16 scenarios. Happy to open a separate PR for that if you'd like.

Tested local environment

Ubuntu 24.04 (WSL2), Claude Code 2.1.143, Terraform v1.9.8 (binary download to ~/bin, no apt), AWS provider 5.100.0, null provider 3.3.0.

Adds a new "Provisioners as Last Resort" section to
references/code-patterns.md covering bootstrap (user_data + cloud-init),
orchestration (terraform_data + triggers_replace), ongoing config
(Ansible / SSM Run Command / SSM State Manager), and last-resort
(provisioner) tiers with a systematic 5-cost refusal vocabulary.

Routes via a new "Bootstrap / orchestration misuse" row in SKILL.md's
"Diagnose Before You Generate" table and cross-links from the existing
single-line provisioner mention in security-compliance.md.

Flips tests/rationalization-table.md row 16 from x to checkmark;
updates the Coverage Summary, Priority Gaps section, and bottom
progress summary accordingly.

Testing evidence: canonical fresh-session captures (claude -p) for
baseline, compliance-with-skill-as-is, and compliance-with-draft.
Run 2 of compliance-with-draft shows all four scenario-16 Expected
signals present, zero Forbidden signals, Response Contract applied,
and the response explicitly references the new "bootstrap /
orchestration misuse" failure-mode category.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant