Skip to content

Harden Strands demo adapter and document live validation results#3

Merged
ShawnC7208 merged 1 commit into
mainfrom
launch-prep
Apr 18, 2026
Merged

Harden Strands demo adapter and document live validation results#3
ShawnC7208 merged 1 commit into
mainfrom
launch-prep

Conversation

@ShawnC7208
Copy link
Copy Markdown
Owner

Summary

  • Fixes three runtime errors that surfaced during live validation of both launch APDs through Claude (Anthropic API) via the Strands adapter
  • Documents the completed live validation in `docs/public-launch-checklist.md`

What changed

`adapters/strands/demo/run_apd_with_aer.py`

  • Auto-detect `ANTHROPIC_API_KEY` and configure `AnthropicModel` so the demo runs without AWS credentials (falls back to Strands default Bedrock if not set)
  • Tightened runtime envelope prompt to specify exact enum values for `tool_invocations.outcome` — live run showed the model returning descriptive sentences instead of `"success" | "failure" | "canceled"`
  • Added robust `AgentResult` text extraction with markdown fence stripping to handle non-text content blocks in Strands responses
  • Added evidence shape guard and outcome normalizer for live model responses

`docs/public-launch-checklist.md`

  • Marked live runtime validation ✅ complete (2026-04-17)
  • Recorded run results: both APDs passed schema validation; `refund-escalation-synthesized` was 0-difference against its APD; `invoice-logging` had 1 `failed-check` diff (model correctly flagged it could not send an email in a demo context — adapter rough edge, not a spec defect)
  • Documented adapter rough edges for follow-up

`CHANGELOG.md` — changes noted under `[Unreleased]`

`.gitignore` — `launch-validation/` run artifacts excluded

Live validation verdict

The spec, schema, and conformance comparison logic all held up under real Claude-backed execution. The one `failed-check` difference on `invoice-logging` demonstrates the comparison machinery working correctly, not a defect.

Test plan

  • `npm test` passes (SDK, CLI, smoke)
  • Both APDs ran live through Strands + Anthropic API
  • `aer validate --strict` passes for both generated AERs
  • `aer compare` passes for `refund-escalation-synthesized` (0 differences)
  • `aer compare` on `invoice-logging` shows expected 1 `failed-check` diff

🤖 Generated with Claude Code

- Auto-detect ANTHROPIC_API_KEY and use AnthropicModel so the demo
  runs without AWS credentials; falls back to Strands default otherwise.
- Tighten runtime envelope prompt to specify exact enum values for
  tool_invocations.outcome ("success", "failure", "canceled") after
  the live run showed the model returning descriptive sentences.
- Add robust AgentResult text extraction with markdown fence stripping
  to handle non-text content blocks in the Strands response.
- Add evidence shape guard (isinstance dict check) and outcome
  normalizer for live model responses.
- Mark live runtime validation complete in docs/public-launch-checklist.md
  with full run results and adapter rough edges documented.
- Exclude launch-validation/ run artifacts from git.
- Add adapter changes to CHANGELOG [Unreleased].

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ShawnC7208 ShawnC7208 merged commit 8d50b98 into main Apr 18, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant