Integrate agent skill evals into testing #107

I have agent skills in this repository that need testing, but the current structure doesn't accommodate that yet.

This issue should create a structure under test/ for agent testing, likely test/agents/skills/SKILLNAME/evals.json, and potentially a similar structure for rules, with an accompanying eval run framework.

## Open Questions

- Should the behavior of ./script/test be changed to encompass all tests as part of this issue? **No.** Although this adds another type of test, we can use a separate runner for now and handle the broader test entrypoint question separately.

## Success Criteria

- ./test/agents/rules/ folder exists
- ./test/agents/skills/ folder exists
- One or more eval runner scripts exist in test/agents/, precise naming and
  behavior to be determined
- Rules evals may be a placeholder at this time, as running "without rules"
  evals is less easy than one might hope due to agent context loading
  behavior.
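
A minimal sketch of how a runner might discover skill evals under the proposed layout. The function name `discover_skill_evals` is hypothetical, and the contents of each evals.json are treated as opaque JSON because no schema has been decided yet.

```python
import json
from pathlib import Path


def discover_skill_evals(root):
    """Map each skill directory name under `root` to its parsed evals.json.

    Assumes the layout proposed in this issue: <root>/SKILLNAME/evals.json.
    Skill directories without an evals.json are skipped.
    """
    found = {}
    for evals_path in sorted(Path(root).glob("*/evals.json")):
        with evals_path.open(encoding="utf-8") as fh:
            found[evals_path.parent.name] = json.load(fh)
    return found


if __name__ == "__main__":
    # Hypothetical invocation from the repository root.
    for skill, evals in discover_skill_evals("test/agents/skills").items():
        print(skill, type(evals).__name__)
```

Keeping discovery separate from execution would let the rules side reuse the same walk once its evals stop being a placeholder.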

## Implementation Notes

- Claude Code, at least, can run with an overridden `HOME` to prevent the usual CLAUDE.md loading. If we pass through any token environment variables, or run on a Mac (keychain access), it should still be authenticated, unlike with `claude --bare`.
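
The `HOME` override above could be sketched roughly as follows. The helper names (`isolated_env`, `run_isolated`) are hypothetical, and `ANTHROPIC_API_KEY` is only an assumed example of a token variable to pass through. The agent command is left as a parameter, since the exact invocation is still to be determined.

```python
import os
import subprocess
import tempfile


def isolated_env(home, passthrough=("ANTHROPIC_API_KEY",), base=None):
    """Build a minimal environment with HOME pointed at `home`.

    Only PATH and the names in `passthrough` are copied from `base`
    (defaults to os.environ), so token variables survive but nothing
    else does. Some platforms may need further variables copied.
    """
    base = os.environ if base is None else base
    env = {"HOME": str(home)}
    for name in ("PATH", *passthrough):
        if name in base:
            env[name] = base[name]
    return env


def run_isolated(command, passthrough=("ANTHROPIC_API_KEY",), timeout=300):
    """Run `command` with a throwaway HOME and return the CompletedProcess."""
    with tempfile.TemporaryDirectory() as home:
        return subprocess.run(
            command,
            env=isolated_env(home, passthrough),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # let the caller decide how to treat failures
        )
```

A runner might then call something like `run_isolated(["claude", "-p", prompt])`, assuming print mode is the right way to drive the agent here; that invocation should be checked against the installed CLI before relying on it.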

