Conversation
WilliamBerryiii
left a comment
Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.
After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.
<!-- <interview-phase-1> -->
1. What is the name of the AI agent you are evaluating? If it does not have a name yet, give it one.
2. What specific business problem or scenario does this agent address?
3. What are the business KPIs associated with this agent (for example, increase revenue, decrease costs, transform business process)?
4. What tasks is this agent designed to perform? What is explicitly out of scope?
5. What are key risks (Responsible AI Framework) in implementing this agent (for example, PII vulnerabilities, negative impact from model inaccuracy)?
6. Who are the primary users of this agent? How likely is this agent to be adopted by primary users? What are barriers to adoption?
<!-- </interview-phase-1> -->
The XML comment boundaries (<!-- <interview-phase-1> --> … <!-- </interview-phase-1> -->) work as section markers, but the pattern used by other agents in this repo is to express the workflow as an enumerated Required Protocol that spells out each rule or constraint as a numbered item. The current Required Protocol section at the bottom of this file has four items, which is a good start.
Consider moving more of the behavioral expectations from the XML-bounded sections into the protocol list or into the phase headings themselves. For examples of how other agents structure this, see:
- `.github/agents/hve-core/subagents/phase-implementor.agent.md`: Required Protocol with numbered invariants that are referenced from the Required Steps.
- `.github/agents/hve-core/subagents/prompt-evaluator.agent.md`: Required Protocol for evaluation-specific constraints paired with Required Steps.
This would make the constraints directly visible and enumerable rather than embedded in template comment tags.
The workflow is already expressed as an enumerated Required Protocol, and it also has XML comment boundaries. I can remove the XML comment boundaries, but it is unclear how to move more of the behavioral expectations into the protocol list or into the phase headings themselves.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage diff (main vs. #1279):

| | main | #1279 | +/- |
| --- | --- | --- | --- |
| Coverage | 87.72% | 87.71% | -0.02% |
| Files | 61 | 61 | |
| Lines | 9320 | 9320 | |
| Hits | 8176 | 8175 | -1 |
| Misses | 1144 | 1145 | +1 |
Description
This pull request adds a comprehensive new prompt, `eval-dataset-creator.md`, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to the user's skill level and the agent's characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.

Key additions and improvements:
- Evaluation Dataset Creation Workflow
- Dataset and Documentation Artifacts: outputs written under `data/evaluation/`, with separate subfolders for datasets (`.json`, `.csv`) and documentation (`curation-notes.md`, `metric-selection.md`, `tool-recommendations.md`)
- Tooling and Persona Guidance
Related Issue(s)
Closes #1267
Type of Change
Select all that apply:
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
- Iterated with the `prompt-builder` agent and addressed all feedback
- New agent (`.github/agents/*.agent.md`)

Sample Prompts (for AI Artifact Contributions)
User Request:
Execution Flow:
Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:
Purpose: Gather all necessary context before generating any artifacts.
Phase 1: Agent Context
Phase 2: Agent Capabilities
Phase 3: Evaluation Scenarios
Phase 4: Persona & Tooling
Datasets are written to `data/evaluation/datasets/`; documentation artifacts are written to `data/evaluation/docs/`.

Decision Points & Tool Usage Summary
Output Artifacts:
- `data/evaluation/datasets/-eval-dataset.json`:

```json
{
  "metadata": {
    "schema_version": "1",
    "agent_name": "example-agent",
    "created_date": "2026-04-02",
    "version": "1.0.0",
    "total_pairs": 30,
    "distribution": {
      "easy": 6,
      "grounding_source_checks": 3,
      "hard": 12,
      "negative": 6,
      "safety": 3
    },
    "persona": "pro-code",
    "evaluation_mode": ["manual", "batch"],
    "recommended_tool": "azure-ai-foundry"
  },
  "evaluation_pairs": [
    {
```

- `data/evaluation/docs/-curation-notes.md`
- `data/evaluation/docs/-metric-selection.md`
- `data/evaluation/docs/-tool-recommendations.md`
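The `metadata.distribution` block in the sample dataset lends itself to a simple consistency check: the per-category counts should sum to `total_pairs`. A minimal sketch, assuming the field names from the sample metadata above (the `validateMetadata` helper itself is hypothetical, not part of the prompt):

```javascript
// Hypothetical validator: confirm that the difficulty/category
// distribution in the dataset metadata adds up to total_pairs.
function validateMetadata(meta) {
  const sum = Object.values(meta.distribution).reduce((a, b) => a + b, 0);
  if (sum !== meta.total_pairs) {
    throw new Error(`distribution sums to ${sum}, expected ${meta.total_pairs}`);
  }
  return true;
}

// Field values taken from the sample metadata in the PR description.
const metadata = {
  schema_version: "1",
  agent_name: "example-agent",
  total_pairs: 30,
  distribution: { easy: 6, grounding_source_checks: 3, hard: 12, negative: 6, safety: 3 },
};

validateMetadata(metadata); // 6 + 3 + 12 + 6 + 3 === 30, so this passes
```

Checks like this are cheap to run before a batch evaluation and catch the common editing mistake of adding a Q&A pair without updating the counts.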
Success Indicators:
Testing
- Ran `/prompt-analyze` 3 times with all findings addressed
- `npm run lint:all` ✅
- `npm run lint:md-links` ✅
- `npm run validate:copyright` ✅ (148/148 files, 100%)
- `npm run spell-check` ✅ (281 files, 0 issues)
- `npm run plugin:generate` ✅ (14 plugins, 0 errors)
- `npm run plugin:validate` ✅ (0 errors)
- `npm run lint:collections-metadata` ✅ (0 errors)

Checklist
Required Checks
AI Artifact Contributions
- `/prompt-analyze` to review contribution
- `prompt-builder` review

Required Automated Checks
The following validation commands must pass before merging:
- `npm run lint:md`
- `npm run spell-check`
- `npm run lint:frontmatter`
- `npm run validate:skills`
- `npm run lint:md-links`
- `npm run lint:ps`
- `npm run plugin:generate`

Security Considerations