Add experimental gh aw experiments darwin command#42081
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅
✅ Test Quality Sentinel completed test quality analysis.
✅ Design Decision Gate 🏗️ completed the design decision gate check.
Pull request overview
Adds an experimental “Darwin mode” workflow-evolution command under the hidden gh aw experiments command group, allowing users to evaluate observed experiment counts, select a winner, archive a generation snapshot, and optionally rewrite the workflow’s experiments: frontmatter to promote the winner and define the next population.
Changes:
- Introduces gh aw experiments darwin <workflow> <experiment> with --winner, repeatable --variant, --archive-dir, --apply, and --json output.
- Implements Darwin planning: ranking variants by observed counts (tie-breaking by declaration order), winner selection, next-generation construction, optional archive writing, and frontmatter mutation.
- Adds focused unit tests for ranking/next-generation/frontmatter update/archive writing, plus docs describing Darwin mode usage.
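The ranking rule in the list above (order by observed count, break ties by declaration order) can be sketched with a stable sort. This is a minimal illustration, not the PR's implementation: the variantScore type and rankByCount function are hypothetical names, and it assumes that a higher observed count ranks first.

```go
package main

import (
	"fmt"
	"sort"
)

// variantScore is a hypothetical stand-in for the PR's ranking row type.
type variantScore struct {
	Name  string
	Count int
}

// rankByCount orders variants by observed count, highest first (an assumption).
// sort.SliceStable keeps the declaration order for variants with equal counts,
// which gives the tie-breaking rule without extra bookkeeping.
func rankByCount(declared []string, counts map[string]int) []variantScore {
	ranking := make([]variantScore, 0, len(declared))
	for _, name := range declared {
		// A variant missing from the map gets the zero value, i.e. count 0.
		ranking = append(ranking, variantScore{Name: name, Count: counts[name]})
	}
	sort.SliceStable(ranking, func(i, j int) bool {
		return ranking[i].Count > ranking[j].Count
	})
	return ranking
}

func main() {
	ranking := rankByCount(
		[]string{"control", "concise", "detailed"},
		map[string]int{"control": 2, "concise": 5, "detailed": 5},
	)
	for _, r := range ranking {
		fmt.Println(r.Name, r.Count)
	}
}
```

With an empty counts map, every variant has count 0 and the stable sort leaves the declared order untouched, which is the generation-zero case.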
Show a summary per file
| File | Description |
|---|---|
| pkg/cli/experiments_darwin_command.go | Implements the Darwin subcommand, ranking/promotion logic, archive writing, and workflow frontmatter mutation. |
| pkg/cli/experiments_darwin_command_test.go | Adds unit tests covering ranking, next-generation construction, archive writing, and frontmatter updates. |
| pkg/cli/experiments_command.go | Wires the new darwin subcommand into the existing hidden experiments command group and updates help text. |
| pkg/cli/experiments_command_test.go | Extends command wiring/flag tests to include the new darwin subcommand. |
| docs/src/content/docs/experimental/experiments.md | Documents Darwin mode usage and behavior within the experiments feature. |
Review details
- Files reviewed: 5/5 changed files
- Comments generated: 5
- Review effort level: Low
| if strings.TrimSpace(config.Experiment) == "" { | ||
| return nil, nil, errors.New("experiment name is required") | ||
| } | ||
| if strings.TrimSpace(config.ArchiveDir) == "" { | ||
| config.ArchiveDir = defaultDarwinArchiveDir | ||
| } |
| nextVariants := buildDarwinNextVariants(winner, expCfg.Variants, ranking, config.NextVariants) | ||
| if len(nextVariants) < 2 { | ||
| return nil, nil, fmt.Errorf("darwin generation for %q must contain at least 2 variants", config.Experiment) | ||
| } | ||
| archivePath, archivedAt := darwinArchiveLocation(config.ArchiveDir, workflowID, config.Experiment) | ||
| rankedRows := darwinRankingRows(ranking, expCfg.Variants[0], winner) | ||
|
|
||
| plan := &DarwinPlan{ | ||
| WorkflowID: workflowID, | ||
| WorkflowPath: workflowPath, | ||
| ExperimentName: config.Experiment, | ||
| Branch: branchName, | ||
| ArchivePath: archivePath, | ||
| ArchivedAt: archivedAt, | ||
| Apply: config.Apply, | ||
| Winner: winner, | ||
| CurrentVariants: slices.Clone(expCfg.Variants), | ||
| NextVariants: nextVariants, | ||
| Ranking: rankedRows, | ||
| Analysis: analysis, | ||
| } | ||
| archive := &DarwinArchive{ | ||
| WorkflowID: workflowID, | ||
| WorkflowPath: workflowPath, | ||
| ExperimentName: config.Experiment, | ||
| Branch: branchName, | ||
| ArchivedAt: archivedAt, | ||
| Winner: winner, | ||
| CurrentVariants: slices.Clone(expCfg.Variants), | ||
| NextVariants: slices.Clone(nextVariants), | ||
| Ranking: rankedRows, | ||
| Analysis: analysis, | ||
| State: state, | ||
| } |
| branchName := experimentsBranchPrefix + workflowID | ||
| details, err := fetchLocalExperimentDetails(branchName, workflowID) | ||
| if err != nil { | ||
| state := emptyExperimentState() | ||
| details = experimentDetailsFromState(workflowID, branchName, state) | ||
| if !strings.Contains(err.Error(), "not found locally") { | ||
| return nil, nil, err | ||
| } | ||
| } | ||
| stateRef := "origin/" + branchName | ||
| if !gitRefExists(stateRef) { | ||
| stateRef = branchName | ||
| } | ||
| state := emptyExperimentState() | ||
| if gitRefExists(stateRef) { | ||
| state = readLocalExperimentState(stateRef) | ||
| } | ||
|
|
| Example: ` ` + string(constants.CLIExtensionPrefix) + ` experiments darwin myworkflow style | ||
| ` + string(constants.CLIExtensionPrefix) + ` experiments darwin myworkflow style --winner concise | ||
| ` + string(constants.CLIExtensionPrefix) + ` experiments darwin myworkflow style --variant concise --variant detailed --apply | ||
| ` + string(constants.CLIExtensionPrefix) + ` experiments darwin myworkflow style --archive-dir .github/experiments/archive --json`, | ||
| Args: cobra.ExactArgs(2), |
| fmt.Fprintln(os.Stderr, console.FormatInfoMessage("Darwin mode: "+plan.WorkflowID+"/"+plan.ExperimentName)) | ||
| fmt.Fprintf(os.Stderr, " Branch: %s\n", plan.Branch) | ||
| fmt.Fprintf(os.Stderr, " Winner: %s\n", plan.Winner) | ||
| fmt.Fprintf(os.Stderr, " Recommendation:%s\n", plan.Analysis.Recommendation) |
Design Decision Gate: ADR Required
This PR makes significant changes to core business logic (621 new lines in
Draft ADR committed:
What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.
Why ADRs Matter
ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.
Michael Nygard ADR Format Reference
An ADR must contain these four sections to be considered complete:
All ADRs are stored in
Darwin Mode review
The overall design is solid — Darwin Mode fits naturally into the existing experiments: feature, the ranking/promotion logic is clear, and the test coverage covers the key paths well. There are a few correctness and reliability issues that should be addressed before merge.
Blocking
| # | File | Issue |
|---|---|---|
| 1 | experiments_darwin_command.go:129 | buildDarwinPlan errors are swallowed (return nil); CLI exits 0 on failure |
Non-blocking (recommended fixes)
| # | File | Issue |
|---|---|---|
| 2 | experiments_darwin_command.go:427 | printDarwinPlan writes all output to stderr; human-readable output should go to stdout |
| 3 | experiments_darwin_command.go:119 | StringSlice splits on commas; StringArray is safer for variant names |
| 4 | experiments_darwin_command.go:136 | No rollback if archive succeeds but frontmatter update fails |
| 5 | experiments_darwin_command.go:201 | Fragile error-substring sentinel; prefer a typed sentinel error |
See inline comments for details and suggested fixes.
🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 59.1 AIC · ⌖ 6.86 AIC · ⊞ 4.8K
| plan, archive, err := buildDarwinPlan(config) | ||
| if err != nil { | ||
| fmt.Fprintln(os.Stderr, console.FormatErrorMessage(err.Error())) | ||
| return nil |
Error silently swallowed — exit code will be 0 on failure.
Returning nil after printing the error means the CLI exits with code 0 even when buildDarwinPlan fails. Callers (scripts, CI jobs) cannot detect the failure. Change to return err so cobra propagates the non-zero exit code:
if err != nil {
	return err
}
If the intent is to suppress the stack trace (showing only the formatted message), consider wrapping as a display error type handled by the root command, consistent with other subcommands in this package.
@copilot please address this.
| } | ||
|
|
||
| func printDarwinPlan(plan *DarwinPlan) { | ||
| fmt.Fprintln(os.Stderr, console.FormatInfoMessage("Darwin mode: "+plan.WorkflowID+"/"+plan.ExperimentName)) |
Human-readable output written to stderr instead of stdout.
All plan output in printDarwinPlan goes to os.Stderr. Non-error output should go to os.Stdout so users can pipe or redirect results; stderr is conventionally reserved for errors and diagnostics:
// current: informational output on stderr prevents piping
fmt.Fprintln(os.Stderr, console.FormatInfoMessage(...))
// fix: write human-readable plan output to stdout
fmt.Fprintln(os.Stdout, console.FormatInfoMessage(...))
Note that the JSON path already correctly uses os.Stdout (line 145). The text path should be consistent.
@copilot please address this.
| cmd.Flags().Bool("apply", false, "Archive the current generation and update the workflow file in place") | ||
| cmd.Flags().String("winner", "", "Override the promoted winner variant") | ||
| cmd.Flags().String("archive-dir", defaultDarwinArchiveDir, "Directory where Darwin archives are written when --apply is used") | ||
| cmd.Flags().StringSlice("variant", nil, "Variant names for the next generation (repeatable)") |
StringSlice splits on commas — use StringArray instead.
StringSlice (a pflag flag type, reached here through cobra's cmd.Flags()) treats comma-separated values as multiple elements: --variant a,b becomes ["a", "b"] rather than ["a,b"]. This can surprise users and may silently misbehave if a future variant name contains a comma. StringArray treats each --variant flag as a single value:
// current: splits on commas unexpectedly
cmd.Flags().StringSlice("variant", nil, "...")
// fix: one flag = one value
cmd.Flags().StringArray("variant", nil, "...")
The GetStringSlice call on line 102 should also change to GetStringArray.
@copilot please address this.
| if err := writeDarwinArchive(plan.ArchivePath, archive); err != nil { | ||
| return fmt.Errorf("failed to write Darwin archive: %w", err) | ||
| } | ||
| if err := applyDarwinPromotion(plan.WorkflowPath, plan.ExperimentName, plan.NextVariants); err != nil { |
Partial failure leaves archive written but workflow not updated (no rollback).
If writeDarwinArchive succeeds but applyDarwinPromotion fails, the archive JSON is persisted on disk while the workflow frontmatter is not updated — leaving the state inconsistent with no way to know whether to re-run or not.
Options to consider:
- Write the archive last (after frontmatter is updated), so a failure before archive creation leaves the workflow clean.
- Write the archive to a temp path, apply the frontmatter, then rename the temp file atomically.
- Document this limitation with a // TODO noting that partial failure leaves an orphaned archive.
@copilot please address this.
| if err != nil { | ||
| state := emptyExperimentState() | ||
| details = experimentDetailsFromState(workflowID, branchName, state) | ||
| if !strings.Contains(err.Error(), "not found locally") { |
Fragile error-message substring check as a sentinel.
Using strings.Contains(err.Error(), "not found locally") to distinguish "branch not found" from a real error couples this code to the exact phrasing returned by fetchLocalExperimentDetails. If that message ever changes (even capitalisation), this silently breaks by propagating the error instead of falling back gracefully.
Prefer a typed sentinel error (e.g., errors.Is(err, ErrBranchNotFound)) or a boolean return value from fetchLocalExperimentDetails to express the "not found" case explicitly:
// in fetchLocalExperimentDetails:
var ErrBranchNotFound = errors.New("experiments branch not found locally")
// here:
if !errors.Is(err, ErrBranchNotFound) {
return nil, nil, err
}
@copilot please address this.
🧪 Test Quality Sentinel Report
📊 Metrics & Test Classification (10 tests analyzed)
Go: 10 (
Darwin Mode Review Summary 🧬
REQUEST_CHANGES: Two correctness issues need addressing before merge; everything else is medium or minor.
🔴 Blocking (fix before merge)
🟡 Medium (should fix)
🟢 Minor (nice to have)
Overall the architecture is clean; Darwin Mode integrates nicely into the existing
@copilot please address the review comments above.
Skills-Based Review 🧠
Applied /tdd, /diagnose, /zoom-out, /grill-with-docs, and /improve-codebase-architecture — requesting changes on a correctness bug and a few medium-priority issues.
📋 Key Themes & Highlights
Blocking Issues
- Exit-code bug (line 129): RunExperimentsDarwin returns nil on buildDarwinPlan failure, so the CLI exits 0 even when the workflow or experiment is not found. Should be return err.
- Ranking silently zeroed (line 197): when the experiments branch only exists on origin, details.Experiments is empty and all variants rank equally, potentially promoting the wrong winner without warning.
Medium Issues
- Fragile error matching (line 201): strings.Contains(err.Error(), "not found locally") should use a typed sentinel error.
- StringSlice vs StringArray (line 119): comma-separated --variant a,b is split into two variants unexpectedly; StringArray matches the documented one-flag-per-value usage.
- Side-effect ordering (line 132): archive + frontmatter mutations happen before JSON marshaling; marshal first, then apply.
- Apply in output struct (line 51): CLI flag leaked into DarwinPlan's JSON schema.
Minor Issues
- Archive timestamp collision (line 366)
- Missing space in Recommendation: output (line 430)
- Two missing test cases: selectDarwinWinner error path, and all-zero counts (generation zero)
Positive Highlights
- ✅ Clean integration into the existing experiments: feature, with no new top-level commands or parallel optimizer path
- ✅ Covers both bare-array and object-form experiment frontmatter
- ✅ Good separation of buildDarwinPlan (pure computation) from side effects (writeDarwinArchive, applyDarwinPromotion)
- ✅ Comprehensive archive schema with full ranking snapshot and raw state, useful for auditing
- ✅ Documentation is clear with concrete examples and an important note about new variant names
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 128.7 AIC · ⌖ 10.6 AIC · ⊞ 6.6K
| plan, archive, err := buildDarwinPlan(config) | ||
| if err != nil { | ||
| fmt.Fprintln(os.Stderr, console.FormatErrorMessage(err.Error())) | ||
| return nil |
[/tdd] RunExperimentsDarwin returns nil on buildDarwinPlan failure, so the CLI always exits with code 0 even when the workflow or experiment is not found. This masks errors from callers and CI scripting.
💡 Fix
Change return nil to return err:
if err != nil {
fmt.Fprintln(os.Stderr, console.FormatErrorMessage(err.Error()))
return err
}
Note that writeDarwinArchive and applyDarwinPromotion errors (lines 134–138) are already propagated correctly; this brings buildDarwinPlan errors into line.
@copilot please address this.
| if err != nil { | ||
| state := emptyExperimentState() | ||
| details = experimentDetailsFromState(workflowID, branchName, state) | ||
| if !strings.Contains(err.Error(), "not found locally") { |
[/improve-codebase-architecture] Error classification via strings.Contains(err.Error(), "not found locally") is fragile — it couples this code to the exact wording of an error message from fetchLocalExperimentDetails. A rename or rewording in that function will silently break the fallback path here.
💡 Suggestion
Introduce a sentinel error in the fetchLocalExperimentDetails package and use errors.Is:
// In the fetch function's package:
var ErrBranchNotFoundLocally = errors.New("branch not found locally")
// Here:
if err != nil {
if !errors.Is(err, ErrBranchNotFoundLocally) {
return nil, nil, err
}
details = experimentDetailsFromState(workflowID, branchName, emptyExperimentState())
}
@copilot please address this.
| cmd.Flags().Bool("apply", false, "Archive the current generation and update the workflow file in place") | ||
| cmd.Flags().String("winner", "", "Override the promoted winner variant") | ||
| cmd.Flags().String("archive-dir", defaultDarwinArchiveDir, "Directory where Darwin archives are written when --apply is used") | ||
| cmd.Flags().StringSlice("variant", nil, "Variant names for the next generation (repeatable)") |
[/grill-with-docs] StringSlice silently splits comma-separated input, so --variant a,b is parsed as two variants (a and b). StringArray treats each flag invocation as a single atomic value, which matches the documented usage (--variant concise --variant detailed).
💡 Change
// Before
cmd.Flags().StringSlice("variant", nil, "Variant names for the next generation (repeatable)")
// After
cmd.Flags().StringArray("variant", nil, "Variant names for the next generation (repeatable)")
Also update the flag getter in RunE:
// Before
nextVariants, _ := cmd.Flags().GetStringSlice("variant")
// After
nextVariants, _ := cmd.Flags().GetStringArray("variant")
Variant names in this codebase use underscores (e.g. step_by_step) so a comma collision is unlikely, but the semantic mismatch with the docs is still worth fixing.
@copilot please address this.
| return nil | ||
| } | ||
|
|
||
| if config.Apply { |
[/diagnose] Side effects (writeDarwinArchive + applyDarwinPromotion) are committed to disk before the JSON output block at line 141. If json.MarshalIndent were to fail (unlikely but possible), the archive file and workflow frontmatter would be mutated with no output to the caller. Consider marshaling first and returning early on error before writing to disk.
💡 Suggested ordering
var jsonBytes []byte
if config.JSONOutput {
var err error
jsonBytes, err = json.MarshalIndent(plan, "", " ")
if err != nil {
return fmt.Errorf("failed to marshal JSON: %w", err)
}
}
if config.Apply {
if err := writeDarwinArchive(plan.ArchivePath, archive); err != nil {
return fmt.Errorf("failed to write Darwin archive: %w", err)
}
if err := applyDarwinPromotion(plan.WorkflowPath, plan.ExperimentName, plan.NextVariants); err != nil {
return fmt.Errorf("failed to update workflow file: %w", err)
}
}
if config.JSONOutput {
fmt.Fprintln(os.Stdout, string(jsonBytes))
return nil
}
@copilot please address this.
| Branch string `json:"branch"` | ||
| ArchivePath string `json:"archive_path,omitempty"` | ||
| ArchivedAt string `json:"archived_at,omitempty"` | ||
| Apply bool `json:"apply"` |
[/zoom-out] Apply bool is a command-layer flag, not result data — including it in DarwinPlan (the JSON output type) leaks the CLI invocation detail into the output schema. Consumers of --json output would see "apply": true/false, which is redundant (they already know whether they passed --apply) and may cause confusion.
💡 Suggestion
Remove Apply from DarwinPlan. printDarwinPlan already receives the full plan and the plan.Apply check at line 433 can instead use the config.Apply value passed into RunExperimentsDarwin:
func printDarwinPlan(plan *DarwinPlan, applied bool) {
...
if applied {
fmt.Fprintf(os.Stderr, " Archive: %s\n", plan.ArchivePath)
}
}
@copilot please address this.
| } | ||
|
|
||
| func darwinArchiveLocation(baseDir, workflowID, experimentName string) (string, string) { | ||
| timestamp := time.Now().UTC().Format("20060102T150405Z") |
[/diagnose] The archive filename uses a second-granularity UTC timestamp. Two Darwin runs for the same workflow+experiment within the same wall-clock second will produce identical paths, and os.WriteFile (line 394) will silently overwrite the earlier archive — losing a generation's history.
💡 Options
- Nanosecond precision: time.Now().UTC().Format("20060102T150405.000000000Z") is sortable and collision-resistant in practice.
- Short random suffix: append fmt.Sprintf("-%04x", rand.Uint32()&0xffff) for a low-collision identifier.
- Existence check: stat the path before writing and return an error if it already exists (os.O_CREATE|os.O_EXCL).
@copilot please address this.
| fmt.Fprintln(os.Stderr, console.FormatInfoMessage("Darwin mode: "+plan.WorkflowID+"/"+plan.ExperimentName)) | ||
| fmt.Fprintf(os.Stderr, " Branch: %s\n", plan.Branch) | ||
| fmt.Fprintf(os.Stderr, " Winner: %s\n", plan.Winner) | ||
| fmt.Fprintf(os.Stderr, " Recommendation:%s\n", plan.Analysis.Recommendation) |
[/tdd] Minor display bug: " Recommendation:%s\n" is missing a space after the colon, producing Recommendation:EXTEND instead of Recommendation: EXTEND. The other fields use " Branch: %s\n" (with alignment padding), so this one stands out.
💡 Fix
// Before
fmt.Fprintf(os.Stderr, " Recommendation:%s\n", plan.Analysis.Recommendation)
// After
fmt.Fprintf(os.Stderr, " Recommendation: %s\n", plan.Analysis.Recommendation)
A snapshot test for printDarwinPlan output would catch regressions like this automatically.
@copilot please address this.
| assert.Equal(t, "third", ranking[1].Name) | ||
| assert.Equal(t, "control", ranking[2].Name) | ||
| assert.True(t, ranking[2].CurrentControl) | ||
| } |
[/tdd] selectDarwinWinner has no test for the error path: when --winner names a variant that is not in the declared list, the function returns an error. This branch is untested.
💡 Suggested test
func TestSelectDarwinWinnerRejectsUnknownOverride(t *testing.T) {
ranking := []DarwinVariantScore{
{Name: "control", Count: 5},
{Name: "challenger", Count: 3},
}
_, err := selectDarwinWinner("unknown", ranking)
require.Error(t, err)
assert.Contains(t, err.Error(), "not a declared variant")
}
@copilot please address this.
| "github.com/stretchr/testify/require" | ||
| ) | ||
|
|
||
| func TestRankDarwinVariants(t *testing.T) { |
[/tdd] TestRankDarwinVariants covers only a case with mixed counts. The generation-zero scenario — where all variants have zero counts (no experiment runs yet) — should be explicitly tested to confirm that declaration order is the tiebreaker and no panics occur.
💡 Suggested test
func TestRankDarwinVariantsAllZero(t *testing.T) {
ranking := rankDarwinVariants(
[]string{"alpha", "beta", "gamma"},
map[string]int{},
)
require.Len(t, ranking, 3)
// With all zero counts, declaration order is preserved
assert.Equal(t, "alpha", ranking[0].Name)
assert.Equal(t, "beta", ranking[1].Name)
assert.Equal(t, "gamma", ranking[2].Name)
assert.True(t, ranking[0].CurrentControl)
}
@copilot please address this.
| } | ||
|
|
||
| branchName := experimentsBranchPrefix + workflowID | ||
| details, err := fetchLocalExperimentDetails(branchName, workflowID) |
[/diagnose] fetchLocalExperimentDetails is called without falling back to the remote ref when the local branch doesn't exist. The details.Experiments used for ranking at line 214 may be all-zero even when origin/<branchName> has real counts — while state (read separately at lines 209–212) may carry actual data. This means Darwin could silently rank all variants equally and promote by declaration order on machines that haven't fetched the experiments branch.
💡 Suggestion
Ensure fetchLocalExperimentDetails (or a wrapper) checks origin/<branchName> as a fallback, or build stats from state when details.Experiments is empty:
if len(details.Experiments) == 0 && state != nil {
details = experimentDetailsFromState(workflowID, branchName, state)
}Also worth a test: run Darwin with a non-existent local branch but a populated state and confirm the ranking reflects the state's counts.
@copilot please address this.
This adds Darwin Mode as an extension of the existing experiments feature: a new gh aw experiments darwin subcommand that evaluates an experiment’s current variants, archives the current generation, and promotes a winner into the next generation. It keeps the workflow rooted in experiments: rather than introducing a separate optimizer path.
CLI surface
- gh aw experiments darwin <workflow> <experiment> in the hidden experiments command group
- --winner to override the promoted variant
- --variant to define the next generation explicitly
- --archive-dir to control archive location
- --apply to persist archive + frontmatter changes
- --json for machine-readable output
Darwin evaluation and promotion
- experiments/<workflow> branch history
Generation archive
- .github/experiments/archive/<workflow>/<experiment>/
Workflow mutation
- experiments: frontmatter in-place
Docs and coverage
Example:
This promotes concise to the control slot, archives the current generation, and rewrites the style experiment’s next variant population in the workflow frontmatter.