Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration. Configuration used: Organization UI | Review profile: CHILL | Plan: Pro

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a PostgreSQL control-plane store, config/schema and loader support, runtime wiring (CLI/engine/frontend/scheduler), queue leasing, UUIDv7 identifiers, development docker-compose/Make targets, a CI job, and extensive unit/integration tests.

Changes
Control-plane store and runtime integration

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 8
🧹 Nitpick comments (6)
compose.postgres.yaml (1)
10-11: 💤 Low value
Consider pinning the volume mount to the PostgreSQL data directory.

The official `postgres` Docker image stores data under `/var/lib/postgresql/data`, not `/var/lib/postgresql`. While data is still persisted (since `/data` is a subdirectory of the mount point), pinning the mount to the actual data directory avoids inadvertently capturing other runtime files (e.g., run scripts under `/var/lib/postgresql/`) and aligns with the documented convention.

🛠️ Suggested change
```diff
-      - dagu-dev-postgres-data:/var/lib/postgresql
+      - dagu-dev-postgres-data:/var/lib/postgresql/data
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@compose.postgres.yaml` around lines 10 - 11, The volumes mount currently pins dagu-dev-postgres-data to /var/lib/postgresql; change the target path to the PostgreSQL data directory /var/lib/postgresql/data to match the official image convention and avoid capturing unrelated runtime files; update the volumes entry that references dagu-dev-postgres-data under the volumes key to mount at /var/lib/postgresql/data instead of /var/lib/postgresql.

internal/persis/dagrunstore/postgres/attempt.go (1)
58-63: ⚡ Quick win
Silent JSON unmarshal failure leaves `att.dag` nil with no diagnostic.

If `row.DagData` is non-empty but malformed, the error is dropped and `att.dag` stays nil; later `Open` won't repersist a DAG (it skips the `att.dag != nil` branch) and `eventData` quietly returns nil, dropping the `DAGFileNameDataKey` from emitted events. At minimum, log the failure so corrupted rows surface in logs.

♻️ Proposed change
```diff
 if len(row.DagData) > 0 {
 	var dag core.DAG
-	if err := json.Unmarshal(row.DagData, &dag); err == nil {
+	if err := json.Unmarshal(row.DagData, &dag); err != nil {
+		logger.Warn(context.Background(), "Failed to unmarshal stored DAG definition",
+			tag.AttemptID(row.AttemptID), tag.Error(err))
+	} else {
 		att.dag = &dag
 	}
 }
```
Or, since `newAttempt` is only called with a fresh row from a query, propagate the error via the constructor signature.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/attempt.go` around lines 58 - 63, The JSON unmarshal of row.DagData currently swallows errors causing att.dag to remain nil and later drop DAGFileNameDataKey from events; update newAttempt to surface failures by changing its signature to return (attempt, error) instead of silently ignoring json.Unmarshal errors on row.DagData, return the unmarshal error (or a wrapped error including row identifiers) so callers can handle/log it, and update callers of newAttempt to propagate/handle that error (or log it) so malformed DAG rows are visible in logs/metrics.

internal/persis/dagrunstore/postgres/migration_test.go (1)
16-28: 💤 Low value
Minor: `assert.Contains` on raw SQL is brittle to formatting.

Whitespace, quoting, or harmless reordering in the migration will break this test even though the constraints remain semantically equivalent. Consider tightening to a regex (`assert.Regexp`) or, better, exercising the constraints behaviorally against an ephemeral Postgres instance via testcontainers (you already have docker compose tooling for dev pg).
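For the regex route, a minimal sketch (the migration file name and `migrations.FS` come from the existing test; the two patterns are illustrative rather than the test's actual constraint set, and `regexp`, `require`, and `assert` imports are assumed):

```go
sql, err := migrations.FS.ReadFile("20260506000000_create_dag_run_attempts.sql")
require.NoError(t, err)

// Tolerates whitespace changes around the length constraint.
assert.Regexp(t, regexp.MustCompile(`char_length\s*\(`), string(sql))
// Survives reformatting of the reserved-name exclusion list.
assert.Regexp(t, regexp.MustCompile(`(?i)lower\s*\([^)]*\)\s+not\s+in`), string(sql))
```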
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/migration_test.go` around lines 16 - 28, The test TestMigrationUsesExistingIdentifierConstraints is brittle because assert.Contains checks raw SQL substrings; replace those with assert.Regexp calls using robust regexes that tolerate whitespace/quoting (targeting migrations.FS.ReadFile("20260506000000_create_dag_run_attempts.sql") and the same semantic patterns), e.g. compile regexes for each constraint (UUID regex, char_length checks, lower(...) NOT IN, and allowed character classes) and assert.Regexp(t, re, sql) for each expected constraint; alternatively, for stronger coverage, convert the test to run the migration against an ephemeral Postgres (using your docker/testcontainers setup) and validate the constraints behaviorally by attempting inserts that should pass/fail.

internal/cmn/schema/config.schema.json (1)
939-973: 💤 Low value
Optional: schema could enforce `dsn` when `backend: postgres`.

The runtime loader already validates this and produces `dag_run_store.postgres.dsn is required`, so users get a clear error. However, IDE/CI schema validators won't catch a missing DSN until the loader runs. If you want pre-runtime feedback, add an `allOf`/`if-then` constraint mirroring the `BotsDef` pattern already in this file.

♻️ Example schema constraint
"DAGRunStoreDef": { "type": "object", "description": "DAG-run status persistence configuration.", "additionalProperties": false, "properties": { "backend": { "type": "string", "description": "DAG-run status persistence backend. Default: file.", "enum": ["file", "postgres"], "default": "file" }, "postgres": { "$ref": "#/definitions/DAGRunStorePostgresDef" } - } + }, + "allOf": [ + { + "if": { + "properties": { "backend": { "const": "postgres" } }, + "required": ["backend"] + }, + "then": { + "required": ["postgres"], + "properties": { + "postgres": { + "required": ["dsn"], + "properties": { "dsn": { "type": "string", "minLength": 1 } } + } + } + } + } + ] },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmn/schema/config.schema.json` around lines 939 - 973, Add a conditional schema rule so that when DAGRunStoreDef.backend == "postgres" the DAGRunStorePostgresDef.dsn property is required; update the DAGRunStoreDef definition to include an if/then (or allOf with a conditional) that checks {properties: {backend: {const: "postgres"}}} and then requires the path dag_run_store.postgres.dsn, mirroring the existing BotsDef pattern in this file; reference DAGRunStoreDef and DAGRunStorePostgresDef to locate where to insert the if/then constraint.

internal/persis/dagrunstore/postgres/queries/attempts.sql (2)
237-243: 💤 Low value
`RenameDAGRuns` may touch unrelated sub-DAG rows.

The WHERE clause matches rows where `dag_name = old_name` (including sub-DAG rows whose own `dag_name` happens to equal the renamed root DAG's old name, but which aren't conceptually related to this rename). For those rows the SET is effectively a no-op for `dag_name` (guarded by `is_root`) and only changes `root_dag_name` if it also matches, but `updated_at` is bumped regardless. If you intend the rename to be strictly scoped to runs whose root is `old_name`, consider tightening the WHERE clause (e.g., `WHERE root_dag_name = old_name OR (is_root AND dag_name = old_name)`).
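As a sketch, with a stand-in SET clause mirroring the guarded logic described above (not the query's verbatim body):

```sql
UPDATE dagu_dag_run_attempts
SET dag_name      = CASE WHEN is_root THEN sqlc.arg(new_name)::text ELSE dag_name END,
    root_dag_name = CASE WHEN root_dag_name = sqlc.arg(old_name)::text
                         THEN sqlc.arg(new_name)::text ELSE root_dag_name END,
    updated_at    = now()
WHERE root_dag_name = sqlc.arg(old_name)::text
   OR (is_root AND dag_name = sqlc.arg(old_name)::text);
```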
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/queries/attempts.sql` around lines 237 - 243, The RenameDAGRuns UPDATE currently matches any row with dag_name = old_name and thus can bump updated_at on unrelated sub-DAG rows; change the WHERE to restrict to rows whose root is old_name or to root runs matching old_name (e.g., WHERE root_dag_name::text = sqlc.arg(old_name)::text OR (is_root AND dag_name::text = sqlc.arg(old_name)::text)) so only runs conceptually related to the renamed root DAG in table dagu_dag_run_attempts are touched, leaving the existing SET logic for dag_name, root_dag_name and updated_at intact.
1-2: 💤 Low value
Advisory lock keyspace remains effectively 32-bit despite casting to `bigint`.

`hashtext(...)` returns `int4` (32-bit signed integer). Casting to `bigint` only sign-extends the value without expanding the collision space; collisions remain at the int32 level (~1 in 4B). For DAG-run scoped locking this is usually acceptable, but if you ever lock per `(dag_name, dag_run_id)` with high cardinality, consider widening the key using either the two-argument form `pg_advisory_xact_lock(int, int)` or combining multiple hash outputs.
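For example, a two-argument variant, assuming the caller can supply both key components separately:

```sql
-- name: LockDAGRunKey :exec
-- Two independent 32-bit hashes give an effectively 64-bit lock key.
SELECT pg_advisory_xact_lock(
    hashtext(sqlc.arg(dag_name)::text),
    hashtext(sqlc.arg(dag_run_id)::text)
);
```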
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/queries/attempts.sql` around lines 1 - 2, The advisory lock is effectively 32-bit because hashtext(...) returns int4; update the LockDAGRunKey query to use the two-argument advisory-lock form so you get a 64-bit keyspace instead of sign-extending a single int4: replace the single pg_advisory_xact_lock(hashtext(sqlc.arg(lock_key)::text)::bigint) call with pg_advisory_xact_lock(<first_int4>, <second_int4>) where <first_int4> and <second_int4> are two distinct 32-bit hashes derived from the lock components (e.g., hashtext of dag_name and hashtext of dag_run_id or two different hash functions on lock_key), casting each hashtext(...) result to int; ensure you reference the SQL comment name LockDAGRunKey and the existing sqlc.arg(lock_key) expression when extracting the components.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/engine/engine.go`:
- Around line 72-80: Engine/context teardown never closes the concrete
dagRunStore Postgres resources because exec.DAGRunStore lacks Close; add cleanup
by either (A) adding Close() error to the exec.DAGRunStore interface and
implement it on postgres.Store, then call e.dagRunStore.Close() in Engine.Close
(and the context shutdown in cmd/context.go), or (B) if you prefer minimal
change, perform a type-assertion where the store is created/closed: if closer,
ok := dagRunStore.(interface{ Close() error }); ok { _ = closer.Close() } (do
this in Engine.Close and the context shutdown paths referenced in
cmd/context.go) so the Postgres pool is released; ensure error handling/logging
around the Close call.
In `@internal/persis/dagrunstore/postgres/attempt.go`:
- Around line 231-254: WriteStepMessages currently does an unsafe
read-modify-write by calling GetAttempt, unmarshalling messages_data, mutating
in memory and calling UpdateAttemptMessages, which races with concurrent
writers; fix by performing the merge atomically in the DB instead of
client-side: add a new SQL exec query (e.g. MergeAttemptStepMessages) that uses
jsonb_set(coalesce(messages_data,'{}'::jsonb), ARRAY[sqlc.arg(step_name)::text],
sqlc.arg(messages)::jsonb, true) and updates updated_at, then replace the
GetAttempt + Marshal/Unmarshal + UpdateAttemptMessages sequence in
Attempt.WriteStepMessages to call this new query (or alternatively wrap
GetAttempt/UpdateAttemptMessages in a transaction and SELECT ... FOR UPDATE) to
prevent lost concurrent writes.
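A sketch of that merge query (the `jsonb_set` expression is quoted from the comment above; the table, `attempt_id` predicate, and `updated_at` handling are assumed to match the existing schema):

```sql
-- name: MergeAttemptStepMessages :exec
UPDATE dagu_dag_run_attempts
SET messages_data = jsonb_set(
        coalesce(messages_data, '{}'::jsonb),
        ARRAY[sqlc.arg(step_name)::text],
        sqlc.arg(messages)::jsonb,
        true),
    updated_at = now()
WHERE attempt_id = sqlc.arg(attempt_id);
```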
In `@internal/persis/dagrunstore/postgres/cursor.go`:
- Around line 119-121: Replace the ignored json.Marshal error with proper
handling: change the marshal call to capture the error (data, err :=
json.Marshal(normalized)) and add a short check (if err != nil { panic(err) } or
propagate/handle per surrounding conventions) before computing the sha256 and
calling hex.EncodeToString; this touches the json.Marshal call that produces
data, the variable normalized, and the subsequent
sha256.Sum256/hex.EncodeToString use.
In `@internal/persis/dagrunstore/postgres/db/attempts.sql.go`:
- Around line 1-13: The sqlc-generated Go files in package db (files starting
with the "Code generated by sqlc" header) are not excluded by the current tool
patterns, so update the tooling: either add the sqlc output directory
(internal/persis/dagrunstore/postgres/db) to the exclusion lists in the Makefile
`addlicense` target and .golangci.yml (matching the "Code generated by sqlc"
files), or change the sqlc config to emit generated code into a `gen/`
subdirectory so existing `-ignore "./**/gen/**"` rules apply; ensure the change
prevents addlicense from modifying files with the sqlc header.
In `@internal/persis/dagrunstore/postgres/store.go`:
- Around line 489-495: When iterating rows, don't silently drop decode errors
from statusFromListRow: call statusFromListRow(row) as before but if err != nil
then if errors.Is(err, exec.ErrNoStatusData) continue; otherwise log the
unexpected decode error (including err and identifying row keys like run ID)
using the package/store logger before continuing so schema drift/corruption is
visible; keep the existing label filter check
(core.NewLabels(status.Labels).MatchesFilters) unchanged.
- Around line 264-281: The RecentAttempts method currently swallows DB errors
(the error returned from s.queries.RecentAttemptsByName) and returns nil; change
RecentAttempts in type Store to log the error instead of silently dropping it:
capture err from s.queries.RecentAttemptsByName and call the Store logger (or a
package logger on Store, e.g. s.logger or s.log) at warn/debug level including
the dag name and the error (and keep returning nil/empty as the function
signature requires), so failures like transient Postgres outages are visible in
logs; ensure the log message references the DagName and the underlying err from
RecentAttemptsByName.
- Around line 32-38: PoolConfig currently declares MaxIdleConns but
applyPoolConfig doesn't use it; map it to pgxpool.Config.MinIdleConns (not
MaxConns) in the applyPoolConfig function: when PoolConfig.MaxIdleConns > 0 set
cfg.MinIdleConns = PoolConfig.MaxIdleConns but cap it at cfg.MaxConns (or
PoolConfig.MaxOpenConns if that’s used to set cfg.MaxConns) to avoid
MinIdleConns > MaxConns; alternatively remove MaxIdleConns from
PoolConfig/schema if you prefer not to support this mapping. Ensure references:
PoolConfig, MaxIdleConns, applyPoolConfig, and pgxpool.Config.MinIdleConns are
updated accordingly.
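A sketch of that mapping (assuming `PoolConfig` carries `MaxOpenConns`/`MaxIdleConns` as described, and a pgx version whose `pgxpool.Config` exposes `MinIdleConns`):

```go
func applyPoolConfig(cfg *pgxpool.Config, pc PoolConfig) {
	if pc.MaxOpenConns > 0 {
		cfg.MaxConns = int32(pc.MaxOpenConns)
	}
	if pc.MaxIdleConns > 0 {
		minIdle := int32(pc.MaxIdleConns)
		if cfg.MaxConns > 0 && minIdle > cfg.MaxConns {
			minIdle = cfg.MaxConns // never keep more idle conns than the pool allows
		}
		cfg.MinIdleConns = minIdle
	}
}
```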
In `@sqlc.yaml`:
- Around line 22-23: The sqlc override maps dagu_workspace_name to a
non-nullable go_type causing sqlc to emit interface{} + WorkspaceValid; update
sqlc.yaml to use a nullable-compatible Go type (e.g.,
"github.com/guregu/null.String" or "sql.NullString" or a pointer "string" via a
nullable override) for the field dagu_workspace_name so generated models.go has
Workspace as a proper nullable string type and removes the interface{} pattern;
after updating sqlc.yaml, regenerate code (sqlc generate) and verify models.go
contains the chosen nullable type and that existing code using attempt.Workspace
is adjusted to the new type (or add nil/Valid checks if using sql.NullString).
---
Nitpick comments:
In `@compose.postgres.yaml`:
- Around line 10-11: The volumes mount currently pins dagu-dev-postgres-data to
/var/lib/postgresql; change the target path to the PostgreSQL data directory
/var/lib/postgresql/data to match the official image convention and avoid
capturing unrelated runtime files—update the volumes entry that references
dagu-dev-postgres-data under the volumes key to mount at
/var/lib/postgresql/data instead of /var/lib/postgresql.
In `@internal/cmn/schema/config.schema.json`:
- Around line 939-973: Add a conditional schema rule so that when
DAGRunStoreDef.backend == "postgres" the DAGRunStorePostgresDef.dsn property is
required; update the DAGRunStoreDef definition to include an if/then (or allOf
with a conditional) that checks {properties: {backend: {const: "postgres"}}} and
then requires the path dag_run_store.postgres.dsn, mirroring the existing
BotsDef pattern in this file; reference DAGRunStoreDef and
DAGRunStorePostgresDef to locate where to insert the if/then constraint.
In `@internal/persis/dagrunstore/postgres/attempt.go`:
- Around line 58-63: The JSON unmarshal of row.DagData currently swallows errors
causing att.dag to remain nil and later drop DAGFileNameDataKey from events;
update newAttempt to surface failures by changing its signature to return
(attempt, error) instead of silently ignoring json.Unmarshal errors on
row.DagData, return the unmarshal error (or a wrapped error including row
identifiers) so callers can handle/log it, and update callers of newAttempt to
propagate/handle that error (or log it) so malformed DAG rows are visible in
logs/metrics.
In `@internal/persis/dagrunstore/postgres/migration_test.go`:
- Around line 16-28: The test TestMigrationUsesExistingIdentifierConstraints is
brittle because assert.Contains checks raw SQL substrings; replace those with
assert.Regexp calls using robust regexes that tolerate whitespace/quoting
(targeting migrations.FS.ReadFile("20260506000000_create_dag_run_attempts.sql")
and the same semantic patterns), e.g. compile regexes for each constraint (UUID
regex, char_length checks, lower(...) NOT IN, and allowed character classes) and
assert.Regexp(t, re, sql) for each expected constraint; alternatively, for
stronger coverage, convert the test to run the migration against an ephemeral
Postgres (using your docker/testcontainers setup) and validate the constraints
behaviorally by attempting inserts that should pass/fail.
In `@internal/persis/dagrunstore/postgres/queries/attempts.sql`:
- Around line 237-243: The RenameDAGRuns UPDATE currently matches any row with
dag_name = old_name and thus can bump updated_at on unrelated sub-DAG rows;
change the WHERE to restrict to rows whose root is old_name or to root runs
matching old_name (e.g., WHERE root_dag_name::text = sqlc.arg(old_name)::text OR
(is_root AND dag_name::text = sqlc.arg(old_name)::text)) so only runs
conceptually related to the renamed root DAG in table dagu_dag_run_attempts are
touched, leaving the existing SET logic for dag_name, root_dag_name and
updated_at intact.
- Around line 1-2: The advisory lock is effectively 32-bit because hashtext(...)
returns int4; update the LockDAGRunKey query to use the two-argument
advisory-lock form so you get a 64-bit keyspace instead of sign-extending a
single int4: replace the single
pg_advisory_xact_lock(hashtext(sqlc.arg(lock_key)::text)::bigint) call with
pg_advisory_xact_lock(<first_int4>, <second_int4>) where <first_int4> and
<second_int4> are two distinct 32-bit hashes derived from the lock components
(e.g., hashtext of dag_name and hashtext of dag_run_id or two different hash
functions on lock_key), casting each hashtext(...) result to int; ensure you
reference the SQL comment name LockDAGRunKey and the existing sqlc.arg(lock_key)
expression when extracting the components.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 832bded3-7795-4463-adab-e5d4ba7b908d
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (27)
`Makefile`, `compose.postgres.yaml`, `go.mod`, `internal/cmd/context.go`, `internal/cmn/config/config.go`, `internal/cmn/config/definition.go`, `internal/cmn/config/key_hints.go`, `internal/cmn/config/loader.go`, `internal/cmn/config/loader_test.go`, `internal/cmn/schema/config.schema.json`, `internal/core/exec/dagrun.go`, `internal/engine/engine.go`, `internal/persis/dagrunstore/postgres/attempt.go`, `internal/persis/dagrunstore/postgres/cursor.go`, `internal/persis/dagrunstore/postgres/db/attempts.sql.go`, `internal/persis/dagrunstore/postgres/db/db.go`, `internal/persis/dagrunstore/postgres/db/models.go`, `internal/persis/dagrunstore/postgres/migrate.go`, `internal/persis/dagrunstore/postgres/migration_test.go`, `internal/persis/dagrunstore/postgres/migrations/20260506000000_create_dag_run_attempts.sql`, `internal/persis/dagrunstore/postgres/migrations/embed.go`, `internal/persis/dagrunstore/postgres/queries/attempts.sql`, `internal/persis/dagrunstore/postgres/store.go`, `internal/persis/dagrunstore/store.go`, `internal/persis/filedagrun/query_cursor.go`, `internal/service/frontend/api/v1/dagruns.go`, `sqlc.yaml`
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 6
🧹 Nitpick comments (3)
internal/persis/dagrunstore/postgres/attempt.go (2)
96-99: 💤 Low value
Silently swallowing `ReadStatus` errors during `Open` initialization.

`lastPersistedEventTypeLocked` returns `""` on any `ReadStatus` failure (including transient DB errors), and `Open` then proceeds with an empty `lastEmittedEventType`. That can lead to duplicate-emitted transition events on a subsequent `Write` (since the emitter compares against an empty baseline). Logging a warning preserves diagnosability without changing behavior.

📝 Log on read failure
```diff
 func (att *Attempt) lastPersistedEventTypeLocked(ctx context.Context) eventstore.EventType {
 	status, err := att.ReadStatus(ctx)
 	if err != nil {
+		logger.Warn(ctx, "Failed to read persisted DAG-run status during Open", tag.Error(err))
 		return ""
 	}
 	eventType, ok := eventstore.PersistedDAGRunEventTypeForStatus(status.Status)
 	if !ok {
 		return ""
 	}
 	return eventType
 }
```

Also applies to: 267-277
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/attempt.go` around lines 96 - 99, The Open initialization currently swallows ReadStatus errors because lastPersistedEventTypeLocked(ctx) returns "" on failure, so update Open (and the similar block around the other occurrence at the 267-277 region) to capture the error from lastPersistedEventTypeLocked/ReadStatus and, if non-nil, log a warning via the package logger (including context and the error) before falling back to the empty string; keep the existing behavior (do not change the return flow), but ensure att.lastEmittedEventType assignment still happens and att.opened.Store(true) is executed after logging so transient DB read errors are at least recorded for diagnostics.
73-99: 💤 Low value
`Open` re-persists `dag_data` on every invocation.

If `newAttempt` already populated `att.dag` from a stored row (lines 58-65), `Open` will re-serialize and re-`UPDATE` the same DAG bytes, a wasteful round-trip on warm reopen. Consider only persisting when `dag_data` was previously empty or when `SetDAG` was called after construction. Not a correctness issue.
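A rough sketch of the guard (the `dagDirty` field, the `db.UpdateAttemptDAGParams` shape, and `att.id` are assumptions about the surrounding code; `SetDAG` would set the flag):

```go
// In Open: persist only when the DAG changed since it was loaded or set.
if att.dag != nil && att.dagDirty {
	data, err := json.Marshal(att.dag)
	if err != nil {
		return err
	}
	if err := att.queries.UpdateAttemptDAG(ctx, db.UpdateAttemptDAGParams{
		AttemptID: att.id,
		DagData:   data,
	}); err != nil {
		return err
	}
	att.dagDirty = false // cleared so warm reopens skip the UPDATE
}
```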
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/attempt.go` around lines 73 - 99, Attempt.Open currently re-serializes and always UPDATEs dag_data when att.dag is non-nil, causing unnecessary DB writes for attempts initialized from stored rows; modify Open (method Attempt.Open) to only persist the DAG when the persisted dag_data was previously empty or when SetDAG was called after construction (i.e., track a flag like dagDirty or check a nil/empty persistedDAG marker populated in newAttempt), so Open checks that flag (or persistedDAG state) before calling att.queries.UpdateAttemptDAG; ensure SetDAG sets the dirty flag and Open clears it after a successful update.

internal/persis/dagrunstore/postgres/queries/attempts.sql (1)
245-249: 💤 Low value
`DeleteDAGRunRows` can return duplicate `dag_run_id` values.

Because each `dag_run_id` may have multiple attempt rows (and root runs have a separate `dag_run_id` from sub-runs but all share the same `root_dag_run_id`), `RETURNING dag_run_id` will emit one row per deleted attempt, so callers iterating these for filesystem/working-dir cleanup will repeat the same path. Either dedupe in SQL or trust the caller to dedupe; flagging in case the caller doesn't.

♻️ Dedupe at the source
```diff
 -- name: DeleteDAGRunRows :many
-DELETE FROM dagu_dag_run_attempts
-WHERE root_dag_name = sqlc.arg(root_dag_name)
-  AND root_dag_run_id = sqlc.arg(root_dag_run_id)
-RETURNING dag_run_id;
+WITH deleted AS (
+  DELETE FROM dagu_dag_run_attempts
+  WHERE root_dag_name = sqlc.arg(root_dag_name)
+    AND root_dag_run_id = sqlc.arg(root_dag_run_id)
+  RETURNING dag_run_id
+)
+SELECT DISTINCT dag_run_id FROM deleted;
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/dagrunstore/postgres/queries/attempts.sql` around lines 245 - 249, The DeleteDAGRunRows query currently returns one row per deleted attempt, causing duplicate dag_run_id values; change the query that deletes from dagu_dag_run_attempts (matching root_dag_name and root_dag_run_id) to return unique dag_run_id values by using a DISTINCT return (e.g., use RETURNING DISTINCT dag_run_id or wrap the DELETE in a CTE and SELECT DISTINCT dag_run_id from the deleted rows) so callers iterating results get each dag_run_id only once.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/cmd/context.go`:
- Around line 671-680: The dagrun store created by dagrunstore.New (assigned to
schedulerRunStore) must be closed if subsequent scheduler.New(...) fails; add a
short-lived defer to close schedulerRunStore (calling its Close/Release method)
immediately after successful creation and cancel that defer/transfer ownership
right after scheduler.New returns the sched instance successfully so that if
scheduler construction errors the Postgres pool is released; do the same pattern
for the other similar creation at the 683-686 site.
- Around line 907-911: Direct NewContext callers (e.g., places that call
NewContext directly instead of via NewCommand) are creating resource-owning
Contexts without invoking the new Context.Close contract, causing leaks for
Postgres-backed run stores; update each direct NewContext usage (notably the
callers referenced in the review) to ensure Close is always run by adding a
deferred close like defer ctx.Close(ctx) immediately after NewContext returns or
by refactoring the call site to go through the existing wrapper used by
NewCommand so cleanup happens automatically; look for symbol NewContext and
ensure Context.Close(ctx) is deferred (or route through NewCommand's wrapper) at
those call sites.
In `@internal/persis/dagrunstore/postgres/cursor.go`:
- Around line 95-110: The queryFilterHash implementation currently builds a
local normalized struct (named normalized) but omits opts.WorkspaceFilter, so
include a stable representation of the workspace filter in that struct before
computing the hash: add a field like WorkspaceFilter []string
`json:"workspace_filter,omitempty"` (or a single string if the filter is scalar)
to the normalized literal and initialize it from opts.WorkspaceFilter (guarding
nil -> empty slice/string for stability) inside queryFilterHash so the workspace
scope is part of the hashed cursor fingerprint.
In `@internal/persis/dagrunstore/postgres/store.go`:
- Around line 635-680: The RejectActive check in RemoveDAGRun is a TOCTOU bug
because it runs before acquiring the advisory lock; move the FindAttempt +
status check inside the existing withTx callback immediately after calling
q.LockDAGRunKey so the status is read under the same lock (mirror how
CompareAndSwapLatestAttemptStatus does it). Concretely, inside
Store.RemoveDAGRun's withTx closure, after q.LockDAGRunKey(...) call, call
s.FindAttempt(ctx, dagRun) (or query the attempt/status via the transaction if
available) and then attempt.ReadStatus(ctx); if status.Status.IsActive() return
exec.ErrDAGRunActive (formatted the same way the code currently uses) before
calling q.DeleteDAGRunRows, so deletion cannot proceed for active runs.
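A sketch of the reordered flow (`withTx`, `lockKey`, and `deleteParams` stand in for the store's actual transaction helper and parameters; the other calls are named in the comment above):

```go
return s.withTx(ctx, func(q *db.Queries) error {
	if err := q.LockDAGRunKey(ctx, lockKey); err != nil {
		return err
	}
	// Read the status under the same advisory lock that guards deletion.
	attempt, err := s.FindAttempt(ctx, dagRun)
	if err != nil {
		return err
	}
	status, err := attempt.ReadStatus(ctx)
	if err != nil {
		return err
	}
	if status.Status.IsActive() {
		return exec.ErrDAGRunActive
	}
	_, err = q.DeleteDAGRunRows(ctx, deleteParams)
	return err
})
```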
In `@internal/service/scheduler/scheduler.go`:
- Around line 769-771: When startup fails, ensure the DAG-run store is closed in
the cleanupFailedStartup path just like in Stop(): call
exec.CloseDAGRunStore(ctx, s.dagRunStore) from cleanupFailedStartup and log any
error with logger.Warn(ctx, "Failed to close scheduler DAG-run store",
tag.Error(err)); this reuses the same close logic used in Stop() so
s.dagRunStore isn't left open on Start() failures.
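As a sketch (receiver and method shape assumed; the call and log line are quoted from the comment above):

```go
func (s *Scheduler) cleanupFailedStartup(ctx context.Context) {
	// Mirror Stop(): release the DAG-run store so a failed Start() doesn't leak the pool.
	if err := exec.CloseDAGRunStore(ctx, s.dagRunStore); err != nil {
		logger.Warn(ctx, "Failed to close scheduler DAG-run store", tag.Error(err))
	}
	// ... existing failed-startup cleanup ...
}
```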
In `@Makefile`:
- Line 295: Remove the exemption for internal/persis/dagrunstore/postgres/db/**
from the addlicense ignore list in the Makefile so Go sources there are
processed by the addlicense target; if those files are generated, modify the
generator that writes files into internal/persis/dagrunstore/postgres/db (or the
sqlc post-generation step) to insert the required GPL v3 header into each *.go
file at generation time instead of keeping the directory excluded.
---
Nitpick comments:
In `@internal/persis/dagrunstore/postgres/attempt.go`:
- Around line 96-99: The Open initialization currently swallows ReadStatus
errors because lastPersistedEventTypeLocked(ctx) returns "" on failure, so
update Open (and the similar block around the other occurrence at the 267-277
region) to capture the error from lastPersistedEventTypeLocked/ReadStatus and,
if non-nil, log a warning via the package logger (including context and the
error) before falling back to the empty string; keep the existing behavior (do
not change the return flow), but ensure att.lastEmittedEventType assignment
still happens and att.opened.Store(true) is executed after logging so transient
DB read errors are at least recorded for diagnostics.
- Around line 73-99: Attempt.Open currently re-serializes and always UPDATEs
dag_data when att.dag is non-nil, causing unnecessary DB writes for attempts
initialized from stored rows; modify Open (method Attempt.Open) to only persist
the DAG when the persisted dag_data was previously empty or when SetDAG was
called after construction (i.e., track a flag like dagDirty or check a nil/empty
persistedDAG marker populated in newAttempt), so Open checks that flag (or
persistedDAG state) before calling att.queries.UpdateAttemptDAG; ensure SetDAG
sets the dirty flag and Open clears it after a successful update.
In `@internal/persis/dagrunstore/postgres/queries/attempts.sql`:
- Around line 245-249: The DeleteDAGRunRows query currently returns one row per
deleted attempt, causing duplicate dag_run_id values; change the query that
deletes from dagu_dag_run_attempts (matching root_dag_name and root_dag_run_id)
to return unique dag_run_id values by using a DISTINCT return (e.g., use
RETURNING DISTINCT dag_run_id or wrap the DELETE in a CTE and SELECT DISTINCT
dag_run_id from the deleted rows) so callers iterating results get each
dag_run_id only once.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 89d98aa6-fbcc-4c2f-aad6-3794cd1b68f5
📒 Files selected for processing (17)
`.golangci.yml`, `Makefile`, `compose.postgres.yaml`, `internal/cmd/context.go`, `internal/cmn/schema/config.schema.json`, `internal/core/exec/dagrun.go`, `internal/engine/engine.go`, `internal/persis/dagrunstore/postgres/attempt.go`, `internal/persis/dagrunstore/postgres/attempt_test.go`, `internal/persis/dagrunstore/postgres/cursor.go`, `internal/persis/dagrunstore/postgres/db/attempts.sql.go`, `internal/persis/dagrunstore/postgres/db/models.go`, `internal/persis/dagrunstore/postgres/migration_test.go`, `internal/persis/dagrunstore/postgres/queries/attempts.sql`, `internal/persis/dagrunstore/postgres/store.go`, `internal/service/scheduler/scheduler.go`, `sqlc.yaml`
@coderabbitai review

✅ Actions performed: Review triggered.

@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/persis/dagrunstore/postgres/queries/attempts.sql`:
- Around line 262-268: The RenameDAGRuns update currently only changes
relational columns; modify the UPDATE for RenameDAGRuns to also update the JSONB
status_data->>'name' for affected rows so partial-name searches remain
consistent: when root_dag_name or (is_root and dag_name) matches
sqlc.arg(old_name), use jsonb_set on status_data to replace the top-level "name"
value with sqlc.arg(new_name); keep the same WHERE logic and updated_at update
and ensure only rows whose name equals the old_name are changed (use the same
predicates used for dag_name/root_dag_name checks).
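A sketch of the extra JSONB rewrite to fold into `RenameDAGRuns` (predicates repeat the WHERE logic above; the `{name}` path assumes the top-level key described in the comment):

```sql
UPDATE dagu_dag_run_attempts
SET status_data = jsonb_set(status_data, '{name}', to_jsonb(sqlc.arg(new_name)::text)),
    updated_at  = now()
WHERE (root_dag_name = sqlc.arg(old_name)::text
       OR (is_root AND dag_name = sqlc.arg(old_name)::text))
  AND status_data->>'name' = sqlc.arg(old_name)::text;
```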
- Around line 256-260: The DeleteDAGRunRows query currently returns one
dag_run_id per deleted attempt, producing duplicates when a run had multiple
attempts; change the SQL to deduplicate the returned IDs by wrapping the DELETE
in a CTE and selecting DISTINCT dag_run_id from that CTE (so keep the query name
DeleteDAGRunRows and the DELETE against dagu_dag_run_attempts with filters
root_dag_name and root_dag_run_id, but return unique dag_run_id values). Ensure
the query still uses the same sqlc :many shape but emits each dag_run_id only
once.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e79f6a70-3c9d-4ac5-b227-b22576ef599a
📒 Files selected for processing (4)
`internal/persis/dagrunstore/postgres/db/attempts.sql.go`, `internal/persis/dagrunstore/postgres/migrations/20260506000001_add_cleanup_latest_index.sql`, `internal/persis/dagrunstore/postgres/queries/attempts.sql`, `internal/persis/dagrunstore/postgres/retention_query_test.go`
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (1)
internal/cmn/config/loader_test.go (1)
427-567: ⚡ Quick win
Add the missing `backend: postgres` / empty-DSN regression case.

This suite never asserts the failure mode where Postgres is selected but every role DSN is empty, so the loader can keep accepting an unusable DAG-run store config and still leave this test green.
Suggested patch
t.Run("InvalidBackend", func(t *testing.T) { err := loadWithErrorFromYAML(t, ` dag_run_store: backend: cassandra `) require.Error(t, err) assert.Contains(t, err.Error(), "invalid dag_run_store.backend") }) + + t.Run("PostgresRequiresAtLeastOneDSN", func(t *testing.T) { + err := loadWithErrorFromYAML(t, ` +dag_run_store: + backend: postgres +`) + + require.Error(t, err) + assert.Contains(t, err.Error(), "dag_run_store.postgres") + }) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmn/config/loader_test.go` around lines 427 - 567, Add a new subtest inside TestLoad_DAGRunStore (e.g., "PostgresMissingDSNs") that uses loadWithErrorFromYAML to load YAML with dag_run_store.backend: postgres but no postgres role DSNs provided, then require.Error on the result and assert the error message indicates missing/empty DSNs (assert.Contains(err.Error(), "dsn")). This ensures the loader rejects a postgres backend when all role DSNs are empty; reference the TestLoad_DAGRunStore function and the helper loadWithErrorFromYAML to implement the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/cmd/context.go`:
- Around line 932-936: RunE currently calls ctx.Close(ctx) after runFunc
returns, but that reuses a possibly canceled command context; change the Close
call to use a non-cancelable context (e.g., context.Background() or a short
timeout derived from background) so cleanup (ctx.Close / exec.CloseDAGRunStore)
runs reliably; update the call site where runFunc is invoked and replace the ctx
parameter passed into ctx.Close(...) with a new non-cancelable context while
keeping the existing error-join logic around runErr and closeErr.
In `@internal/cmd/dagrunstore_role_test.go`:
- Around line 16-33: Add a test entry covering the "exec" command mapping:
update the test table in internal/cmd/dagrunstore_role_test.go to include {name:
"exec", want: dagrunstore.RoleAgent} so dagRunStoreRoleForCommand (the function
under test) exercises the explicit exec → RoleAgent branch (defined in
internal/cmd/context.go).
In `@internal/cmn/config/config.go`:
- Around line 637-645: validateDAGRunStore currently accepts backend=postgres
even when the Postgres connection block is empty; modify validateDAGRunStore so
that when c.DAGRunStore.Backend == DAGRunStoreBackendPostgres it verifies the
Postgres role DSN fields on c.DAGRunStore (the role DSN fields used to connect:
e.g. primary/writer/reader/replica DSNs) are not all empty and return a
descriptive fmt.Errorf (e.g. "invalid dag_run_store.postgres: all role DSNs are
empty") if they are; keep the existing switch and error path for unknown
backends and only return nil when a usable Postgres DSN is present.
In `@internal/cmn/config/key_hints.go`:
- Around line 125-154: The existing key-hints map includes legacy keys for
"dagrunstore.postgres.<role>.automigrate" but is missing the corresponding
"dag_run_store.postgres.<role>.automigrate" variants, so add entries mapping
"dag_run_store.postgres.server.automigrate" ->
"dag_run_store.postgres.server.auto_migrate",
"dag_run_store.postgres.scheduler.automigrate" ->
"dag_run_store.postgres.scheduler.auto_migrate", and
"dag_run_store.postgres.agent.automigrate" ->
"dag_run_store.postgres.agent.auto_migrate" (matching the pattern used for pool
keys) so viper-lowercased "automigrate" is recognized and routed to the existing
auto_migrate config names.
In `@internal/engine/engine.go`:
- Around line 72-81: After creating dagRunStore in the engine constructor,
ensure the store is closed if later initialization fails: add a deferred cleanup
immediately after dagRunStore, e.g. defer func() { if dagRunStore != nil { _ =
dagRunStore.Close() } }() so any early return (e.g. from newDAGStore or other
failures in New) will release the Postgres pool; then, once the engine is
successfully constructed, set dagRunStore = nil (or otherwise cancel the defer)
before returning to avoid closing the live store. Reference dagRunStore,
newDAGStore and the constructor function (New) when applying this change.
---
Nitpick comments:
In `@internal/cmn/config/loader_test.go`:
- Around line 427-567: Add a new subtest inside TestLoad_DAGRunStore (e.g.,
"PostgresMissingDSNs") that uses loadWithErrorFromYAML to load YAML with
dag_run_store.backend: postgres but no postgres role DSNs provided, then
require.Error on the result and assert the error message indicates missing/empty
DSNs (assert.Contains(err.Error(), "dsn")). This ensures the loader rejects a
postgres backend when all role DSNs are empty; reference the
TestLoad_DAGRunStore function and the helper loadWithErrorFromYAML to implement
the check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 576b3d9d-ac6b-4a0c-a41b-a6f985422d1d
📒 Files selected for processing (12)
`Makefile`, `internal/cmd/context.go`, `internal/cmd/dagrunstore_role_test.go`, `internal/cmn/config/config.go`, `internal/cmn/config/definition.go`, `internal/cmn/config/key_hints.go`, `internal/cmn/config/loader.go`, `internal/cmn/config/loader_test.go`, `internal/cmn/schema/config.schema.json`, `internal/engine/engine.go`, `internal/persis/dagrunstore/store.go`, `internal/persis/dagrunstore/store_test.go`
@coderabbitai review

✅ Actions performed: Review triggered.
♻️ Duplicate comments (5)
internal/engine/engine.go (1)
72-87: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Close the DAG-run store on constructor rollback.

If `dagrunstore.New(...)` succeeds and `newDAGStore(...)` fails, the store is leaked because cleanup only happens in `(*Engine).Close`. Add a deferred rollback cleanup immediately after construction and cancel it before the successful return.

Suggested patch
```diff
 	dagRunStore, err := dagrunstore.New(
 		ctx,
 		cfg,
 		dagrunstore.WithRole(dagrunstore.RoleAgent),
 		dagrunstore.WithLatestStatusToday(false),
 		dagrunstore.WithLocation(cfg.Core.Location),
 	)
 	if err != nil {
 		return nil, fmt.Errorf("initialize DAG-run store: %w", err)
 	}
+	cleanupDAGRunStore := true
+	defer func() {
+		if cleanupDAGRunStore {
+			_ = coreexec.CloseDAGRunStore(context.Background(), dagRunStore)
+		}
+	}()

 	serviceRegistry := fileserviceregistry.New(cfg.Paths.ServiceRegistryDir)
 	dagRunMgr := runtime.NewManager(dagRunStore, procStore, cfg)

 	dagStore, err := newDAGStore(cfg, nil, false)
 	if err != nil {
 		return nil, err
 	}
+	cleanupDAGRunStore = false
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/engine.go` around lines 72 - 87, After creating dagRunStore via dagrunstore.New in the constructor, add a deferred rollback that closes/releases dagRunStore if subsequent initialization fails (e.g., when newDAGStore returns an error); implement this by deferring a cleanup closure right after dagRunStore is assigned that checks whether the constructor is still failing and calls the appropriate dagRunStore.Close/Release method, and ensure you cancel/disable that deferred rollback (or set err=nil / flip a success flag) just before returning the successfully-constructed Engine so the store is not closed on success. Reference dagrunstore.New, dagRunStore, newDAGStore and the constructor that currently defers cleanup only in Engine.Close to locate where to add and cancel the rollback.

internal/cmn/config/key_hints.go (1)
125-154: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add the `dag_run_store.*.automigrate` aliases too.

This block already remaps `dagrunstore.postgres.<role>.automigrate`, but it still misses the partially migrated `dag_run_store.postgres.<role>.automigrate` forms. With Viper lowercasing `autoMigrate` to `automigrate`, those configs silently fall back to the default `auto_migrate` value instead of honoring the user's setting.

Suggested patch
"dagrunstore.postgres.agent.pool.connmaxlifetime": "dag_run_store.postgres.agent.pool.conn_max_lifetime", "dagrunstore.postgres.agent.pool.connmaxidletime": "dag_run_store.postgres.agent.pool.conn_max_idle_time", + "dag_run_store.postgres.server.automigrate": "dag_run_store.postgres.server.auto_migrate", + "dag_run_store.postgres.scheduler.automigrate": "dag_run_store.postgres.scheduler.auto_migrate", + "dag_run_store.postgres.agent.automigrate": "dag_run_store.postgres.agent.auto_migrate", "dag_run_store.postgres.server.pool.maxopenconns": "dag_run_store.postgres.server.pool.max_open_conns",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmn/config/key_hints.go` around lines 125 - 154, The automigrate aliases for the partially-migrated keys are missing: add mappings for "dag_run_store.postgres.server.automigrate", "dag_run_store.postgres.scheduler.automigrate", and "dag_run_store.postgres.agent.automigrate" and point each to the canonical snake_case target (e.g. map "dag_run_store.postgres.server.automigrate" -> "dag_run_store.postgres.server.auto_migrate"); mirror the existing "dagrunstore.postgres.*.automigrate" entries so Viper lowercased "automigrate" settings are honored in the dag_run_store namespace (update file key_hints.go, alongside the existing dagrunstore and dag_run_store pool mappings).

internal/cmn/config/config.go (1)
637-645: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Reject `backend=postgres` when every role DSN is empty.

`validateDAGRunStore()` currently accepts an unusable Postgres config and defers the failure until `dagrunstore.New(...)`. That makes a bad config pass `Load()` and fail only at runtime.

Suggested patch
```diff
 func (c *Config) validateDAGRunStore() error {
 	switch c.DAGRunStore.Backend {
 	case "", DAGRunStoreBackendFile:
 		return nil
 	case DAGRunStoreBackendPostgres:
+		if c.DAGRunStore.Postgres.Server.DSN == "" &&
+			c.DAGRunStore.Postgres.Scheduler.DSN == "" &&
+			c.DAGRunStore.Postgres.Agent.DSN == "" {
+			return fmt.Errorf("invalid dag_run_store.postgres: all role DSNs are empty")
+		}
 		return nil
 	default:
 		return fmt.Errorf("invalid dag_run_store.backend: %q", c.DAGRunStore.Backend)
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmn/config/config.go` around lines 637 - 645, validateDAGRunStore() currently accepts DAGRunStoreBackendPostgres even when no Postgres role DSNs are configured; update validateDAGRunStore to, when c.DAGRunStore.Backend == DAGRunStoreBackendPostgres, inspect the Postgres role DSN fields (e.g. ReaderDSN, WriterDSN, MigratorDSN or the equivalent fields on c.DAGRunStore.Postgres) and return an error if all of them are empty (use a clear message like "invalid dag_run_store.backend: postgres selected but no role DSNs configured"); keep the existing behavior for the file backend and other checks intact.

internal/cmd/dagrunstore_role_test.go (1)
16-28: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Test table still missing the `exec` mapping.

The previous review asked for an `exec → RoleAgent` row, and that comment was marked addressed, but the table here still omits it. Since `dagRunStoreRoleForCommand` in `context.go` (line 567) explicitly handles `exec`, this branch remains untested.

🔧 Proposed fix
{name: "start", want: dagrunstore.RoleAgent}, {name: "restart", want: dagrunstore.RoleAgent}, {name: "retry", want: dagrunstore.RoleAgent}, {name: "dry", want: dagrunstore.RoleAgent}, + {name: "exec", want: dagrunstore.RoleAgent}, {name: "worker", want: dagrunstore.RoleAgent},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmd/dagrunstore_role_test.go` around lines 16 - 28, The test table in internal/cmd/dagrunstore_role_test.go is missing the "exec" case so dagRunStoreRoleForCommand in context.go isn't covered; add a test row with name: "exec" and want: dagrunstore.RoleAgent to the existing tests slice (alongside entries like "start" and "worker") so the exec → RoleAgent mapping is exercised by the test suite.

internal/cmd/context.go (1)
932-936: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Cleanup still uses a potentially-canceled context.

`ctx.Close(ctx)` forwards the embedded command `context.Context` into `exec.CloseDAGRunStore`. By the time `RunE` returns (e.g., after SIGINT/SIGTERM or a deadline-driven cancellation), that context can already be `Done`, and any context-aware cleanup (notably the Postgres backend) will short-circuit precisely when resources need to be released. Use a non-cancelable context (or a fresh bounded one derived from `context.Background()`).

🔧 Proposed fix
```diff
-	runErr := runFunc(ctx, args)
-	if closeErr := ctx.Close(ctx); closeErr != nil {
-		return errors.Join(runErr, closeErr)
-	}
-	return runErr
+	runErr := runFunc(ctx, args)
+	closeCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	if closeErr := ctx.Close(closeCtx); closeErr != nil {
+		return errors.Join(runErr, closeErr)
+	}
+	return runErr
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmd/context.go` around lines 932 - 936, The cleanup is calling ctx.Close(ctx) using the possibly-canceled command context; change it to call Close with a fresh non-cancelable (or short bounded) context derived from context.Background() so cleanup (e.g., exec.CloseDAGRunStore) can complete even after the original ctx is Done; locate the call site around runFunc and ctx.Close in RunE and replace passing the embedded command context with a new background or context.WithTimeout(context.Background(), ...) context, adding necessary imports (context and optionally time).
🧹 Nitpick comments (1)
Makefile (1)
129-157: ⚡ Quick win
Split `dev-pg` into smaller targets to reduce maintenance/lint friction.

Line 131 onward packs orchestration, readiness checks, and app startup into one long recipe. Extracting readiness/startup into helper targets (or a small script) will make failures easier to debug and addresses the current checkmake `maxbodylength` warning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 129 - 157, The dev-pg Makefile target packs container orchestration, readiness polling, and app startup into one long recipe; split it into smaller targets (e.g., dev-pg-up to run the docker compose up postgres step, dev-pg-wait to run the pg_isready loop and error handling, and dev-pg-start to set the DAGU_* env vars and run go run ./cmd start-all) and make dev-pg depend on those helpers (dev-pg: ${FE_BUNDLE_JS} dev-pg-up dev-pg-wait dev-pg-start) so each step is isolated, easier to lint, test and debug and the checkmake maxbodylength warning is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@internal/cmd/context.go`:
- Around line 932-936: The cleanup is calling ctx.Close(ctx) using the
possibly-canceled command context; change it to call Close with a fresh
non-cancelable (or short bounded) context derived from context.Background() so
cleanup (e.g., exec.CloseDAGRunStore) can complete even after the original ctx
is Done; locate the call site around runFunc and ctx.Close in RunE and replace
passing the embedded command context with a new background or
context.WithTimeout(context.Background(), ...) context, adding necessary imports
(context and optionally time).
In `@internal/cmd/dagrunstore_role_test.go`:
- Around line 16-28: The test table in internal/cmd/dagrunstore_role_test.go is
missing the "exec" case so dagRunStoreRoleForCommand in context.go isn't
covered; add a test row with name: "exec" and want: dagrunstore.RoleAgent to the
existing tests slice (alongside entries like "start" and "worker") so the exec →
RoleAgent mapping is exercised by the test suite.
In `@internal/cmn/config/config.go`:
- Around line 637-645: validateDAGRunStore() currently accepts
DAGRunStoreBackendPostgres even when no Postgres role DSNs are configured;
update validateDAGRunStore to, when c.DAGRunStore.Backend ==
DAGRunStoreBackendPostgres, inspect the Postgres role DSN fields (e.g.
ReaderDSN, WriterDSN, MigratorDSN or the equivalent fields on
c.DAGRunStore.Postgres) and return an error if all of them are empty (use a
clear message like "invalid dag_run_store.backend: postgres selected but no role
DSNs configured"); keep the existing behavior for the file backend and other
checks intact.
In `@internal/cmn/config/key_hints.go`:
- Around line 125-154: The automigrate aliases for the partially-migrated keys
are missing: add mappings for "dag_run_store.postgres.server.automigrate",
"dag_run_store.postgres.scheduler.automigrate", and
"dag_run_store.postgres.agent.automigrate" and point each to the canonical
snake_case target (e.g. map "dag_run_store.postgres.server.automigrate" ->
"dag_run_store.postgres.server.auto_migrate"); mirror the existing
"dagrunstore.postgres.*.automigrate" entries so Viper lowercased "automigrate"
settings are honored in the dag_run_store namespace (update file key_hints.go,
alongside the existing dagrunstore and dag_run_store pool mappings).
In `@internal/engine/engine.go`:
- Around line 72-87: After creating dagRunStore via dagrunstore.New in the
constructor, add a deferred rollback that closes/releases dagRunStore if
subsequent initialization fails (e.g., when newDAGStore returns an error);
implement this by deferring a cleanup closure right after dagRunStore is
assigned that checks whether the constructor is still failing and calls the
appropriate dagRunStore.Close/Release method, and ensure you cancel/disable that
deferred rollback (or set err=nil / flip a success flag) just before returning
the successfully-constructed Engine so the store is not closed on success.
Reference dagrunstore.New, dagRunStore, newDAGStore and the constructor that
currently defers cleanup only in Engine.Close to locate where to add and cancel
the rollback.
---
Nitpick comments:
In `@Makefile`:
- Around line 129-157: The dev-pg Makefile target packs container orchestration,
readiness polling, and app startup into one long recipe; split it into smaller
targets (e.g., dev-pg-up to run the docker compose up postgres step, dev-pg-wait
to run the pg_isready loop and error handling, and dev-pg-start to set the
DAGU_* env vars and run go run ./cmd start-all) and make dev-pg depend on those
helpers (dev-pg: ${FE_BUNDLE_JS} dev-pg-up dev-pg-wait dev-pg-start) so each
step is isolated, easier to lint, test and debug and the checkmake maxbodylength
warning is resolved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9b84c8f5-a05e-4160-9eb1-2cc3b3757753
📒 Files selected for processing (13)
`Makefile`, `internal/cmd/context.go`, `internal/cmd/dagrunstore_role_test.go`, `internal/cmn/config/config.go`, `internal/cmn/config/definition.go`, `internal/cmn/config/key_hints.go`, `internal/cmn/config/loader.go`, `internal/cmn/config/loader_test.go`, `internal/cmn/schema/config.schema.json`, `internal/engine/engine.go`, `internal/persis/dagrunstore/postgres/retention_query_test.go`, `internal/persis/dagrunstore/store.go`, `internal/persis/dagrunstore/store_test.go`
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/cmn/config/definition.go
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/intg/queue/queue_test.go (1)
120-137: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Mirror the Windows timing slack in this FIFO assertion.

This now uses the portable sleep step, but the assertion still assumes ~1s start-time precision. The same file already widens timing budgets for Windows because persisted queue timestamps are second-granularity there; keeping `time.Second` plus `900*time.Millisecond` here can still flake on Windows or race builds when two sequential starts round into the same second. Reusing the widened duration/threshold logic from `TestGlobalConcurrency` would make this check much steadier.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/intg/queue/queue_test.go` around lines 120 - 137, TestLocalQueueFIFOProcessing currently asserts sequential starts with a hardcoded 900ms threshold which can flake on Windows; update the test to mirror the widened timing/slack logic used in TestGlobalConcurrency by introducing and reusing the same sleep duration and threshold variables (the ones passed into directSleepStepYAML and used to compute the allowed inter-start interval) instead of the literal 900*time.Millisecond, and use that threshold in the require.GreaterOrEqual assertion; reference TestLocalQueueFIFOProcessing, directSleepStepYAML, newFixture, WaitDrain, collectStartTimes and copy the duration/threshold calculation from TestGlobalConcurrency so the timing budget matches across tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/cmd/context.go`:
- Around line 361-364: NewContext eagerly calls dagrunstore.New which forces DB
access for every local/static command and duplicates stores when NewScheduler
also opens its own; change this by deferring or gating dagrunstore
initialization: detect whether the current command path requires runtime DAG-run
persistence (e.g., scheduler/commands that operate on DAG-run state) and only
call dagrunstore.New in those code paths, or convert the store into a
lazily-initialized singleton accessed via a getter that creates the store on
first use; also update NewScheduler to accept an optional store (or use the
shared lazy getter) so it does not open a second pool.
In `@internal/cmn/config/config.go`:
- Around line 637-647: The validateDAGRunStore function currently only checks
for empty DSNs but does not validate the Postgres pool subfields; add validation
of DAGRunStore.Postgres.Server.Pool, .Scheduler.Pool and .Agent.Pool by reusing
a shared helper (e.g. validatePostgresPool or validatePoolConfig) that enforces
allowed ranges/types (non-negative/max open/conns >= 0, min idle >= 0 and <=
max, connection max lifetime/dial timeout > 0 where required, etc.); call that
helper from validateDAGRunStore for each non-nil role pool and ensure the same
helper is used by worker.postgres_pool validation to keep behavior consistent (a
hedged sketch of such a helper appears right after this prompt block).
In `@Makefile`:
- Around line 147-156: The Makefile currently hardcodes a 30-second retry loop
for PostgreSQL readiness; make it configurable by adding variables (e.g.,
DEV_PG_WAIT_RETRIES with a default of 30 and DEV_PG_WAIT_SLEEP with a default of
1) and replace the literal 30 and sleep 1 in the loop with those variables (use
$$DEV_PG_WAIT_RETRIES and $$DEV_PG_WAIT_SLEEP in the shell block) so callers can
override them from the environment or make command line; ensure the final check
uses the configured retries (if [ $$i -ge $$DEV_PG_WAIT_RETRIES ]) and preserve
existing logging using COLOR_RED and COLOR_RESET.
---
Outside diff comments:
In `@internal/intg/queue/queue_test.go`:
- Around line 120-137: TestLocalQueueFIFOProcessing currently asserts sequential
starts with a hardcoded 900ms threshold which can flake on Windows; update the
test to mirror the widened timing/slack logic used in TestGlobalConcurrency by
introducing and reusing the same sleep duration and threshold variables (the
ones passed into directSleepStepYAML and used to compute the allowed inter-start
interval) instead of the literal 900*time.Millisecond, and use that threshold in
the require.GreaterOrEqual assertion; reference TestLocalQueueFIFOProcessing,
directSleepStepYAML, newFixture, WaitDrain, collectStartTimes and copy the
duration/threshold calculation from TestGlobalConcurrency so the timing budget
matches across tests.
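For the `@internal/cmn/config/config.go` item above, here is a minimal sketch of a shared pool-validation helper. The `PostgresPool` fields and the `validatePostgresPool` name are assumptions about the config shape, not the repository's actual types.

```go
package config

import (
	"fmt"
	"time"
)

// PostgresPool mirrors the pool knobs described in the review; the field
// names here are illustrative.
type PostgresPool struct {
	MaxOpenConns    int
	MinIdleConns    int
	ConnMaxLifetime time.Duration
}

// validatePostgresPool enforces the same bounds for every role's pool
// (server, scheduler, agent, worker).
func validatePostgresPool(role string, p *PostgresPool) error {
	if p == nil {
		return nil // role inherits defaults
	}
	if p.MaxOpenConns < 0 || p.MinIdleConns < 0 {
		return fmt.Errorf("%s pool: connection counts must be non-negative", role)
	}
	if p.MaxOpenConns > 0 && p.MinIdleConns > p.MaxOpenConns {
		return fmt.Errorf("%s pool: min idle conns (%d) exceed max open conns (%d)",
			role, p.MinIdleConns, p.MaxOpenConns)
	}
	if p.ConnMaxLifetime < 0 {
		return fmt.Errorf("%s pool: connection max lifetime must not be negative", role)
	}
	return nil
}
```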
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 519c83aa-b563-4495-9758-ff4916d6b490
📒 Files selected for processing (12)
Makefile, internal/cmd/context.go, internal/cmd/dagrunstore_role_test.go, internal/cmd/validate.go, internal/cmn/config/config.go, internal/cmn/config/key_hints.go, internal/cmn/config/loader_test.go, internal/engine/engine.go, internal/intg/queue/queue_test.go, internal/persis/dagrunstore/postgres/db/attempts.sql.go, internal/persis/dagrunstore/postgres/queries/attempts.sql, internal/persis/dagrunstore/postgres/retention_query_test.go
@coderabbitai review
✅ Actions performed: Review triggered.
🧹 Nitpick comments (1)
Makefile (1)
131-165: 💤 Low value
Consider adding a `dev-pg-down` companion target.
The new `dev-pg*` targets bring the Postgres container up but never tear it down. `make dev-pg` runs `go run ./cmd start-all` in the foreground; on Ctrl-C the container is left running, and there is no documented `make` entrypoint to stop/clean it. Adding a small symmetric target would improve developer ergonomics without affecting the start path.
♻️ Suggested addition
```diff
 .PHONY: dev-pg dev-pg-up dev-pg-wait dev-pg-start
 dev-pg: dev-pg-start
+
+.PHONY: dev-pg-down
+dev-pg-down:
+	@printf '%b\n' "${COLOR_GREEN}Stopping PostgreSQL DAG-run store dev environment...${COLOR_RESET}"
+	@docker compose -p ${DEV_PG_COMPOSE_PROJECT} -f ${DEV_PG_COMPOSE_FILE} down
```
If preserving the data volume across runs is desired, leave `down` without `-v`; otherwise document `-v` for a full reset.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 131 - 165, Add a symmetric "dev-pg-down" Makefile target to stop and remove the PostgreSQL container started by dev-pg/dev-pg-up: create a PHONY dev-pg-down target that runs docker compose -p ${DEV_PG_COMPOSE_PROJECT} -f ${DEV_PG_COMPOSE_FILE} down (optionally with -v if you want to remove volumes) and prints a clear message; reference the existing dev-pg, dev-pg-up, dev-pg-start, and dev-pg-wait targets so maintainers can call it from docs or CI and choose whether to preserve volumes by default or document using -v for full reset.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@Makefile`:
- Around line 131-165: Add a symmetric "dev-pg-down" Makefile target to stop and
remove the PostgreSQL container started by dev-pg/dev-pg-up: create a PHONY
dev-pg-down target that runs docker compose -p ${DEV_PG_COMPOSE_PROJECT} -f
${DEV_PG_COMPOSE_FILE} down (optionally with -v if you want to remove volumes)
and prints a clear message; reference the existing dev-pg, dev-pg-up,
dev-pg-start, and dev-pg-wait targets so maintainers can call it from docs or CI
and choose whether to preserve volumes by default or document using -v for full
reset.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5233d196-0485-4eea-b62a-b6b76163acd2
📒 Files selected for processing (6)
Makefile, internal/cmd/context.go, internal/cmd/context_hardening_test.go, internal/cmn/config/config.go, internal/cmn/config/loader.go, internal/cmn/config/loader_test.go
👮 Files not reviewed due to content moderation or server errors (5)
- internal/cmn/config/loader.go
- internal/cmd/context.go
- internal/cmd/context_hardening_test.go
- internal/cmn/config/loader_test.go
- internal/cmn/config/config.go
@coderabbitai review
✅ Actions performed: Review triggered.
Add an opt-in PostgreSQL control-plane store with goose migrations, sqlc queries, UUIDv7 identifiers, role-specific connection settings, and PostgreSQL-backed DAG runs, queues, distributed state, auth, audit, workspace, session, webhook, and event stores. Keep file storage as the default, require explicit agent direct_access for local development, add dev-pg wiring with postgres:18, and add a separate PostgreSQL integration CI job.
@coderabbitai review
✅ Actions performed: Review triggered.
1 similar comment
✅ Actions performed: Review triggered.
Actionable comments posted: 17
🧹 Nitpick comments (5)
internal/runtime/builtin/redis/lock.go (2)
37-43: ⚖️ Poor tradeoff
`uuid.Must` panics at the `NewLockManager` call-site on entropy failure. This is acceptable at initialization time (entropy exhaustion is genuinely unrecoverable), but the caller gets no chance to handle it. Consider returning `(*LockManager, error)` if the rest of the codebase prefers explicit error propagation at construction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/runtime/builtin/redis/lock.go` around lines 37 - 43, NewLockManager currently uses uuid.Must(uuid.NewV7()) which will panic on entropy failure; change NewLockManager signature to return (*LockManager, error), call uuid.NewV7() directly, check the returned error, and propagate it (return nil, err) instead of panicking; on success construct the LockManager with lockVal set from the generated UUID and return (manager, nil), and update callers to handle the error from NewLockManager.
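A minimal sketch of the error-returning constructor, assuming the go-redis v9 client; the `LockManager` fields and package name below are stand-ins for the file's real layout.

```go
package redislock

import (
	"fmt"

	"github.com/google/uuid"
	goredis "github.com/redis/go-redis/v9"
)

// LockManager stub mirroring the struct described above; real fields may differ.
type LockManager struct {
	client  *goredis.Client
	lockKey string
	lockVal string
}

// NewLockManager seeds the lock value with a fresh UUIDv7 and propagates
// entropy failures to the caller instead of panicking.
func NewLockManager(client *goredis.Client, lockKey string) (*LockManager, error) {
	val, err := uuid.NewV7()
	if err != nil {
		return nil, fmt.Errorf("generate lock value: %w", err)
	}
	return &LockManager{
		client:  client,
		lockKey: lockKey,
		lockVal: val.String(),
	}, nil
}
```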
126-146: ⚖️ Poor tradeoff
`Extend` has a non-atomic TOCTOU gap between the ownership check and `Expire`.
`Get` + `Expire` are two separate Redis calls. If the lock expires between them, another process can acquire it and then `Expire` will silently extend the new holder's lock. This is a pre-existing issue, but worth flagging since `Extend` is the primary lease-renewal path. A Lua script (similar to `unlockScript`) can make this atomic:
```lua
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("PEXPIRE", KEYS[1], ARGV[2])
else
  return 0
end
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/runtime/builtin/redis/lock.go` around lines 126 - 146, Extend has a TOCTOU between Get and Expire allowing another holder to be extended; replace the two-call sequence in LockManager.Extend with an atomic Redis EVAL using a Lua script (like the suggested snippet) that checks GET KEYS[1] == ARGV[1] and then PEXPIRE KEYS[1] with ARGV[2], invoke it via m.client.Eval with KEYS = []string{m.lockKey} and ARGV = {m.lockVal, duration in milliseconds}, and treat a non-zero return as success and zero as "lock not owned" (return an error); you can store the script as extendScript (analogous to unlockScript) for reuse.
internal/persis/controlplanestore/postgres/cursor.go (1)
136-141: ⚡ Quick win
Avoid `panic` in `queryFilterHash`; propagate an error instead.
`json.Marshal` on a struct of plain Go types cannot fail in practice, but using `panic` to signal the impossible case means any future change to `normalized`'s type set (e.g. adding a map or interface field that can contain a cycle) silently turns a data error into a goroutine crash. Changing the function to return `(string, error)` and threading that error through both callers (`encodeQueryCursor` and `decodeQueryCursor`) is a small, localised change.
♻️ Proposed change
```diff
-func queryFilterHash(opts exec.ListDAGRunStatusesOptions) string {
+func queryFilterHash(opts exec.ListDAGRunStatusesOptions) (string, error) {
 	...
 	data, err := json.Marshal(normalized)
 	if err != nil {
-		panic(fmt.Errorf("marshal query cursor filter: %w", err))
+		return "", fmt.Errorf("marshal query cursor filter: %w", err)
 	}
 	sum := sha256.Sum256(data)
-	return hex.EncodeToString(sum[:])
+	return hex.EncodeToString(sum[:]), nil
 }
```
Then in `encodeQueryCursor` and `decodeQueryCursor`, propagate the returned error.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/controlplanestore/postgres/cursor.go` around lines 136 - 141, queryFilterHash currently panics on json.Marshal errors; change its signature to return (string, error) and return an error instead of calling panic when json.Marshal fails, then update both encodeQueryCursor and decodeQueryCursor to accept and propagate that error (i.e., call queryFilterHash, check the returned error, and return or wrap it as appropriate) so serialization failures are handled without crashing the goroutine.
internal/persis/controlplanestore/postgres/queue.go (1)
217-223: 💤 Low value
`QueueWatcher` ignores its context argument.
`s.QueueWatcher(ctx)` accepts `context.Context` but discards it; the goroutine started in `Start(ctx)` only observes the Start-time context. If callers expect the watcher to also be torn down when the factory context is cancelled (a common pattern for store-scoped lifecycles), this silently violates that contract. If the discard is intentional — i.e., the watcher's lifetime is solely owned by `Start`/`Stop` — consider naming the parameter `_ context.Context` to make the contract explicit, or document it on the method.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/persis/controlplanestore/postgres/queue.go` around lines 217 - 223, The QueueWatcher function currently discards its context parameter; update QueueWatcher(ctx) to store the incoming context on the returned postgresQueueWatcher (e.g., add a parentCtx/context field to postgresQueueWatcher) and then have the watcher’s Start method observe both the factory parentCtx and the Start-time ctx (for example by selecting on parentCtx.Done() and startCtx.Done(), or by deriving a combined context via context.WithCancel that is cancelled when either parentCtx or startCtx is done); alternatively, if the watcher lifetime is intentionally only owned by Start/Stop, rename the parameter to _ context.Context and document that behavior — change the QueueWatcher, postgresQueueWatcher struct, and Start/Stop usage accordingly so the factory context is not silently ignored.
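A hedged sketch of the combined-context option; the `postgresQueueWatcher` fields and the `poll` loop below are stand-ins for the real implementation, not its actual shape.

```go
package postgres

import (
	"context"
	"time"
)

// postgresQueueWatcher stub: parentCtx is the context captured by QueueWatcher(ctx).
type postgresQueueWatcher struct {
	parentCtx context.Context
	cancel    context.CancelFunc
}

// Start derives a context that ends when either the factory (parent)
// context or the Start-time context is cancelled.
func (w *postgresQueueWatcher) Start(startCtx context.Context) {
	ctx, cancel := context.WithCancel(startCtx)
	w.cancel = cancel
	go func() {
		select {
		case <-w.parentCtx.Done():
			cancel() // factory teardown stops the watcher too
		case <-ctx.Done():
		}
	}()
	go w.poll(ctx)
}

func (w *postgresQueueWatcher) Stop() {
	if w.cancel != nil {
		w.cancel()
	}
}

func (w *postgresQueueWatcher) poll(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Queue polling elided.
		}
	}
}
```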
internal/cmd/context.go (1)
287-298: ⚡ Quick win
Avoid building a file event store when it will be replaced by the control-plane store.
`fileeventstore.New(cfg.Paths.EventStoreDir)` runs first whenever `cfg.EventStore.Enabled` is true (Lines 287-298), creating directories and possibly opening files. Then, once the control-plane store is constructed (Lines 391-397), `eventSvc` is replaced with `eventstore.New(controlStore.Events())` and the file-backed instance is dropped without any cleanup. For postgres backends this is wasted I/O on every command startup, and there is no `Close` on the discarded service to release whatever the file collector held. Consider deferring the file-store branch until after `controlStore` is resolved, or skipping it entirely when `shouldOpenContextControlPlaneStore(cmd, cfg)` will produce a control-plane-backed event store.
Also applies to: 391-397
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cmd/context.go` around lines 287 - 298, The code currently constructs a file-backed event store unconditionally when cfg.EventStore.Enabled is true, then later replaces it with a control-plane-backed store; avoid that wasted I/O by first checking whether a control-plane store will be used via shouldOpenContextControlPlaneStore(cmd, cfg) and only call fileeventstore.New(cfg.Paths.EventStoreDir) and set eventSvc when that check is false. Alternatively, move the file-store branch to after the controlStore resolution and only create the file-backed store if controlStore is not being opened; if you must construct a temporary file-backed store and later replace it, ensure the discarded resource is closed (add a Close call on the file store or eventSvc before replacement). Use the symbols fileeventstore.New, cfg.EventStore.Enabled, shouldOpenContextControlPlaneStore, controlStore, and eventstore.New(controlStore.Events()) to locate and change the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/agent/api.go`:
- Line 1104: The current use of uuid.Must(uuid.NewV7()) (e.g., the id assignment
at the shown location and the similar usages at the other occurrences around
lines 1164 and 1279) panics instead of returning errors; replace these with
non-panicking generation by calling uuid.NewV7(), checking the returned error,
and propagating or wrapping that error via the function's existing error return
(or returning a descriptive error) where id is assigned, so the caller can
handle failures; alternatively, if the team prefers to keep uuid.Must(),
document and apply that decision consistently across all production UUID usages
rather than changing only these three sites.
In `@internal/cmd/context.go`:
- Around line 862-913: The newly-created schedulerControlPlaneStore from
controlplanestore.New is never closed; add ownership tracking and proper cleanup
similar to schedulerRunStore: set a bool (e.g., ownsSchedulerControlPlaneStore)
to true when you create schedulerControlPlaneStore in the else branch, add a
defer that calls schedulerControlPlaneStore.Close() (or the appropriate close
method) if ownsSchedulerControlPlaneStore is true so it will be closed on early
returns (including on scheduler.New error), and ensure you clear the flag (set
to false) only when ownership is intentionally transferred to the longer-lived
shutdown path (e.g., register it so Context.Close or the scheduler shutdown will
call schedulerControlPlaneStore.Close()); reference schedulerControlPlaneStore,
controlplanestore.New, ownsSchedulerRunStore/exec.CloseDAGRunStore,
scheduler.New, and Context.Close when making these changes.
In `@internal/cmn/schema/config.schema.json`:
- Around line 1011-1031: The schema currently allows
ControlPlaneStorePostgresAgentDef objects with direct_access: true but no dsn;
update the ControlPlaneStorePostgresAgentDef definition to enforce that when
"direct_access" is true the "dsn" property is required by adding a JSON Schema
conditional (an "if" that checks properties.direct_access.const true and a
matching "then" that adds required: ["dsn"]) so validation fails unless dsn is
present; reference the ControlPlaneStorePostgresAgentDef object and its
properties "direct_access" and "dsn" when making the change.
In `@internal/persis/controlplanestore/postgres/audit_store.go`:
- Around line 95-96: Clamp the pagination integers before converting to int32:
ensure `limit` is bounded (e.g., min 0, max 500 or math.MaxInt32 if you want a
hard upper clamp) and ensure `offset` is non-negative and not greater than
math.MaxInt32, then cast to int32 when assigning `RowLimit` and `RowOffset` in
the code that builds the query (the fields named RowLimit and RowOffset in
audit_store.go). Replace the direct casts int32(limit)/int32(offset) with a
small helper or inline clamping logic that applies bounds and then casts to
int32 to prevent overflow/negative values from internal callers.
In `@internal/persis/controlplanestore/postgres/auth_store.go`:
- Around line 412-424: Update currently skips the same DAG name and token-hash
validation done in Create, allowing invalid/empty values to be persisted; modify
webhookStore.Update (around parseUUIDv7, updateWebhookParams, and before
s.store.queries.UpdateWebhook) to run the same validation logic used by Create:
validate the webhook.DagName with the same validate function (or logic) used in
Create and reject invalid DAG names, and ensure webhook.TokenHash is non-empty
(return the same error values Create returns). Keep the checks before calling
updateWebhookParams and queries.UpdateWebhook so invalid updates are rejected
early.
- Around line 155-189: The Update method can write a user with only one of
OIDCIssuer / OIDCSubject set, breaking the invariant used by GetByOIDCIdentity;
before marshalling or calling s.store.queries.UpdateUser in userStore.Update,
validate that either both user.OIDCIssuer and user.OIDCSubject are empty or both
are non-empty, and if that check fails return a suitable error (e.g.
auth.ErrInvalidOIDCIdentity or another existing auth error) so Update rejects
partial OIDC identity updates just like Create does.
In `@internal/persis/controlplanestore/postgres/dagrun_query.go`:
- Around line 90-108: latestRootAttempt currently treats any non-nil error from
s.queries.FindRootRun as a "not found" (exec.ErrDAGRunIDNotFound), masking real
DB errors; change the logic in latestRootAttempt so that after LatestRootAttempt
returns pgx.ErrNoRows you call FindRootRun and: if anyErr == nil return
exec.ErrNoStatusData, else if errors.Is(anyErr, pgx.ErrNoRows) return
exec.ErrDAGRunIDNotFound, else return the actual anyErr (propagate the real DB
error). Ensure you use errors.Is and reference LatestRootAttempt, FindRootRun,
exec.ErrNoStatusData and exec.ErrDAGRunIDNotFound so callers and alerts receive
real DB errors instead of a false "missing run".
In `@internal/persis/controlplanestore/postgres/distributed.go`:
- Around line 411-430: The JSON blob is marshaled before the synthesized root is
written back to the record, so the DB row columns get the defaulted root but
row.Data keeps an empty Root; modify the block in UpsertActiveDistributedRun so
that after you compute root (the variable root and the zero check) you assign
record.Root = root (or a shallow copy) before calling json.Marshal(record), then
proceed with the UpsertActiveDistributedRun call so the serialized Data contains
the persisted defaulted root.
- Around line 325-349: The code normalizes root into the local variable root but
still marshals the original lease (so Data keeps a zero Root); before calling
json.Marshal(lease) assign the normalized root back into the lease struct (e.g.,
lease.Root = root) so the serialized Data matches the typed columns used in
UpsertDAGRunLease and reads observe the fallback Root.DAGRun value.
In `@internal/persis/controlplanestore/postgres/queries/attempts.sql`:
- Around line 39-46: The FindSubRun SQL query currently omits dag_name which
allows collisions when different child DAGs share the same dag_run_id; update
the query (named FindSubRun) to add "AND dag_name = sqlc.arg(dag_name)" to the
WHERE clause so the lookup keys are (root_dag_name, root_dag_run_id, dag_name,
dag_run_id), then regenerate sqlc bindings and update the call site
createSubAttempt to pass the new parameter (e.g., DagName: dag.Name) so the
newly created attempt is attached to the correct sub-run.
In `@internal/persis/controlplanestore/postgres/queries/auth.sql`:
- Around line 127-132: The JSONB denormalization keys are inconsistent:
UpdateAPIKeyLastUsed writes '{last_used_at}' while UpdateWebhookLastUsed uses
'{lastUsedAt}'; update the webhook-related SQL (functions/queries named
UpdateWebhookLastUsed and any other occurrences around the referenced block(s))
to use the snake_case key '{last_used_at}' (and likewise change any other
occurrences at lines ~200-205) so the JSONB key matches UpdateAPIKeyLastUsed and
the Go struct JSON tags; ensure jsonb_set/to_jsonb calls and any JSON-path
expressions use the unified '{last_used_at}' key.
In `@internal/persis/controlplanestore/postgres/queries/session.sql`:
- Around line 31-41: The two SQLC queries ListAgentSessionsByUser and
ListAgentSubSessions currently return unbounded result sets; add a hard safety
cap (e.g. append "LIMIT sqlc.arg(limit)" and validate/clip the incoming limit to
a safe maximum such as 1000) and update the store API to accept pagination
parameters (limit and optional offset) so session_store.go callers can page
results; ensure the store implementation enforces a maxLimit constant (1000)
before passing to the SQL queries to prevent OOMs even if callers request larger
values.
In `@internal/persis/controlplanestore/postgres/registry.go`:
- Around line 89-97: The loop currently ignores errors from
s.queries.DeleteServiceInstance after canceling reg.cancel and waiting reg.wg,
so deletions can silently fail (especially if ctx is canceled). Change the
delete call to run with a fresh bounded context (e.g., use context.Background()
or context.WithTimeout when ctx is already canceled) and capture the returned
error instead of discarding it; if deletion fails log the error (or retry once)
including reg.serviceName and reg.hostInfo.ID so the failure is observable and
not left to serviceStaleTimeout. Ensure you update the code around
registrations, reg.cancel, reg.wg.Wait and s.queries.DeleteServiceInstance to
use the new ctx and handle/log the error.
In `@internal/persis/controlplanestore/postgres/session_store.go`:
- Around line 187-241: The AddMessage transaction reads the session with
q.GetAgentSession without a row-level lock, allowing lost-update races; change
the sqlc query to add FOR UPDATE (e.g., create "GetAgentSessionForUpdate :one"
selecting FROM dagu_agent_sessions ... FOR UPDATE) and replace the call to
q.GetAgentSession with q.GetAgentSessionForUpdate inside the withTx block so
concurrent AddMessage calls serialize updates to Title and UpdatedAt; ensure the
new query is used in the same transaction context (withTx) and update any error
handling references accordingly.
- Around line 170-177: Currently the code silently replaces any invalid msg.ID
by calling ensureUUIDv7String and falling back to newUUIDv7String; change this
so you only auto-generate a new UUID when msg.ID is empty and otherwise validate
the provided ID: call ensureUUIDv7String only to validate a non-empty msg.ID and
return an explicit error if it is invalid, and call newUUIDv7String only when
msg.ID == "" to populate msg.ID; update the block around
ensureUUIDv7String/newUUIDv7String and the assignment to msg.ID accordingly.
In `@internal/persis/filedistributed/dispatch_task_store.go`:
- Line 81: The code currently uses uuid.Must(uuid.NewV7()) which panics on
failure; change both places (in Enqueue where fileName :=
fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(),
uuid.Must(uuid.NewV7()).String()) and in ClaimNext where uuid.Must is used) to
call uuid.NewV7() directly, check the returned error, and propagate a wrapped
error (return fmt.Errorf(...): include context like "generating task UUID" or
"generating claim UUID") instead of panicking; update the fileName construction
to use the validated uuid string and ensure both Enqueue and ClaimNext return
that error to the caller.
In `@internal/service/oidcprovision/service.go`:
- Around line 159-161: Replace the panic-prone uuid.Must call in the request
path by calling uuid.NewV7() directly, checking its returned error, and
returning a graceful error response from the OIDC provisioning handler instead
of allowing a panic; update the code that constructs the auth.User (where ID is
set) to use the validated UUID string only when err == nil, and propagate a
suitable error/HTTP response from the function that creates the user when
uuid.NewV7() fails.
---
Nitpick comments:
In `@internal/cmd/context.go`:
- Around line 287-298: The code currently constructs a file-backed event store
unconditionally when cfg.EventStore.Enabled is true, then later replaces it with
a control-plane-backed store; avoid that wasted I/O by first checking whether a
control-plane store will be used via shouldOpenContextControlPlaneStore(cmd,
cfg) and only call fileeventstore.New(cfg.Paths.EventStoreDir) and set eventSvc
when that check is false. Alternatively, move the file-store branch to after the
controlStore resolution and only create the file-backed store if controlStore is
not being opened; if you must construct a temporary file-backed store and later
replace it, ensure the discarded resource is closed (add a Close call on the
file store or eventSvc before replacement). Use the symbols fileeventstore.New,
cfg.EventStore.Enabled, shouldOpenContextControlPlaneStore, controlStore, and
eventstore.New(controlStore.Events()) to locate and change the logic.
In `@internal/persis/controlplanestore/postgres/cursor.go`:
- Around line 136-141: queryFilterHash currently panics on json.Marshal errors;
change its signature to return (string, error) and return an error instead of
calling panic when json.Marshal fails, then update both encodeQueryCursor and
decodeQueryCursor to accept and propagate that error (i.e., call
queryFilterHash, check the returned error, and return or wrap it as appropriate)
so serialization failures are handled without crashing the goroutine.
In `@internal/persis/controlplanestore/postgres/queue.go`:
- Around line 217-223: The QueueWatcher function currently discards its context
parameter; update QueueWatcher(ctx) to store the incoming context on the
returned postgresQueueWatcher (e.g., add a parentCtx/context field to
postgresQueueWatcher) and then have the watcher’s Start method observe both the
factory parentCtx and the Start-time ctx (for example by selecting on
parentCtx.Done() and startCtx.Done(), or by deriving a combined context via
context.WithCancel that is cancelled when either parentCtx or startCtx is done);
alternatively, if the watcher lifetime is intentionally only owned by
Start/Stop, rename the parameter to _ context.Context and document that behavior
— change the QueueWatcher, postgresQueueWatcher struct, and Start/Stop usage
accordingly so the factory context is not silently ignored.
In `@internal/runtime/builtin/redis/lock.go`:
- Around line 37-43: NewLockManager currently uses uuid.Must(uuid.NewV7()) which
will panic on entropy failure; change NewLockManager signature to return
(*LockManager, error), call uuid.NewV7() directly, check the returned error, and
propagate it (return nil, err) instead of panicking; on success construct the
LockManager with lockVal set from the generated UUID and return (manager, nil),
and update callers to handle the error from NewLockManager.
- Around line 126-146: Extend has a TOCTOU between Get and Expire allowing
another holder to be extended; replace the two-call sequence in
LockManager.Extend with an atomic Redis EVAL using a Lua script (like the
suggested snippet) that checks GET KEYS[1] == ARGV[1] and then PEXPIRE KEYS[1]
with ARGV[2], invoke it via m.client.Eval with KEYS = []string{m.lockKey} and
ARGV = {m.lockVal, duration in milliseconds}, and treat a non-zero return as
success and zero as "lock not owned" (return an error); you can store the script
as extendScript (analogous to unlockScript) for reuse.
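To make the `Extend` item above concrete, here is a hedged sketch using go-redis v9's `Script` helper (assuming that is the client library in use); the package name, struct stub, and method signature are illustrative.

```go
package redislock

import (
	"context"
	"errors"
	"fmt"
	"time"

	goredis "github.com/redis/go-redis/v9"
)

type LockManager struct {
	client  *goredis.Client
	lockKey string
	lockVal string
}

// extendScript atomically re-arms the TTL only while we still own the lock,
// mirroring the unlockScript pattern.
var extendScript = goredis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("PEXPIRE", KEYS[1], ARGV[2])
else
  return 0
end`)

// Extend renews the lease atomically, returning an error if the lock is
// no longer owned by this manager.
func (m *LockManager) Extend(ctx context.Context, ttl time.Duration) error {
	owned, err := extendScript.Run(ctx, m.client, []string{m.lockKey},
		m.lockVal, ttl.Milliseconds()).Int64()
	if err != nil {
		return fmt.Errorf("extend lock: %w", err)
	}
	if owned == 0 {
		return errors.New("lock not owned; refusing to extend")
	}
	return nil
}
```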
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2c048cca-d01c-4c47-962b-c1aee6958136
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (93)
.github/workflows/ci.yaml, .golangci.yml, Makefile, go.mod, internal/agent/api.go, internal/agent/approval.go, internal/agent/ask_user.go, internal/agent/delegate.go, internal/agent/inputspill.go, internal/agent/remote_agent.go, internal/agent/session.go, internal/auth/apikey.go, internal/auth/user.go, internal/auth/webhook.go, internal/cmd/context.go, internal/cmd/context_hardening_test.go, internal/cmd/dagrunstore_role_test.go, internal/cmd/start.go, internal/cmd/validate.go, internal/cmn/config/config.go, internal/cmn/config/definition.go, internal/cmn/config/key_hints.go, internal/cmn/config/loader.go, internal/cmn/config/loader_test.go, internal/cmn/schema/config.schema.json, internal/core/exec/queue.go, internal/engine/engine.go, internal/persis/controlplanestore/file.go, internal/persis/controlplanestore/postgres/attempt.go, internal/persis/controlplanestore/postgres/attempt_test.go, internal/persis/controlplanestore/postgres/audit_store.go, internal/persis/controlplanestore/postgres/auth_store.go, internal/persis/controlplanestore/postgres/cursor.go, internal/persis/controlplanestore/postgres/cursor_test.go, internal/persis/controlplanestore/postgres/dagrun_convert.go, internal/persis/controlplanestore/postgres/dagrun_create.go, internal/persis/controlplanestore/postgres/dagrun_list.go, internal/persis/controlplanestore/postgres/dagrun_query.go, internal/persis/controlplanestore/postgres/dagrun_retention.go, internal/persis/controlplanestore/postgres/db/attempts.sql.go, internal/persis/controlplanestore/postgres/db/audit.sql.go, internal/persis/controlplanestore/postgres/db/auth.sql.go, internal/persis/controlplanestore/postgres/db/db.go, internal/persis/controlplanestore/postgres/db/distributed.sql.go, internal/persis/controlplanestore/postgres/db/event.sql.go, internal/persis/controlplanestore/postgres/db/models.go, internal/persis/controlplanestore/postgres/db/queue.sql.go, internal/persis/controlplanestore/postgres/db/registry.sql.go, internal/persis/controlplanestore/postgres/db/session.sql.go, internal/persis/controlplanestore/postgres/db/workspace.sql.go, internal/persis/controlplanestore/postgres/distributed.go, internal/persis/controlplanestore/postgres/event_store.go, internal/persis/controlplanestore/postgres/id.go, internal/persis/controlplanestore/postgres/integration_test.go, internal/persis/controlplanestore/postgres/migrate.go, internal/persis/controlplanestore/postgres/migration_test.go, internal/persis/controlplanestore/postgres/migrations/20260506000000_create_control_plane_store.sql, internal/persis/controlplanestore/postgres/migrations/embed.go, internal/persis/controlplanestore/postgres/queries/attempts.sql, internal/persis/controlplanestore/postgres/queries/audit.sql, internal/persis/controlplanestore/postgres/queries/auth.sql, internal/persis/controlplanestore/postgres/queries/distributed.sql, internal/persis/controlplanestore/postgres/queries/event.sql, internal/persis/controlplanestore/postgres/queries/queue.sql, internal/persis/controlplanestore/postgres/queries/registry.sql, internal/persis/controlplanestore/postgres/queries/session.sql, internal/persis/controlplanestore/postgres/queries/workspace.sql, internal/persis/controlplanestore/postgres/queue.go, internal/persis/controlplanestore/postgres/registry.go, internal/persis/controlplanestore/postgres/retention_query_test.go, internal/persis/controlplanestore/postgres/session_store.go, internal/persis/controlplanestore/postgres/store.go, internal/persis/controlplanestore/postgres/workspace_store.go, internal/persis/controlplanestore/store.go, internal/persis/controlplanestore/store_test.go, internal/persis/dagrunstore/store.go, internal/persis/dagrunstore/store_test.go, internal/persis/filedistributed/dispatch_task_store.go, internal/persis/fileeventstore/store.go, internal/remotenode/remotenode.go, internal/runtime/builtin/redis/lock.go, internal/service/audit/entry.go, internal/service/coordinator/handler.go, internal/service/frontend/api/v1/dagruns.go, internal/service/frontend/server.go, internal/service/frontend/server_test.go, internal/service/frontend/sse/multiplex.go, internal/service/frontend/terminal/connection.go, internal/service/oidcprovision/service.go, internal/service/scheduler/queue_processor.go, internal/service/worker/poller.go, internal/workspace/workspace.go, sqlc.yaml
✅ Files skipped from review due to trivial changes (19)
- internal/service/frontend/sse/multiplex.go
- .golangci.yml
- internal/service/frontend/terminal/connection.go
- internal/persis/controlplanestore/postgres/queries/workspace.sql
- internal/service/coordinator/handler.go
- internal/persis/controlplanestore/postgres/db/registry.sql.go
- internal/persis/controlplanestore/postgres/migrations/embed.go
- internal/persis/controlplanestore/postgres/db/workspace.sql.go
- internal/persis/controlplanestore/postgres/db/event.sql.go
- internal/persis/controlplanestore/postgres/queries/distributed.sql
- internal/persis/controlplanestore/postgres/db/db.go
- internal/persis/controlplanestore/postgres/queries/queue.sql
- internal/persis/controlplanestore/postgres/db/queue.sql.go
- internal/persis/controlplanestore/postgres/attempt.go
- internal/persis/controlplanestore/postgres/db/session.sql.go
- internal/persis/controlplanestore/postgres/db/distributed.sql.go
- internal/persis/controlplanestore/postgres/db/models.go
- internal/persis/controlplanestore/postgres/db/attempts.sql.go
- internal/persis/controlplanestore/postgres/db/auth.sql.go
🚧 Files skipped from review as they are similar to previous changes (8)
- internal/service/frontend/api/v1/dagruns.go
- internal/cmd/start.go
- internal/engine/engine.go
- internal/cmn/config/definition.go
- internal/cmd/validate.go
- internal/cmn/config/loader.go
- go.mod
- internal/cmn/config/config.go
```diff
 	}
 } else {
-	id = uuid.New().String()
+	id = uuid.Must(uuid.NewV7()).String()
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Expectation:
# - No runtime/session creation paths should use panic-based UUID creation.
# - `uuid.NewV7()` should be used with explicit error handling.
rg -nP 'uuid\.Must\(\s*uuid\.NewV7\(\)\s*\)' --type go -C2
rg -nP 'uuid\.NewV7\(\)' --type go -C2
```
Repository: dagucloud/dagu
Length of output: 50371
🏁 Script executed:
```bash
# Read the file around the flagged lines
wc -l internal/agent/api.go
```
Repository: dagucloud/dagu
Length of output: 85
🏁 Script executed:
```bash
# Get the specific lines and surrounding context
sed -n '1100,1110p; 1160,1170p; 1275,1285p' internal/agent/api.go | cat -n
```
Repository: dagucloud/dagu
Length of output: 1289
🏁 Script executed:
```bash
# Get function signatures for the three methods mentioned
sed -n '1050,1115p; 1150,1175p; 1260,1290p' internal/agent/api.go | cat -n
```
Repository: dagucloud/dagu
Length of output: 5032
🏁 Script executed:
```bash
# Check if there are any uuid.NewV7() calls with error handling in the codebase
rg 'uuid\.NewV7\(\)' --type go -B2 -A2 | grep -E '(err|error)' | head -20
```
Repository: dagucloud/dagu
Length of output: 1489
🏁 Script executed:
```bash
# Search for error handling patterns with uuid generation
rg 'if err.*uuid' --type go -A2 -B2
```
Repository: dagucloud/dagu
Length of output: 40
🏁 Script executed:
```bash
# Found it! Let's look at the uuid.NewV7() error handling in manager.go
sed -n '1,50p' internal/runtime/manager.go | rg -A5 'GenDAGRunID'
```
Repository: dagucloud/dagu
Length of output: 40
🏁 Script executed:
```bash
# Get the full function that handles uuid.NewV7() errors
sed -n '/func.*GenDAGRunID/,/^}/p' internal/runtime/manager.go | head -20
```
Repository: dagucloud/dagu
Length of output: 263
🏁 Script executed:
```bash
# Check how many uuid.NewV7() calls actually handle errors vs use Must
rg 'uuid\.NewV7\(\)' --type go | wc -l
rg 'uuid\.Must.*uuid\.NewV7' --type go | wc -l
```
Repository: dagucloud/dagu
Length of output: 66
🏁 Script executed:
```bash
# Verify the three functions in api.go have error return types
sed -n '1105,1120p; 1160,1175p; 1275,1290p' internal/agent/api.go | grep -E '(func|return)'
```
Repository: dagucloud/dagu
Length of output: 372
🏁 Script executed:
```bash
# Let's verify the actual line numbers more precisely for api.go
grep -n 'uuid.Must(uuid.NewV7())' internal/agent/api.go
```
Repository: dagucloud/dagu
Length of output: 196
🏁 Script executed:
```bash
# Check the context around those lines to confirm they're in the functions mentioned
sed -n '1100,1107p; 1160,1167p; 1275,1282p' internal/agent/api.go | cat -n
```
Repository: dagucloud/dagu
Length of output: 896
Avoid panic-based UUID generation in functions with error returns—but address this consistently across the codebase.
Lines 1104, 1164, and 1279 use uuid.Must(uuid.NewV7()). While these functions return error types and could propagate UUID generation failures, this pattern is used in 145+ places across the codebase (including other production request/session paths). Implementing error handling for only these three functions creates inconsistency. If the team intends to improve UUID error handling, apply it consistently across all production uses, or document a codebase-wide decision to rely on uuid.Must() for its extremely low failure rate.
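If the team opts for consistent error propagation, one low-churn route is a tiny helper like the hypothetical `newV7ID` below (not an existing function in the repository) that call sites can adopt incrementally.

```go
package agent

import (
	"fmt"

	"github.com/google/uuid"
)

// newV7ID wraps uuid.NewV7 so callers can propagate generation failures
// through their existing error returns instead of panicking.
func newV7ID() (string, error) {
	id, err := uuid.NewV7()
	if err != nil {
		return "", fmt.Errorf("generate UUIDv7: %w", err)
	}
	return id.String(), nil
}
```

Call sites then become `id, err := newV7ID()` with an early return, while genuinely unrecoverable init-time paths could keep `uuid.Must` under a documented policy.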
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/agent/api.go` at line 1104, The current use of
uuid.Must(uuid.NewV7()) (e.g., the id assignment at the shown location and the
similar usages at the other occurrences around lines 1164 and 1279) panics
instead of returning errors; replace these with non-panicking generation by
calling uuid.NewV7(), checking the returned error, and propagating or wrapping
that error via the function's existing error return (or returning a descriptive
error) where id is assigned, so the caller can handle failures; alternatively,
if the team prefers to keep uuid.Must(), document and apply that decision
consistently across all production UUID usages rather than changing only these
three sites.
```diff
 	if c.Config.ControlPlaneStore.Backend == config.ControlPlaneStoreBackendPostgres {
 		if c.ControlPlaneStore != nil && controlPlaneStoreRoleForCommand(c.Command) == controlplanestore.RoleScheduler {
 			schedulerControlPlaneStore = c.ControlPlaneStore
 		} else {
 			schedulerControlPlaneStore, err = controlplanestore.New(
 				c,
 				c.Config,
 				controlplanestore.WithRole(controlplanestore.RoleScheduler),
 				controlplanestore.WithLatestStatusToday(false),
 				controlplanestore.WithLocation(c.Config.Core.Location),
 				controlplanestore.WithHistoryFileCache(statusCache),
 			)
 			if err != nil {
 				return nil, fmt.Errorf("failed to initialize scheduler control-plane store: %w", err)
 			}
 		}
 		if schedulerControlPlaneStore != nil {
 			schedulerRunStore = schedulerControlPlaneStore.DAGRuns()
 			schedulerQueueStore = schedulerControlPlaneStore.Queue()
 			schedulerServiceRegistry = schedulerControlPlaneStore.Services()
 			schedulerDAGRunLeaseStore = schedulerControlPlaneStore.DAGRunLeases()
 			schedulerDispatchTaskStore = schedulerControlPlaneStore.DispatchTasks()
 		}
 	}
 	if schedulerRunStore == nil {
 		schedulerRunStore, err = dagrunstore.New(
 			c,
 			c.Config,
 			dagrunstore.WithRole(dagrunstore.RoleScheduler),
 			dagrunstore.WithLatestStatusToday(false),
 			dagrunstore.WithLocation(c.Config.Core.Location),
 			dagrunstore.WithHistoryFileCache(statusCache),
 		)
 		if err != nil {
 			return nil, fmt.Errorf("failed to initialize scheduler DAG-run store: %w", err)
 		}
 	}
 	ownsSchedulerRunStore := true
 	defer func() {
 		if ownsSchedulerRunStore {
 			if closeErr := exec.CloseDAGRunStore(context.Background(), schedulerRunStore); closeErr != nil {
 				logger.Warn(c, "Failed to close scheduler DAG-run store", tag.Error(closeErr))
 			}
 		}
 	}()
 	schedulerRunMgr := runtime.NewManager(schedulerRunStore, c.ProcStore, c.Config)
 
-	sched, err := scheduler.New(c.Config, m, schedulerRunMgr, schedulerRunStore, c.QueueStore, c.ProcStore, c.ServiceRegistry, coordinatorCli, wmStore)
+	sched, err := scheduler.New(c.Config, m, schedulerRunMgr, schedulerRunStore, schedulerQueueStore, c.ProcStore, schedulerServiceRegistry, coordinatorCli, wmStore)
 	if err != nil {
 		return nil, err
 	}
 	ownsSchedulerRunStore = false
```
Newly-opened schedulerControlPlaneStore is never closed.
When the postgres backend is selected and c.ControlPlaneStore is either nil or assigned to a different role, the else branch (Lines 865-877) opens a brand-new control-plane store with controlplanestore.New(...). Its DAG-run sub-store is then assigned to schedulerRunStore, but the existing defer at Lines 900-906 only calls exec.CloseDAGRunStore(...) on the sub-store. That does not release the underlying pgx pool, which only the parent's schedulerControlPlaneStore.Close() can do — so:
- On scheduler.New failure (Lines 909-912): the new control-plane store leaks (in addition to the dag-run sub-store close being effectively a no-op for the pool).
- On success: the scheduler holds schedulerRunStore etc., but no shutdown path ever invokes schedulerControlPlaneStore.Close() — the pool lives until process exit.
This is the same ownership-transfer pattern fixed previously for schedulerRunStore; it needs to be extended to the control-plane store when this command opens it.
🔧 Sketch of the ownership pattern
```diff
 if c.Config.ControlPlaneStore.Backend == config.ControlPlaneStoreBackendPostgres {
+	ownsSchedulerControlPlaneStore := false
 	if c.ControlPlaneStore != nil && controlPlaneStoreRoleForCommand(c.Command) == controlplanestore.RoleScheduler {
 		schedulerControlPlaneStore = c.ControlPlaneStore
 	} else {
 		schedulerControlPlaneStore, err = controlplanestore.New(
 			c, c.Config,
 			controlplanestore.WithRole(controlplanestore.RoleScheduler),
 			controlplanestore.WithLatestStatusToday(false),
 			controlplanestore.WithLocation(c.Config.Core.Location),
 			controlplanestore.WithHistoryFileCache(statusCache),
 		)
 		if err != nil {
 			return nil, fmt.Errorf("failed to initialize scheduler control-plane store: %w", err)
 		}
+		ownsSchedulerControlPlaneStore = true
+		defer func() {
+			if ownsSchedulerControlPlaneStore {
+				if closeErr := schedulerControlPlaneStore.Close(); closeErr != nil {
+					logger.Warn(c, "Failed to close scheduler control-plane store", tag.Error(closeErr))
+				}
+			}
+		}()
 	}
 	// ... extract sub-stores ...
 }
 // ... open dagrun store fallback, scheduler.New, etc. ...
 ownsSchedulerRunStore = false
+// Hand ownership of the control-plane store to the scheduler/Context lifecycle.
+ownsSchedulerControlPlaneStore = false
```
You will also need a place — likely the scheduler shutdown or Context.Close — to actually invoke schedulerControlPlaneStore.Close(); otherwise the pool simply lives for the process lifetime.
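One hedged way to give the pool an owner is a closer list on the command context; the `closers` field and `RegisterCloser` are hypothetical additions, and the struct below is a stripped-down stand-in for the real `cmd.Context`.

```go
package cmd

import "log"

// Context stand-in: the real cmd.Context carries far more state.
type Context struct {
	closers []func() error
}

// RegisterCloser records a resource to release when the process shuts down.
func (c *Context) RegisterCloser(f func() error) {
	c.closers = append(c.closers, f)
}

// Close runs closers in reverse registration order so dependents shut down
// before the stores they borrow from.
func (c *Context) Close() {
	for i := len(c.closers) - 1; i >= 0; i-- {
		if err := c.closers[i](); err != nil {
			log.Printf("close: %v", err)
		}
	}
}
```

After `scheduler.New` succeeds, `c.RegisterCloser(schedulerControlPlaneStore.Close)` would replace clearing the ownership flag, leaving the defer to fire only on early-return paths.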
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/cmd/context.go` around lines 862 - 913, The newly-created
schedulerControlPlaneStore from controlplanestore.New is never closed; add
ownership tracking and proper cleanup similar to schedulerRunStore: set a bool
(e.g., ownsSchedulerControlPlaneStore) to true when you create
schedulerControlPlaneStore in the else branch, add a defer that calls
schedulerControlPlaneStore.Close() (or the appropriate close method) if
ownsSchedulerControlPlaneStore is true so it will be closed on early returns
(including on scheduler.New error), and ensure you clear the flag (set to false)
only when ownership is intentionally transferred to the longer-lived shutdown
path (e.g., register it so Context.Close or the scheduler shutdown will call
schedulerControlPlaneStore.Close()); reference schedulerControlPlaneStore,
controlplanestore.New, ownsSchedulerRunStore/exec.CloseDAGRunStore,
scheduler.New, and Context.Close when making these changes.
| "ControlPlaneStorePostgresAgentDef": { | ||
| "type": "object", | ||
| "description": "PostgreSQL control-plane store configuration for DAG execution processes.", | ||
| "additionalProperties": false, | ||
| "properties": { | ||
| "dsn": { | ||
| "type": "string", | ||
| "description": "PostgreSQL connection string used by DAG execution processes when direct_access is enabled." | ||
| }, | ||
| "auto_migrate": { | ||
| "type": "boolean", | ||
| "description": "Run control-plane store migrations when an agent process opens the store. Default: false." | ||
| }, | ||
| "direct_access": { | ||
| "type": "boolean", | ||
| "description": "Allow DAG execution processes to open the PostgreSQL control-plane store directly. Default: false; production PostgreSQL deployments should use coordinator/shared-nothing worker execution instead." | ||
| }, | ||
| "pool": { | ||
| "$ref": "#/definitions/PostgresPoolDef" | ||
| } | ||
| } |
Require dsn when agent direct access is enabled.
Right now agent: { direct_access: true } passes schema validation without a connection string, but that configuration is not usable at runtime.
Suggested change
"ControlPlaneStorePostgresAgentDef": {
"type": "object",
"description": "PostgreSQL control-plane store configuration for DAG execution processes.",
"additionalProperties": false,
"properties": {
"dsn": {
"type": "string",
"description": "PostgreSQL connection string used by DAG execution processes when direct_access is enabled."
},
"auto_migrate": {
"type": "boolean",
"description": "Run control-plane store migrations when an agent process opens the store. Default: false."
},
"direct_access": {
"type": "boolean",
"description": "Allow DAG execution processes to open the PostgreSQL control-plane store directly. Default: false; production PostgreSQL deployments should use coordinator/shared-nothing worker execution instead."
},
"pool": {
"$ref": "#/definitions/PostgresPoolDef"
}
- }
+ },
+ "allOf": [
+ {
+ "if": {
+ "properties": {
+ "direct_access": { "const": true }
+ },
+ "required": ["direct_access"]
+ },
+ "then": {
+ "required": ["dsn"]
+ }
+ }
+ ]
},📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/cmn/schema/config.schema.json` around lines 1011 - 1031, The schema
currently allows ControlPlaneStorePostgresAgentDef objects with direct_access:
true but no dsn; update the ControlPlaneStorePostgresAgentDef definition to
enforce that when "direct_access" is true the "dsn" property is required by
adding a JSON Schema conditional (an "if" that checks
properties.direct_access.const true and a matching "then" that adds required:
["dsn"]) so validation fails unless dsn is present; reference the
ControlPlaneStorePostgresAgentDef object and its properties "direct_access" and
"dsn" when making the change.
```go
RowLimit:  int32(limit),  //nolint:gosec
RowOffset: int32(offset), //nolint:gosec
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify int-based pagination fields and narrowing to int32 in the Postgres mapper.
rg -n -C2 'type QueryFilter|Limit|Offset' --type go internal
rg -n -C3 'auditQueryParams|RowLimit|RowOffset|int32\(limit\)|int32\(offset\)' internal/persis/controlplanestore/postgres/audit_store.go
```
Repository: dagucloud/dagu
Length of output: 50371
🏁 Script executed:
```bash
cat -n internal/persis/controlplanestore/postgres/audit_store.go | head -120
```
Repository: dagucloud/dagu
Length of output: 4051
🏁 Script executed:
```bash
rg -n "type QueryFilter" --type go -A 15 internal/service/audit/
```
Repository: dagucloud/dagu
Length of output: 1126
🏁 Script executed:
```bash
rg -n "auditQueryParams" --type go internal/
```
Repository: dagucloud/dagu
Length of output: 327
🏁 Script executed:
```bash
rg -n "\.Query\(|\.Audit\(\)" --type go internal/service/frontend/api | grep -i audit
```
Repository: dagucloud/dagu
Length of output: 153
🏁 Script executed:
```bash
rg -n "audit.Query|auditStore.Query|AuditStore.*Query" --type go internal/
```
Repository: dagucloud/dagu
Length of output: 2081
🏁 Script executed:
```bash
cat -n internal/service/frontend/api/v1/audit.go | head -100
```
Repository: dagucloud/dagu
Length of output: 3248
🏁 Script executed:
```bash
rg -n "filter.Offset|filter.Limit" --type go internal/service/frontend/api/v1/audit.go
```
Repository: dagucloud/dagu
Length of output: 323
🏁 Script executed:
```bash
# Check if there are other similar direct int to int32 casts for pagination
rg -n "int32\(" --type go internal/persis/controlplanestore/postgres/ | grep -i "limit\|offset"
```
Repository: dagucloud/dagu
Length of output: 935
Clamp pagination values before narrowing to int32.
Lines 95-96 cast int values directly to int32 without upper bounds checking. While the API handler caps limit to 500, offset has no upper bound and can overflow to invalid negative values when cast to int32. Defensive clamping protects against direct internal callers that bypass API validation.
Proposed fix
```diff
 func auditQueryParams(filter audit.QueryFilter, limit, offset int) db.QueryAuditEntriesParams {
+	const maxInt32 = int(^uint32(0) >> 1)
+	if limit > maxInt32 {
+		limit = maxInt32
+	}
+	if offset > maxInt32 {
+		offset = maxInt32
+	}
 	return db.QueryAuditEntriesParams{
 		HasCategory: filter.Category != "",
 		Category:    string(filter.Category),
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/persis/controlplanestore/postgres/audit_store.go` around lines 95 -
96, Clamp the pagination integers before converting to int32: ensure `limit` is
bounded (e.g., min 0, max 500 or math.MaxInt32 if you want a hard upper clamp)
and ensure `offset` is non-negative and not greater than math.MaxInt32, then
cast to int32 when assigning `RowLimit` and `RowOffset` in the code that builds
the query (the fields named RowLimit and RowOffset in audit_store.go). Replace
the direct casts int32(limit)/int32(offset) with a small helper or inline
clamping logic that applies bounds and then casts to int32 to prevent
overflow/negative values from internal callers.
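A minimal sketch of the clamping helper suggested above; `clampToInt32` is a hypothetical name.

```go
package postgres

import "math"

// clampToInt32 bounds an int into [0, math.MaxInt32] before narrowing,
// so oversized or negative pagination values cannot overflow.
func clampToInt32(v int) int32 {
	if v < 0 {
		return 0
	}
	if v > math.MaxInt32 {
		return math.MaxInt32
	}
	return int32(v)
}
```

The params builder would then use `RowLimit: clampToInt32(limit)` and `RowOffset: clampToInt32(offset)`, and the `//nolint:gosec` annotations could be dropped.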
```go
func (s *userStore) Update(ctx context.Context, user *auth.User) error {
	if user == nil {
		return errors.New("postgres user store: user cannot be nil")
	}
	id, err := parseUUIDv7(user.ID)
	if err != nil {
		return auth.ErrInvalidUserID
	}
	if user.Username == "" {
		return auth.ErrInvalidUsername
	}
	if !user.Role.Valid() {
		return auth.ErrInvalidRole
	}
	workspaceAccess, err := marshalWorkspaceAccess(user.WorkspaceAccess)
	if err != nil {
		return err
	}
	data, err := json.Marshal(user.ToStorage())
	if err != nil {
		return fmt.Errorf("marshal user: %w", err)
	}
	rows, err := s.store.queries.UpdateUser(ctx, db.UpdateUserParams{
		Username:        user.Username,
		PasswordHash:    user.PasswordHash,
		Role:            string(user.Role),
		AuthProvider:    user.AuthProvider,
		OidcIssuer:      user.OIDCIssuer,
		OidcSubject:     user.OIDCSubject,
		IsDisabled:      user.IsDisabled,
		WorkspaceAccess: workspaceAccess,
		Data:            data,
		UpdatedAt:       timestamptz(user.UpdatedAt),
		ID:              id,
	})
```
Keep the OIDC identity invariant on update.
Create rejects rows where only one of OIDCIssuer / OIDCSubject is set, but Update can write exactly that state. That leaves a user record that GetByOIDCIdentity cannot match correctly.
Suggested change
```diff
 	if user.Username == "" {
 		return auth.ErrInvalidUsername
 	}
 	if !user.Role.Valid() {
 		return auth.ErrInvalidRole
 	}
+	if (user.OIDCIssuer == "") != (user.OIDCSubject == "") {
+		return auth.ErrOIDCIdentityNotFound
+	}
 	workspaceAccess, err := marshalWorkspaceAccess(user.WorkspaceAccess)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/persis/controlplanestore/postgres/auth_store.go` around lines 155 -
189, The Update method can write a user with only one of OIDCIssuer /
OIDCSubject set, breaking the invariant used by GetByOIDCIdentity; before
marshalling or calling s.store.queries.UpdateUser in userStore.Update, validate
that either both user.OIDCIssuer and user.OIDCSubject are empty or both are
non-empty, and if that check fails return a suitable error (e.g.
auth.ErrInvalidOIDCIdentity or another existing auth error) so Update rejects
partial OIDC identity updates just like Create does.
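The invariant is small enough to pin down with a table-driven check. A minimal sketch — `oidcIdentityConsistent` is a hypothetical helper, not an existing function in the package:

```go
package postgres_test

import "testing"

// oidcIdentityConsistent reports whether issuer/subject are both set (an
// OIDC-provisioned user) or both empty (a local user); a half-set pair is
// exactly the state GetByOIDCIdentity can never match.
func oidcIdentityConsistent(issuer, subject string) bool {
	return (issuer == "") == (subject == "")
}

func TestOIDCIdentityInvariant(t *testing.T) {
	cases := []struct {
		issuer, subject string
		ok              bool
	}{
		{"", "", true},                              // local user
		{"https://issuer.example", "sub-123", true}, // OIDC user
		{"https://issuer.example", "", false},       // half-set: reject
		{"", "sub-123", false},                      // half-set: reject
	}
	for _, c := range cases {
		if got := oidcIdentityConsistent(c.issuer, c.subject); got != c.ok {
			t.Errorf("(%q, %q) = %v, want %v", c.issuer, c.subject, got, c.ok)
		}
	}
}
```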
```go
for _, reg := range registrations {
	if reg.cancel != nil {
		reg.cancel()
	}
	reg.wg.Wait()
	_, _ = s.queries.DeleteServiceInstance(ctx, db.DeleteServiceInstanceParams{
		ServiceName: string(reg.serviceName),
		InstanceID:  reg.hostInfo.ID,
	})
```
Don’t silently drop unregister cleanup failures.
Once the heartbeat is canceled, a failed DeleteServiceInstance leaves the row discoverable until serviceStaleTimeout elapses. This is especially easy to hit during shutdown if ctx is already canceled.
Suggested change

```diff
 for _, reg := range registrations {
 	if reg.cancel != nil {
 		reg.cancel()
 	}
 	reg.wg.Wait()
-	_, _ = s.queries.DeleteServiceInstance(ctx, db.DeleteServiceInstanceParams{
+	cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 5*time.Second)
+	_, err := s.queries.DeleteServiceInstance(cleanupCtx, db.DeleteServiceInstanceParams{
 		ServiceName: string(reg.serviceName),
 		InstanceID:  reg.hostInfo.ID,
 	})
+	cancel()
+	if err != nil {
+		// surface this through logging/metrics instead of dropping it silently
+	}
 }
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for _, reg := range registrations { | |
| if reg.cancel != nil { | |
| reg.cancel() | |
| } | |
| reg.wg.Wait() | |
| _, _ = s.queries.DeleteServiceInstance(ctx, db.DeleteServiceInstanceParams{ | |
| ServiceName: string(reg.serviceName), | |
| InstanceID: reg.hostInfo.ID, | |
| }) | |
| for _, reg := range registrations { | |
| if reg.cancel != nil { | |
| reg.cancel() | |
| } | |
| reg.wg.Wait() | |
| cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 5*time.Second) | |
| _, err := s.queries.DeleteServiceInstance(cleanupCtx, db.DeleteServiceInstanceParams{ | |
| ServiceName: string(reg.serviceName), | |
| InstanceID: reg.hostInfo.ID, | |
| }) | |
| cancel() | |
| if err != nil { | |
| // surface this through logging/metrics instead of dropping it silently | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/persis/controlplanestore/postgres/registry.go` around lines 89 - 97,
The loop currently ignores errors from s.queries.DeleteServiceInstance after
canceling reg.cancel and waiting reg.wg, so deletions can silently fail
(especially if ctx is canceled). Change the delete call to run with a fresh
bounded context (e.g., use context.Background() or context.WithTimeout when ctx
is already canceled) and capture the returned error instead of discarding it; if
deletion fails log the error (or retry once) including reg.serviceName and
reg.hostInfo.ID so the failure is observable and not left to
serviceStaleTimeout. Ensure you update the code around registrations,
reg.cancel, reg.wg.Wait and s.queries.DeleteServiceInstance to use the new ctx
and handle/log the error.
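The essential move in the diff is detaching cleanup from a caller context that may already be canceled while still bounding its runtime. A standalone sketch, assuming Go 1.21+ for `context.WithoutCancel`; `bestEffortCleanup` and `deleteInstance` are illustrative names standing in for the real unregister path:

```go
package main

import (
	"context"
	"log"
	"time"
)

// bestEffortCleanup runs deleteInstance even when ctx is already done,
// but never for longer than five seconds, and logs failures rather than
// discarding them.
func bestEffortCleanup(ctx context.Context, deleteInstance func(context.Context) error) {
	cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 5*time.Second)
	defer cancel()
	if err := deleteInstance(cleanupCtx); err != nil {
		log.Printf("unregister cleanup failed: %v", err)
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate shutdown: the parent context is already canceled
	bestEffortCleanup(ctx, func(c context.Context) error {
		return c.Err() // nil: the detached context is still live
	})
}
```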
```go
msgIDString, msgUUID, err := ensureUUIDv7String(msg.ID)
if err != nil {
	msgIDString, msgUUID, err = newUUIDv7String()
	if err != nil {
		return fmt.Errorf("generate message id: %w", err)
	}
}
msg.ID = msgIDString
```
Silent replacement of a non-empty invalid msg.ID is unexpected.
ensureUUIDv7String returns an error for any non-UUID v7 string, including a non-empty but wrongly-formatted ID. When that happens the code silently generates a fresh UUID and overwrites msg.ID on the caller's object. Callers that set an explicit ID (e.g. for idempotent inserts) would have their ID silently changed without any indication.
A cleaner contract: only auto-generate when msg.ID == ""; reject a non-empty invalid ID with an explicit error.
🛡️ Proposed fix

```diff
-	msgIDString, msgUUID, err := ensureUUIDv7String(msg.ID)
-	if err != nil {
-		msgIDString, msgUUID, err = newUUIDv7String()
-		if err != nil {
-			return fmt.Errorf("generate message id: %w", err)
-		}
-	}
+	var msgIDString string
+	var msgUUID uuid.UUID
+	if msg.ID == "" {
+		var err error
+		msgIDString, msgUUID, err = newUUIDv7String()
+		if err != nil {
+			return fmt.Errorf("generate message id: %w", err)
+		}
+	} else {
+		var err error
+		msgIDString, msgUUID, err = ensureUUIDv7String(msg.ID)
+		if err != nil {
+			return agent.ErrInvalidSessionID
+		}
+	}
```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/persis/controlplanestore/postgres/session_store.go` around lines 170
- 177, Currently the code silently replaces any invalid msg.ID by calling
ensureUUIDv7String and falling back to newUUIDv7String; change this so you only
auto-generate a new UUID when msg.ID is empty and otherwise validate the
provided ID: call ensureUUIDv7String only to validate a non-empty msg.ID and
return an explicit error if it is invalid, and call newUUIDv7String only when
msg.ID == "" to populate msg.ID; update the block around
ensureUUIDv7String/newUUIDv7String and the assignment to msg.ID accordingly.
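The cleaner contract fits in one helper: generate only for empty IDs, reject malformed non-empty ones. A sketch assuming github.com/google/uuid; `resolveMessageID` is a hypothetical name and the error values are placeholders for whatever sentinel the package standardizes on:

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// resolveMessageID mints a fresh UUIDv7 only when id is empty and rejects
// any non-empty value that is not already a valid UUIDv7, so a
// caller-supplied ID is never silently replaced.
func resolveMessageID(id string) (string, uuid.UUID, error) {
	if id == "" {
		u, err := uuid.NewV7()
		if err != nil {
			return "", uuid.Nil, fmt.Errorf("generate message id: %w", err)
		}
		return u.String(), u, nil
	}
	u, err := uuid.Parse(id)
	if err != nil || u.Version() != 7 {
		return "", uuid.Nil, fmt.Errorf("invalid message id %q: not a UUIDv7", id)
	}
	return u.String(), u, nil
}

func main() {
	if s, _, err := resolveMessageID(""); err == nil {
		fmt.Println("generated:", s)
	}
	if _, _, err := resolveMessageID("not-a-uuid"); err != nil {
		fmt.Println("rejected:", err) // explicit error, no silent overwrite
	}
}
```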
```go
return s.store.withTx(ctx, func(q *db.Queries) error {
	row, err := q.GetAgentSession(ctx, sessionUUID)
	if errors.Is(err, pgx.ErrNoRows) {
		return agent.ErrSessionNotFound
	}
	if err != nil {
		return err
	}
	if err := q.AddAgentSessionMessage(ctx, db.AddAgentSessionMessageParams{
		ID:          msgUUID,
		SessionID:   sessionUUID,
		MessageType: string(msg.Type),
		SequenceID:  msg.SequenceID,
		CreatedAt:   timestamptz(msg.CreatedAt),
		Data:        data,
	}); err != nil {
		return err
	}

	updatedAt := time.Now().UTC()
	sess, err := sessionFromRow(row)
	if err != nil {
		return err
	}
	if sess.Title == "" && msg.Type == agent.MessageTypeUser && msg.Content != "" {
		sess.Title = truncateAgentSessionTitle(msg.Content)
	}
	sess.UpdatedAt = updatedAt
	parentID, err := nullUUIDv7(sess.ParentSessionID)
	if err != nil {
		return err
	}
	sessionData, err := json.Marshal(sessionStorageFromSession(sess, nil))
	if err != nil {
		return fmt.Errorf("marshal session: %w", err)
	}
	rows, err := q.UpdateAgentSession(ctx, db.UpdateAgentSessionParams{
		ID:              sessionUUID,
		UserID:          sess.UserID,
		DagName:         sess.DAGName,
		Title:           sess.Title,
		Model:           sess.Model,
		ParentSessionID: parentID,
		DelegateTask:    sess.DelegateTask,
		Data:            sessionData,
		UpdatedAt:       timestamptz(updatedAt),
	})
	if err != nil {
		return err
	}
	if rows == 0 {
		return agent.ErrSessionNotFound
	}
	return nil
})
```
Lost-update race in AddMessage: missing row-level lock on session read.
Inside the transaction, GetAgentSession reads the session at READ COMMITTED isolation with no locking. Two concurrent AddMessage calls for the same session can both observe an empty title, both derive a title from their respective first user-message content, and the last writer silently overwrites the earlier commit. The same race applies to UpdatedAt.
Add SELECT ... FOR UPDATE (or a FOR UPDATE variant in the sqlc query) when fetching the session row to serialise concurrent writers.
```sql
-- name: GetAgentSessionForUpdate :one
SELECT *
FROM dagu_agent_sessions
WHERE id = sqlc.arg(id)
FOR UPDATE;
```

Then use `q.GetAgentSessionForUpdate` inside `withTx`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/persis/controlplanestore/postgres/session_store.go` around lines 187
- 241, The AddMessage transaction reads the session with q.GetAgentSession
without a row-level lock, allowing lost-update races; change the sqlc query to
add FOR UPDATE (e.g., create "GetAgentSessionForUpdate :one" selecting FROM
dagu_agent_sessions ... FOR UPDATE) and replace the call to q.GetAgentSession
with q.GetAgentSessionForUpdate inside the withTx block so concurrent AddMessage
calls serialize updates to Title and UpdatedAt; ensure the new query is used in
the same transaction context (withTx) and update any error handling references
accordingly.
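For context, this is roughly what the locking read looks like at the pgx level once such a query exists. A sketch assuming a `pgxpool.Pool` and the table above; `updateTitleLocked`, the column handling, and the first-writer-wins rule are illustrative, not the store's actual logic:

```go
package sessionlock

import (
	"context"

	"github.com/google/uuid"
	"github.com/jackc/pgx/v5/pgxpool"
)

// updateTitleLocked serializes concurrent writers on a single session row:
// the FOR UPDATE read blocks a second transaction until the first commits,
// so the second observes the committed title instead of racing past it.
func updateTitleLocked(ctx context.Context, pool *pgxpool.Pool, sessionID uuid.UUID, title string) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // harmless no-op after a successful commit

	var current string
	if err := tx.QueryRow(ctx,
		`SELECT title FROM dagu_agent_sessions WHERE id = $1 FOR UPDATE`,
		sessionID).Scan(&current); err != nil {
		return err
	}
	if current == "" { // only the first writer derives the title
		if _, err := tx.Exec(ctx,
			`UPDATE dagu_agent_sessions SET title = $2, updated_at = now() WHERE id = $1`,
			sessionID, title); err != nil {
			return err
		}
	}
	return tx.Commit(ctx)
}
```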
```diff
 enqueuedAt := time.Now().UTC()
-fileName := fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(), uuid.NewString())
+fileName := fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(), uuid.Must(uuid.NewV7()).String())
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify all panic-based UUIDv7 call sites and google/uuid version usage.
set -euo pipefail
echo "== go.mod / go.sum references =="
rg -n --type=go 'github.com/google/uuid|^go [0-9]+\.[0-9]+' go.mod go.sum || true
echo
echo "== panic-based UUIDv7 call sites =="
rg -n --type=go 'uuid\.Must\s*\(\s*uuid\.NewV7\(\)\s*\)' internal
```

Length of output: 14833
Handle UUID generation errors instead of panicking in dispatch task store
Lines 81 and 117 use uuid.Must(uuid.NewV7()), which panics on UUID generation failures. Since both Enqueue and ClaimNext already return errors, handle generation failures explicitly to avoid process crashes.
Proposed fix

```diff
-	fileName := fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(), uuid.Must(uuid.NewV7()).String())
+	taskID, err := uuid.NewV7()
+	if err != nil {
+		return fmt.Errorf("generate dispatch task id: %w", err)
+	}
+	fileName := fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(), taskID.String())
```

```diff
-	claimToken := uuid.Must(uuid.NewV7()).String()
+	claimID, err := uuid.NewV7()
+	if err != nil {
+		return nil, fmt.Errorf("generate dispatch claim token: %w", err)
+	}
+	claimToken := claimID.String()
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| fileName := fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(), uuid.Must(uuid.NewV7()).String()) | |
| taskID, err := uuid.NewV7() | |
| if err != nil { | |
| return fmt.Errorf("generate dispatch task id: %w", err) | |
| } | |
| fileName := fmt.Sprintf("task_%020d_%s.json", enqueuedAt.UnixMilli(), taskID.String()) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/persis/filedistributed/dispatch_task_store.go` at line 81, The code
currently uses uuid.Must(uuid.NewV7()) which panics on failure; change both
places (in Enqueue where fileName := fmt.Sprintf("task_%020d_%s.json",
enqueuedAt.UnixMilli(), uuid.Must(uuid.NewV7()).String()) and in ClaimNext where
uuid.Must is used) to call uuid.NewV7() directly, check the returned error, and
propagate a wrapped error (return fmt.Errorf(...): include context like
"generating task UUID" or "generating claim UUID") instead of panicking; update
the fileName construction to use the validated uuid string and ensure both
Enqueue and ClaimNext return that error to the caller.
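Both call sites want the same thing — a string ID whose generation failure can propagate — so a tiny shared helper avoids repeating the fix. A sketch assuming github.com/google/uuid; `newV7String` is a hypothetical name here:

```go
package filedistributed

import (
	"fmt"

	"github.com/google/uuid"
)

// newV7String wraps uuid.NewV7 so callers can propagate generation
// failures instead of panicking through uuid.Must.
func newV7String() (string, error) {
	u, err := uuid.NewV7()
	if err != nil {
		return "", fmt.Errorf("generate uuidv7: %w", err)
	}
	return u.String(), nil
}
```

Enqueue would then check the helper's error before formatting the task file name, and ClaimNext would do the same for its claim token.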
```diff
 user = &auth.User{
-	ID:       uuid.New().String(),
+	ID:       uuid.Must(uuid.NewV7()).String(),
 	Username: username,
```
Replace uuid.Must() with explicit error handling in the request path.
NewV7 returns Nil and an error on failure (when crypto/rand is unavailable). Wrapping it in uuid.Must() inside a per-request handler causes a panic instead of a graceful error return, which would crash the goroutine serving this OIDC login.
🛡️ Proposed fix

```diff
+	id, err := uuid.NewV7()
+	if err != nil {
+		return nil, false, fmt.Errorf("failed to generate user ID: %w", err)
+	}
 	user = &auth.User{
-		ID:       uuid.Must(uuid.NewV7()).String(),
+		ID:       id.String(),
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| user = &auth.User{ | |
| ID: uuid.New().String(), | |
| ID: uuid.Must(uuid.NewV7()).String(), | |
| Username: username, | |
| id, err := uuid.NewV7() | |
| if err != nil { | |
| return nil, false, fmt.Errorf("failed to generate user ID: %w", err) | |
| } | |
| user = &auth.User{ | |
| ID: id.String(), | |
| Username: username, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/service/oidcprovision/service.go` around lines 159 - 161, Replace
the panic-prone uuid.Must call in the request path by calling uuid.NewV7()
directly, checking its returned error, and returning a graceful error response
from the OIDC provisioning handler instead of allowing a panic; update the code
that constructs the auth.User (where ID is set) to use the validated UUID string
only when err == nil, and propagate a suitable error/HTTP response from the
function that creates the user when uuid.NewV7() fails.
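The failure mode is also testable: github.com/google/uuid exposes `SetRand` for swapping the entropy source, which makes `NewV7` return an error — the condition under which `uuid.Must` would panic mid-request. A test-only sketch (`SetRand` mutates package-level state, so it does not belong in production code):

```go
package oidcprovision_test

import (
	"errors"
	"testing"
	"testing/iotest"

	"github.com/google/uuid"
)

// TestNewV7CanFail shows uuid.NewV7 surfacing a rand failure as an error;
// wrapped in uuid.Must, the same condition panics the serving goroutine.
func TestNewV7CanFail(t *testing.T) {
	uuid.SetRand(iotest.ErrReader(errors.New("entropy exhausted")))
	defer uuid.SetRand(nil) // nil restores the crypto/rand default

	if _, err := uuid.NewV7(); err == nil {
		t.Fatal("expected NewV7 to return the rand error")
	}
}
```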
Summary

- `make dev-pg` with a PostgreSQL 18 compose service

Testing

- `go tool sqlc generate`
- `go test ./internal/cmn/config ./internal/cmn/schema ./internal/cmd ./internal/engine`
- `go test ./internal/core/exec ./internal/persis/filedagrun ./internal/service/frontend/api/v1`
- `go test ./internal/persis/dagrunstore/...`
- `docker compose -p dagu-dev-pg-verify-18 -f compose.postgres.yaml config`
- `docker compose -p dagu-dev-pg-verify-18 -f compose.postgres.yaml up -d postgres` + `pg_isready -U dagu -d dagu`

Summary by CodeRabbit
Release Notes
New Features
Development