Mainnet /api/health gagent-service unhealthy: ES sort on script-catalog-entries.updated_at_utc_value fails when field has no mapping #410

@eanzhao

Symptom

GET https://aevatar-console-backend-api.aevatar.ai/api/health returns HTTP 503 with gagent-service reported as unhealthy and the entire host as not-ready.

{
  "name": "gagent-service",
  "category": "capability",
  "critical": true,
  "status": "unhealthy",
  "message": "Elasticsearch query failed: 400 Bad Request. body={\"error\":{\"root_cause\":[{\"type\":\"query_shard_exception\",\"reason\":\"No mapping found for [updated_at_utc_value] in order to sort on\",\"index_uuid\":\"C_2X-E2gRsGUX_sigX8cug\",\"index\":\"aevatar-mainnet-script-catalog-entries\"}], ... ,\"status\":400}",
  "details": {
    "requiredRoutes": "/api/services, /api/scopes/{scopeId}/binding, /api/scopes/{scopeId}/workflows, /api/scopes/{scopeId}/scripts",
    "exceptionType": "System.InvalidOperationException"
  }
}

Because gagent-service is critical: true, the whole host is not-ready. Other capabilities (scripting-bundle, studio, workflow-bundle, workflow-document-readmodel, workflow-graph-readmodel) all report healthy.

Captured 2026-04-25T06:31:04Z UTC against Aevatar.Mainnet.Host.Api.

Root cause

The gagent-service probe issues an Elasticsearch query that explicitly sorts on UpdatedAt, but the index aevatar-mainnet-script-catalog-entries either has no documents (so dynamic mapping has created no fields yet) or has documents whose updated_at_utc_value is serialized as a nested struct rather than a sortable scalar. When the sort field has no usable mapping, Elasticsearch returns 400 unless the sort clause sets unmapped_type; the companion missing hint only controls where documents without a value are ordered.

The default-sort path includes those hints; the explicit-sort path does not. That asymmetry is the bug.
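The nested-struct case can be illustrated with a small Python sketch. It is a simplification of Elasticsearch dynamic mapping, not the projection store's code: it lists only the scalar leaf paths a document would contribute. The `script_id` field and both sample documents are hypothetical.

```python
def leaf_field_paths(doc, prefix=""):
    """Return dotted paths of scalar leaves, roughly the fields dynamic mapping would index."""
    paths = []
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # A JSON object contributes its children, not a scalar field of its own.
            paths.extend(leaf_field_paths(value, prefix=f"{path}."))
        else:
            paths.append(path)
    return paths


# Hypothetical document shapes; names and values are for illustration only.
nested_doc = {"script_id": "demo", "updated_at_utc_value": {"seconds": 1777098664, "nanos": 0}}
scalar_doc = {"script_id": "demo", "updated_at_utc_value": "2026-04-25T06:31:04Z"}

SORT_FIELD = "updated_at_utc_value"
for label, doc in (("nested", nested_doc), ("scalar", scalar_doc)):
    paths = leaf_field_paths(doc)
    print(label, paths, "sortable leaf present:", SORT_FIELD in paths)
```

In the nested shape the only leaves are `updated_at_utc_value.seconds` and `updated_at_utc_value.nanos`, so a sort on the bare field name has no scalar field to bind to.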

Call path

  1. GAgentServiceCapabilityHostBuilderExtensions.AddGAgentServiceCapabilityBundle's probe at src/platform/Aevatar.GAgentService.Hosting/Endpoints/GAgentServiceCapabilityHostBuilderExtensions.cs:34-35 calls IScopeScriptQueryPort.ListAsync("health", ct).
  2. ScopeScriptQueryApplicationService.ListAsync (src/platform/Aevatar.GAgentService.Application/Scripts/ScopeScriptQueryApplicationService.cs:22-35) → IScriptCatalogQueryPort.ListCatalogEntriesAsync(catalogActorId, take, ct).
  3. ProjectionScriptCatalogQueryPort.ListCatalogEntriesAsync (src/Aevatar.Scripting.Projection/ReadPorts/ProjectionScriptCatalogQueryPort.cs:73-94) builds a ProjectionDocumentQuery with explicit Sorts = [{ FieldPath = nameof(ScriptCatalogEntryDocument.UpdatedAt), Direction = Desc }].
  4. The Elasticsearch payload builder resolves "UpdatedAt" to the proto field name updated_at_utc_value via ResolveFieldPath / BuildFieldCandidates (src/Aevatar.CQRS.Projection.Providers.Elasticsearch/Stores/ElasticsearchProjectionDocumentStore.cs:317-394). The _utc_value candidate matches ScriptCatalogEntryDocument.updated_at_utc_value (proto field 5, google.protobuf.Timestamp, see src/Aevatar.Scripting.Projection/script_projection_read_models.proto:48-63).
  5. BuildSortSpec in src/Aevatar.CQRS.Projection.Providers.Elasticsearch/Stores/ElasticsearchProjectionDocumentStorePayloadSupport.cs:176-202 takes the explicit branch and emits BuildSortClause(..., includeMissingHints: false).
  6. BuildSortClause (...PayloadSupport.cs:204-224) only adds "missing":"_last" and "unmapped_type":"date" when includeMissingHints == true. The default-sort path uses true; the explicit-sort path uses false.
  7. The index metadata provider ScriptCatalogEntryDocumentMetadataProvider (src/Aevatar.Scripting.Projection/Metadata/ScriptCatalogEntryDocumentMetadataProvider.cs:8-15) declares only "dynamic": true with no explicit field mappings.
  8. Without unmapped_type and with no document-derived mapping for updated_at_utc_value, Elasticsearch returns query_shard_exception: No mapping found for [updated_at_utc_value] in order to sort on.
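Steps 5 and 6 can be sketched in Python to make the asymmetry concrete. This is not the repository's C#: `build_sort_clause` and its parameters are hypothetical stand-ins that mirror only the behavior described above, and the same field is used on both paths purely for comparison.

```python
import json


def build_sort_clause(field, direction, include_missing_hints):
    """Hypothetical stand-in for the described BuildSortClause behavior."""
    options = {"order": direction}
    if include_missing_hints:
        # Only the default-sort path is described as adding these hints.
        options["missing"] = "_last"
        options["unmapped_type"] = "date"
    return {field: options}


default_sort = build_sort_clause("updated_at_utc_value", "desc", include_missing_hints=True)
explicit_sort = build_sort_clause("updated_at_utc_value", "desc", include_missing_hints=False)

print(json.dumps({"sort": [default_sort]}))   # tolerated on an empty or unmapped index
print(json.dumps({"sort": [explicit_sort]}))  # the shape that triggers the 400 above
```

The second payload is the one the script catalog probe ends up sending, which is why only that leg of the health check fails.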

The other two probes in the same bundle don't hit this path: IServiceLifecycleQueryPort.ListServicesAsync and IScopeWorkflowQueryPort.ListAsync either don't sort or sort in-process after fetch (ScopeWorkflowQueryApplicationService.cs:40 orders the materialized list client-side). So the failure surfaces only on the script catalog leg.

Why now / scope of impact

  • This affects any deployment where aevatar-{env}-script-catalog-entries is empty or where updated_at_utc_value was never auto-mapped — i.e. mainnet today, and any new environment before the first script publish.
  • The same defect applies to every projection query that issues an explicit Sorts = [...] against a field that may not yet be mapped (timestamp fields are the typical victim, since proto Timestamp serializes to a nested struct under System.Text.Json defaults).
  • Existing tests assert the default-sort path correctly emits unmapped_type (test/Aevatar.CQRS.Projection.Core.Tests/ElasticsearchProjectionDocumentStoreBehaviorTests.cs:77-78), but no test covers the explicit-sort path.

Suggested fix (pick or combine)

  1. Always include the safety hints on every sort clause. In BuildSortClause, always emit "missing":"_last" and "unmapped_type":.... Infer unmapped_type from the resolved field path (*_utc_value → "date", otherwise "keyword"), or thread the proto FieldType/MessageType through fieldPathResolver so the sort builder knows the target type.
  2. Declare explicit ES mappings for sortable fields in IProjectionDocumentMetadataProvider implementations — at minimum updated_at_utc_value / created_at_utc_value as date (and document the expected JSON shape).
  3. Normalize Timestamp serialization so updated_at_utc_value lands in ES as an ISO-8601 string instead of { "seconds":..., "nanos":... }. This keeps the dynamic-mapping path honest.
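A minimal Python sketch of option 1, assuming the suffix heuristic described in that item. The function names are hypothetical and the real change would live in the C# BuildSortClause; this only shows the intended output shape.

```python
def infer_unmapped_type(resolved_field):
    """Suffix heuristic from option 1: *_utc_value is treated as a date, anything else as keyword."""
    return "date" if resolved_field.endswith("_utc_value") else "keyword"


def build_safe_sort_clause(resolved_field, direction="desc"):
    """Always emit both hints, regardless of whether the sort was explicit or defaulted."""
    return {
        resolved_field: {
            "order": direction,
            "missing": "_last",
            "unmapped_type": infer_unmapped_type(resolved_field),
        }
    }


print(build_safe_sort_clause("updated_at_utc_value"))
print(build_safe_sort_clause("script_id", "asc"))  # "script_id" is a made-up field name
```

The heuristic is deliberately narrow. Threading the proto field type through the resolver, as the same item suggests, would avoid guessing from the field name.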

Option 1 is the smallest defensible patch and unblocks the health check immediately; option 2 or 3 is the correct structural fix and should follow.
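For the structural follow-ups, the target shapes can be sketched in Python. Both helpers are hypothetical: the first shows an index mapping body that declares the two timestamp fields as date (option 2), the second converts a protobuf-style seconds/nanos pair into an ISO-8601 UTC string that the default Elasticsearch date format accepts (option 3).

```python
from datetime import datetime, timezone


def sortable_field_mappings():
    """Option 2: keep dynamic mapping but pin the sortable timestamp fields to `date`."""
    return {
        "dynamic": True,
        "properties": {
            "updated_at_utc_value": {"type": "date"},
            "created_at_utc_value": {"type": "date"},
        },
    }


def timestamp_to_iso8601(seconds, nanos=0):
    """Option 3: render seconds/nanos as ISO-8601 UTC, truncated to millisecond precision."""
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    millis = nanos // 1_000_000
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"


print(timestamp_to_iso8601(1777098664))  # -> 2026-04-25T06:31:04.000Z
```

With either change in place, a document written to the index yields a scalar updated_at_utc_value that dynamic or explicit mapping can treat as a date.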

Repro

curl -sSL https://aevatar-console-backend-api.aevatar.ai/api/health | jq '.components[] | select(.name=="gagent-service")'
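For scripted monitoring, the same filter can be sketched in Python. The top-level components array is inferred from the jq expression above. The sample payload is trimmed and partly made up (the studio entry's critical value is an assumption), so treat it as an illustration of the parsing logic only.

```python
import json

# Trimmed, partly hypothetical sample of the /api/health response body.
SAMPLE = """
{
  "components": [
    {"name": "gagent-service", "category": "capability", "critical": true, "status": "unhealthy"},
    {"name": "studio", "category": "capability", "critical": false, "status": "healthy"}
  ]
}
"""


def unhealthy_critical(payload):
    """Names of components that are marked critical and are not reporting healthy."""
    return [
        component["name"]
        for component in payload.get("components", [])
        if component.get("critical") and component.get("status") != "healthy"
    ]


print(unhealthy_critical(json.loads(SAMPLE)))
```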

Related

Not a duplicate of #355 (Channel registration startup degraded mode) — same shape (read-model bootstrap masks an unhealthy capability) but different read model, different code path.
