
Content & knowledge layer: stream content, link it to code, and answer "why does this exist" #134

Merged
zzet merged 11 commits into main from feat/content-knowledge-layer
Jun 21, 2026


@zzet zzet commented Jun 20, 2026


Summary

Models documents and content (PDF, office formats, plain text) as a first-class data class alongside code, links that content to the code it explains, and adds a graph-first why query that recovers the reasoning behind a symbol — the decisions and documents that motivated it.

Motivated by the gortex init OOM on a content-heavy repo (#120): the content lane streams large documents instead of materialising them. The bytes-in-flight admission semaphore here composes with main's recent large-read gate — both the per-file byte budget and the large-read concurrency cap are active.

What's included

Content lane

  • Bound peak parse memory with a weighted bytes-in-flight semaphore, composed with the large-read gate.
  • StreamingExtractor capability — PDFs stream page-by-page through an io.ReaderAt; office formats (pptx/xlsx) stream one zip entry at a time with token-streamed XML; text is windowed. A large document is never read whole.
  • Content sub-kinds via Meta[asset_kind] (slide / sheet / page / section); metadata-only nodes for binary/data files (parquet / npy / lance / …) that are never parsed.
  • A first-class corpus: content search profile — embedding-first, prose-tuned ranking, distinct from code's BM25-first — on one query surface, no second index.

Why-layer

  • KindRationale / EdgeMotivates graph vocabulary plus its cross-repo parallel.
  • Projects store_memory decisions / incidents / constraints into the graph as traversable rationale (sidecar stays system-of-record; reconcile on read).
  • A budgeted content→code linking pass (whole-token match, capped at max(2000, 10% of live edges) so the layer can never approach doubling the graph), with structured-signal mining (ADR / RFC-2119 / ticket) upgrading link confidence above an incidental name match.
  • A why <symbol> query: a one-hop traversal over incoming EdgeMotivates, curated rationale first, content links second.

Knowledge health

  • analyze kind=doc_staleness: deterministically flags documents that reference symbols which no longer exist (dangling) or are not yet indexed (pending). It needs no git history and, by design, produces no false positives.

Notable decisions

  • Content and code share one graph store — the why-query is a single traversal, not a cross-system join; the partition is logical (data_class), not a second datastore.
  • The single graph-size budget, max(2000, 10% of live edges), keeps the why-layer's worst case at a tenth of the live edges on large graphs, never a doubling.
  • Deferred, documented, default-off: LLM rationale mining, and blame/timestamp-based drift tiers (the deterministic dangling signal ships in their place).

Testing

  • go build ./... clean; go vet and golangci-lint clean on touched packages.
  • Per-unit -race tests; full internal/indexer suite green (624); wire-contract golden unchanged (enum constants + Meta only, no struct fields, so gcx consumers are unaffected).

Relates to #120.

zzet added 11 commits June 21, 2026 00:30
…on semaphore

Large content files (PDFs, office docs, datasets) were read whole and parsed
in-process at NumCPU concurrency with no memory bound, so a content-heavy repo
could exhaust memory during a bulk index when a cluster of large files
materialised simultaneously (whole file plus its parse tree, times every
worker).

Admit each file into extraction under a weighted bytes-in-flight budget before
it is read, so large files serialise instead of piling up; code files are tiny
and flow freely. A file larger than the whole budget is admitted alone (weight
clamped) so the semaphore can never deadlock; the held weight is released as
soon as extraction returns.
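
The admission rule above can be sketched as follows. This is a minimal illustration, not the indexer's code: byteGate, acquire and release are hypothetical names, and only the clamp, the 512 MiB default and the "0 disables" behaviour come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// byteGate is a hypothetical weighted semaphore that counts bytes in flight.
type byteGate struct {
	mu       sync.Mutex
	cond     *sync.Cond
	budget   int64
	inFlight int64
}

func newByteGate(budget int64) *byteGate {
	g := &byteGate{budget: budget}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// acquire blocks until size bytes fit under the budget and returns the weight
// actually held. A file larger than the whole budget is clamped to the budget,
// so it is admitted alone instead of waiting forever.
func (g *byteGate) acquire(size int64) int64 {
	if g.budget <= 0 {
		return 0 // a budget of 0 disables the gate
	}
	w := size
	if w > g.budget {
		w = g.budget
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.inFlight+w > g.budget {
		g.cond.Wait()
	}
	g.inFlight += w
	return w
}

// release returns a held weight and wakes any waiting files.
func (g *byteGate) release(w int64) {
	if w == 0 {
		return
	}
	g.mu.Lock()
	g.inFlight -= w
	g.mu.Unlock()
	g.cond.Broadcast()
}

func main() {
	gate := newByteGate(512 << 20) // 512 MiB, the default named above
	sizes := []int64{4 << 10, 300 << 20, 400 << 20, 2 << 30}
	var wg sync.WaitGroup
	for _, size := range sizes {
		wg.Add(1)
		go func(size int64) {
			defer wg.Done()
			held := gate.acquire(size)
			defer gate.release(held) // released as soon as extraction returns
			// ... extraction of the file would run here ...
		}(size)
	}
	wg.Wait()
	fmt.Println("bytes in flight after all releases:", gate.inFlight)
}
```

In this sketch the 300 MiB and 400 MiB files cannot hold the gate together, so one waits for the other, while the 4 KiB file fits alongside either.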

The budget defaults on (512 MiB, configurable via index.max_parse_bytes_in_flight;
0 disables) and is a no-op for ordinary source repos. Carry walk-time file size
on walkedFile so admission needs no extra stat.

Mitigates #120 (concurrency-driven OOM); streaming content extraction follows.
…-by-page

Content extractors that work a unit at a time (PDF, and the office/text formats
to follow) no longer need the whole file resident. Add an optional
StreamingExtractor capability — ExtractStream(path, io.ReaderAt, size, emit) —
mirroring the existing opt-in PreParser pattern. The bulk indexer prefers it on
the in-process route: it hands the extractor a file handle instead of
os.ReadFile'ing the whole file, so peak memory is O(one page), not O(file). The
crash-isolation subprocess route keeps the byte protocol.
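
A rough sketch of the opt-in capability and how a caller might prefer it. The Chunk type, the emit callback type and the windowedText extractor are assumptions made for illustration; the commit message only names ExtractStream and its four parameters.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Chunk is a hypothetical unit of extracted text (a page, slide, or window).
type Chunk struct {
	Index int
	Text  string
}

// Extractor is the assumed whole-file path: the caller has read data already.
type Extractor interface {
	Extract(path string, data []byte) ([]Chunk, error)
}

// StreamingExtractor is the optional capability. The emit type is an assumption.
type StreamingExtractor interface {
	ExtractStream(path string, r io.ReaderAt, size int64, emit func(Chunk) error) error
}

// windowedText emits fixed-size byte windows, holding one window at a time.
// Byte windows may split a multi-byte rune; a real extractor would align them.
type windowedText struct{ window int }

func (w windowedText) Extract(path string, data []byte) ([]Chunk, error) {
	var out []Chunk
	r := strings.NewReader(string(data))
	err := w.ExtractStream(path, r, int64(len(data)), func(c Chunk) error {
		out = append(out, c)
		return nil
	})
	return out, err
}

func (w windowedText) ExtractStream(path string, r io.ReaderAt, size int64, emit func(Chunk) error) error {
	if w.window <= 0 {
		return fmt.Errorf("%s: window must be positive", path)
	}
	buf := make([]byte, w.window)
	for i, off := 0, int64(0); off < size; i, off = i+1, off+int64(w.window) {
		n, err := r.ReadAt(buf, off)
		if err != nil && err != io.EOF {
			return err
		}
		if n == 0 {
			break
		}
		if err := emit(Chunk{Index: i, Text: string(buf[:n])}); err != nil {
			return err
		}
	}
	return nil
}

// extract prefers the streaming capability and falls back to a whole read.
func extract(e Extractor, path string, r io.ReaderAt, size int64, emit func(Chunk) error) error {
	if s, ok := e.(StreamingExtractor); ok {
		return s.ExtractStream(path, r, size, emit)
	}
	data := make([]byte, size)
	if _, err := io.ReadFull(io.NewSectionReader(r, 0, size), data); err != nil {
		return err
	}
	chunks, err := e.Extract(path, data)
	if err != nil {
		return err
	}
	for _, c := range chunks {
		if err := emit(c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	src := strings.NewReader("abcdefghij")
	_ = extract(windowedText{window: 4}, "notes.txt", src, src.Size(), func(c Chunk) error {
		fmt.Printf("%d %q\n", c.Index, c.Text)
		return nil
	})
}
```

The type assertion keeps the capability optional: extractors that only implement Extract keep working, and the streaming path never allocates more than one window.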

PDFExtractor implements it, reading each page through the io.ReaderAt; the
page-walk is shared with the byte-path Extract. A malformed document still
isolates to just its file node via the per-document recover.

Towards #120: a single very large document can no longer be materialised whole.

Add three content extractors that emit KindDoc chunks discriminated by
Meta[asset_kind]: pptx -> one "slide" per slide, xlsx -> one "sheet_region" per
worksheet (shared strings resolved), txt -> one "section" per windowed chunk.

All implement StreamingExtractor, so the office formats are read from the zip
central directory one part at a time via archive/zip per-entry readers and the
XML is token-streamed — a workbook that unzips to hundreds of MB of XML is never
materialised whole. Per-chunk text is capped; the shared-string table is
byte-bounded.
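
One way to picture the per-entry, token-streamed read using only the standard library. The function names, the maxChunkText value and the simplified slide XML are assumptions for illustration; shared-string resolution is left out.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// maxChunkText is a hypothetical per-chunk cap, in bytes.
const maxChunkText = 64

// entryText token-streams one XML part and keeps only its character data,
// stopping once limit bytes are collected. The cut may split a rune.
func entryText(r io.Reader, limit int) (string, error) {
	dec := xml.NewDecoder(r)
	var sb strings.Builder
	for sb.Len() < limit {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		cd, ok := tok.(xml.CharData)
		if !ok {
			continue
		}
		text := strings.TrimSpace(string(cd))
		if text == "" {
			continue
		}
		if sb.Len() > 0 {
			sb.WriteByte(' ')
		}
		if room := limit - sb.Len(); len(text) > room {
			text = text[:room]
		}
		sb.WriteString(text)
	}
	return sb.String(), nil
}

// streamParts opens the archive through an io.ReaderAt and reads one matching
// entry at a time, so no more than one part is being decoded at once.
func streamParts(r io.ReaderAt, size int64, prefix string, emit func(name, text string) error) error {
	zr, err := zip.NewReader(r, size)
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		if !strings.HasPrefix(f.Name, prefix) || !strings.HasSuffix(f.Name, ".xml") {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return err
		}
		text, err := entryText(rc, maxChunkText)
		rc.Close()
		if err != nil {
			return err
		}
		if err := emit(f.Name, text); err != nil {
			return err
		}
	}
	return nil
}

// buildArchive makes a small in-memory zip so the sketch runs without a file.
func buildArchive(parts [][2]string) *bytes.Reader {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, p := range parts {
		w, err := zw.Create(p[0])
		if err != nil {
			panic(err)
		}
		if _, err := io.WriteString(w, p[1]); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return bytes.NewReader(buf.Bytes())
}

func main() {
	r := buildArchive([][2]string{
		{"ppt/slides/slide1.xml", "<sld><t>Hello</t><t>world</t></sld>"},
		{"docProps/app.xml", "<p>ignored</p>"},
	})
	_ = streamParts(r, r.Size(), "ppt/slides/", func(name, text string) error {
		fmt.Printf("%s: %q\n", name, text)
		return nil
	})
}
```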

Registered before the generic forest grammars so they claim their extensions;
exact-basename detection (CMakeLists.txt) still wins over the new .txt mapping.
…nodes

Tag every content chunk and its file node (pdf/pptx/xlsx/txt) with
Meta[data_class]=content so the retrieval profile and why-layer can scope to the
content corpus without a separate store. Markdown stays untagged in the code
graph. No physical AllNodes() partition is needed: the heavy global passes
(reach, clone, dead_code) already key on code kinds, so KindDoc chunks add no
work to them; the partition is logical.

Add a metadata-only DataAssetExtractor for pure data/binary assets
(parquet/npy/npz/lance/arrow/feather): one KindFile node with size and a
size-capped streamed sha256, tagged data_class=data, never parsed. They become
listable and linkable without feeding a binary blob to any grammar.
Content chunks (pdf / office / text, tagged data_class=content) are KindDoc
nodes, so they were already excluded from the code corpus and lumped in with
docs. Add a first-class `corpus: content` that narrows to the data_class=content
chunks (excluding Markdown prose), routes through the existing doc-retrieval
channel, and engages prose-mode reranking — code-structural signals suppressed,
text + semantic channels lifted — so content retrieval is embedding-first while
code stays BM25-first. One query surface, two profiles; no second index.

Introduce the causal "why this exists" relations: KindRationale (a node
projecting a development-memory decision / incident / constraint), EdgeMotivates
(knowledge -> code symbol), and its cross-repo parallel EdgeCrossRepoMotivates,
registered in CrossRepoKindFor / BaseKindForCrossRepo / BaseKindsForCrossRepo so
DetectCrossRepoEdges materialises the motivates parallel across repo boundaries.

New enum constants and Meta only — no struct field — so the wire-contract golden
and gcx consumers are unchanged.
…dRationale

A store_memory decision / incident / constraint / invariant that is load-bearing
(pinned or importance >= 3) and anchored to code now projects into the graph as
a KindRationale node with EdgeMotivates edges to its anchored symbols and files,
so "why does X exist" is one graph hop from the code it explains.

The memory sidecar stays the system of record; the projection is a derived view,
rebuilt idempotently by evicting a sentinel virtual file (.gortex/rationale) and
re-adding the fresh eligible set. store_memory reconciles on every write; the
why query reconciles on read for memories that predate the daemon.
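
The eligibility rule and the evict-then-re-add rebuild might look roughly like this. Memory, Graph and the node ID scheme are hypothetical; only the kinds, the "pinned or importance >= 3" threshold, the anchoring requirement and the .gortex/rationale sentinel come from the text above.

```go
package main

import "fmt"

// rationaleFile is the sentinel virtual file named in the commit message.
const rationaleFile = ".gortex/rationale"

// Memory is a hypothetical view of one store_memory record.
type Memory struct {
	ID         string
	Kind       string
	Pinned     bool
	Importance int
	Anchors    []string // symbol or file IDs the memory is attached to
}

type Edge struct{ From, To string }

// Graph is a toy stand-in: node ID -> owning file, plus motivates edges.
type Graph struct {
	fileOf map[string]string
	edges  []Edge
}

func newGraph() *Graph { return &Graph{fileOf: map[string]string{}} }

// eligible reports whether a memory is load-bearing and anchored to code.
func eligible(m Memory) bool {
	switch m.Kind {
	case "decision", "incident", "constraint", "invariant":
	default:
		return false
	}
	return (m.Pinned || m.Importance >= 3) && len(m.Anchors) > 0
}

// evictFile drops every node owned by file and every edge leaving those nodes.
func (g *Graph) evictFile(file string) {
	kept := g.edges[:0]
	for _, e := range g.edges {
		if g.fileOf[e.From] != file {
			kept = append(kept, e)
		}
	}
	g.edges = kept
	for id, f := range g.fileOf {
		if f == file {
			delete(g.fileOf, id)
		}
	}
}

// reconcile rebuilds the projection from scratch, so repeated calls with the
// same memories leave the graph unchanged.
func (g *Graph) reconcile(memories []Memory) {
	g.evictFile(rationaleFile)
	for _, m := range memories {
		if !eligible(m) {
			continue
		}
		id := "rationale:" + m.ID
		g.fileOf[id] = rationaleFile
		for _, a := range m.Anchors {
			g.edges = append(g.edges, Edge{From: id, To: a})
		}
	}
}

func main() {
	g := newGraph()
	mems := []Memory{
		{ID: "m1", Kind: "decision", Pinned: true, Anchors: []string{"pkg.Parse"}},
		{ID: "m2", Kind: "incident", Importance: 1, Anchors: []string{"pkg.Load"}},
	}
	g.reconcile(mems)
	g.reconcile(mems)
	fmt.Println(len(g.edges), "motivates edge(s) after two reconciles")
}
```

Because the sidecar stays the system of record, the sketch never mutates a Memory; it only derives nodes and edges from the current eligible set.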
Add a content->code linking pass to the global graph passes: every content
KindDoc chunk (data_class=content) is scanned for code symbol names via the
artifact reference scanner (whole-token, 4-char floor, 200 refs/chunk, 1 MiB
cap), minting EdgeMotivates from the chunk to each named symbol. It runs before
DetectCrossRepoEdges, so a chunk naming a symbol in another repo gets its
cross_repo_motivates parallel for free (EdgeMotivates is in BaseKindsForCrossRepo).
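
A simplified picture of the whole-token scan. The real ScanSymbolRefs and SymbolNameIndex are not shown in this PR text, so the lowercase names and the map-based index below are assumptions; the 4-character floor and the 200-refs cap are taken from the message, and the 1 MiB cap is omitted.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

const (
	minNameLen      = 4   // 4-char floor
	maxRefsPerChunk = 200 // refs per chunk
)

// isTokenRune reports whether r can be part of an identifier-like token.
func isTokenRune(r rune) bool {
	return r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r)
}

// scanSymbolRefs returns the distinct indexed names that appear in text as
// whole tokens, in order of first appearance.
func scanSymbolRefs(text string, index map[string]bool) []string {
	var refs []string
	seen := map[string]bool{}
	tokens := strings.FieldsFunc(text, func(r rune) bool { return !isTokenRune(r) })
	for _, tok := range tokens {
		if len(tok) < minNameLen || !index[tok] || seen[tok] {
			continue
		}
		seen[tok] = true
		refs = append(refs, tok)
		if len(refs) == maxRefsPerChunk {
			break
		}
	}
	return refs
}

func main() {
	// The index is built once and reused for every chunk.
	index := map[string]bool{"StreamingExtractor": true, "Stream": true, "emit": true, "Run": true}
	text := "The StreamingExtractor calls emit once per page; Run is unaffected."
	fmt.Println(scanSymbolRefs(text, index))
}
```

Here "Stream" is indexed but never appears as its own token, so the substring inside "StreamingExtractor" does not produce a link, and "Run" falls under the length floor.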

A single budget — max(2000, 10% of live edges) — bounds the edges the why-layer
can add, so it can never approach doubling the graph; over budget the pass stops
and logs rather than silently truncating. Exposes SymbolNameIndex / ScanSymbolRefs
from the artifacts linker so the index is built once and every chunk scanned
against it.
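
The budget arithmetic, as a sketch. linkBudget and mintLinks are hypothetical names and the log line is illustrative; only the max(2000, 10% of live edges) formula and the stop-and-log behaviour come from the message.

```go
package main

import "fmt"

// linkBudget returns max(2000, 10% of live edges).
func linkBudget(liveEdges int) int {
	budget := liveEdges / 10
	if budget < 2000 {
		budget = 2000
	}
	return budget
}

// mintLinks reports how many of the candidate links would be minted and
// whether the pass stopped early because the budget was spent.
func mintLinks(candidates, liveEdges int) (minted int, stopped bool) {
	budget := linkBudget(liveEdges)
	if candidates > budget {
		fmt.Printf("link budget reached: minted %d of %d candidates\n", budget, candidates)
		return budget, true
	}
	return candidates, false
}

func main() {
	fmt.Println(linkBudget(1000))   // floor applies
	fmt.Println(linkBudget(500000)) // 10% applies
	minted, stopped := mintLinks(80000, 500000)
	fmt.Println(minted, stopped)
}
```

The 10% term takes over once the graph has more than 20,000 live edges; below that the 2000-edge floor is the effective cap.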
A new `why` tool answers "why does this code exist": a one-hop walk over the
incoming EdgeMotivates edges of a symbol, returning the knowledge that motivates
it — projected store_memory decisions / incidents (KindRationale) and the
content documents whose text names it (KindDoc). Curated rationale ranks before
lexical content matches.
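
In outline, the one-hop walk could look like the following. The Node and Edge shapes and the string values of the constants are assumptions; the constant names and the rationale-first ordering are from the message.

```go
package main

import (
	"fmt"
	"sort"
)

// The names mirror the PR's vocabulary; the string values are assumptions.
const (
	KindRationale = "rationale"
	KindDoc       = "doc"
	EdgeMotivates = "motivates"
)

type Node struct{ ID, Kind string }
type Edge struct{ From, To, Kind string }

// why collects the sources of every incoming motivates edge of symbol and
// places curated rationale ahead of lexical content matches.
func why(nodes map[string]Node, edges []Edge, symbol string) []Node {
	var out []Node
	for _, e := range edges {
		if e.Kind != EdgeMotivates || e.To != symbol {
			continue
		}
		if n, ok := nodes[e.From]; ok {
			out = append(out, n)
		}
	}
	// Stable sort keeps the original order within each group.
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Kind == KindRationale && out[j].Kind != KindRationale
	})
	return out
}

func main() {
	nodes := map[string]Node{
		"spec.pdf#p3":  {ID: "spec.pdf#p3", Kind: KindDoc},
		"rationale:m1": {ID: "rationale:m1", Kind: KindRationale},
	}
	edges := []Edge{
		{From: "spec.pdf#p3", To: "pkg.Parse", Kind: EdgeMotivates},
		{From: "rationale:m1", To: "pkg.Parse", Kind: EdgeMotivates},
		{From: "pkg.Main", To: "pkg.Parse", Kind: "calls"},
	}
	for _, n := range why(nodes, edges, "pkg.Parse") {
		fmt.Println(n.Kind, n.ID)
	}
}
```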

It reconciles the memory projection on read so decisions stored before the
daemon started are visible without a reindex. When nothing links the symbol it
returns a note pointing at corpus:content search and store_memory.
…code

Add `analyze kind=doc_staleness`: a deterministic, advisory pass over every
EdgeMotivates that flags the knowledge sources (content chunks and projected
rationale) whose code references have gone stale — "dangling" when the named
symbol is absent from the graph, "pending" when the target is unresolved (e.g.
a cross-repo symbol not yet indexed). Grouped by source, ranked dangling-first.
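
A sketch of the classification, under the assumption that each motivates edge records whether its target resolved. Ref, Finding and the Resolved flag are hypothetical; the dangling and pending definitions, the grouping by source and the dangling-first order follow the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// Ref is a hypothetical view of one motivates edge.
type Ref struct {
	Source   string // content chunk or projected rationale
	Target   string // named code symbol
	Resolved bool   // false when the target is not yet indexed
}

// Finding groups the stale references of one source.
type Finding struct {
	Source   string
	Dangling []string
	Pending  []string
}

// docStaleness flags unresolved targets as pending and resolved targets that
// are absent from the graph as dangling. Live references are never reported.
func docStaleness(refs []Ref, symbols map[string]bool) []Finding {
	index := map[string]int{}
	var out []Finding
	for _, r := range refs {
		pending := !r.Resolved
		dangling := r.Resolved && !symbols[r.Target]
		if !pending && !dangling {
			continue
		}
		i, ok := index[r.Source]
		if !ok {
			i = len(out)
			index[r.Source] = i
			out = append(out, Finding{Source: r.Source})
		}
		if dangling {
			out[i].Dangling = append(out[i].Dangling, r.Target)
		} else {
			out[i].Pending = append(out[i].Pending, r.Target)
		}
	}
	// Sources with at least one dangling reference rank first.
	sort.SliceStable(out, func(a, b int) bool {
		return len(out[a].Dangling) > 0 && len(out[b].Dangling) == 0
	})
	return out
}

func main() {
	symbols := map[string]bool{"ParseFile": true}
	refs := []Ref{
		{Source: "notes.txt#1", Target: "other/repo.Client", Resolved: false},
		{Source: "spec.pdf#p3", Target: "ParseFile", Resolved: true},
		{Source: "spec.pdf#p3", Target: "OldLoader", Resolved: true},
	}
	for _, f := range docStaleness(refs, symbols) {
		fmt.Printf("%s dangling=%v pending=%v\n", f.Source, f.Dangling, f.Pending)
	}
}
```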

Zero false positives by design: it needs no git history and only reports
references that genuinely no longer resolve. Signature / timestamp materiality
drift (a doc predating a code change) is a future blame-gated enhancement.

Classify each content chunk's text for structured rationale signals and stamp
the strongest onto its EdgeMotivates edges via Meta[signal]: "adr" (a decision
record — implemented_by / supersedes / ADR id / Decision heading), "rfc2119" (a
MUST / SHALL requirement), "ticket" (a JIRA-style id), or "lexical" (a plain name
match). Consumers — the why query and doc_staleness — can trust-weight a curated
decision above an incidental mention.
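
The classifier can be approximated with a few regular expressions. The patterns below, including the "ADR-<n>" id format, and the precedence adr > rfc2119 > ticket > lexical are assumptions drawn from the order in which the signals are listed; the actual rules are not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for each structured signal.
var (
	adrRe     = regexp.MustCompile(`(?m)\bADR-\d+\b|\bimplemented_by\b|\bsupersedes\b|^#*\s*Decision\b`)
	rfc2119Re = regexp.MustCompile(`\b(MUST|SHALL)\b`)
	ticketRe  = regexp.MustCompile(`\b[A-Z][A-Z0-9]+-\d+\b`)
)

// classifySignal returns the strongest signal found in a chunk's text.
// ADR is checked first because an id such as "ADR-12" also looks like a ticket.
func classifySignal(text string) string {
	switch {
	case adrRe.MatchString(text):
		return "adr"
	case rfc2119Re.MatchString(text):
		return "rfc2119"
	case ticketRe.MatchString(text):
		return "ticket"
	default:
		return "lexical"
	}
}

func main() {
	for _, text := range []string{
		"ADR-12 supersedes ADR-3",
		"The client MUST retry on timeout",
		"Tracked in PROJ-123",
		"ParseFile reads the header",
	} {
		fmt.Printf("%-8s %s\n", classifySignal(text), text)
	}
}
```

The value returned here is what would be stamped into Meta[signal] on the chunk's EdgeMotivates edges.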

Deterministic; runs in the index pass with no LLM. The budget-gated LLM mining
pass (default off, never inside init) is a future opt-in that fires only on
structured-signal-but-unresolved chunks.
@zzet zzet merged commit 00fb678 into main Jun 21, 2026
10 checks passed
