Content & knowledge layer: stream content, link it to code, and answer "why does this exist" by zzet · Pull Request #134 · zzet/gortex

zzet · 2026-06-20T23:00:23Z

Summary

Models documents and content (PDF, office formats, plain text) as a first-class data class alongside code, links that content to the code it explains, and adds a graph-first why query that recovers the reasoning behind a symbol — the decisions and documents that motivated it.

Motivated by the gortex init OOM on a content-heavy repo (#120): the content lane streams large documents instead of materialising them. The bytes-in-flight admission semaphore here composes with main's recent large-read gate — both the per-file byte budget and the large-read concurrency cap are active.

What's included

Content lane

Bound peak parse memory with a weighted bytes-in-flight semaphore, composed with the large-read gate.
StreamingExtractor capability — PDFs stream page-by-page through an io.ReaderAt; office formats (pptx/xlsx) stream one zip entry at a time with token-streamed XML; text is windowed. A large document is never read whole.
Content sub-kinds via Meta[asset_kind] (slide / sheet / page / section); metadata-only nodes for binary/data files (parquet / npy / lance / …) that are never parsed.
A first-class corpus: content search profile — embedding-first, prose-tuned ranking, distinct from code's BM25-first — on one query surface, no second index.

Why-layer

KindRationale / EdgeMotivates graph vocabulary plus its cross-repo parallel.
Projects store_memory decisions / incidents / constraints into the graph as traversable rationale (sidecar stays system-of-record; reconcile on read).
A budgeted content→code linking pass (whole-token match, capped at max(2000, 10% of live edges) so the layer can never approach doubling the graph), with structured-signal mining (ADR / RFC-2119 / ticket) upgrading link confidence above an incidental name match.
A why <symbol> query: a one-hop traversal over incoming EdgeMotivates, curated rationale first, content links second.

Knowledge health

analyze kind=doc_staleness: deterministically flags documents that reference symbols which no longer exist (dangling) or are not yet indexed (pending). It needs no git history and is designed to produce no false positives.

Notable decisions

Content and code share one graph store — the why-query is a single traversal, not a cross-system join; the partition is logical (data_class), not a second datastore.
The single graph-size budget, max(2000, 10% of live edges), keeps the why-layer's worst case at a small fraction of the graph's edges, never a doubling.
Deferred, documented, default-off: LLM rationale mining, and blame/timestamp-based drift tiers (the deterministic dangling signal ships in their place).

Testing

go build ./... clean; go vet and golangci-lint clean on touched packages.
Per-unit -race tests; full internal/indexer suite green (624); wire-contract golden unchanged (enum constants + Meta only, no struct fields, so gcx consumers are unaffected).

Relates to #120.

…on semaphore Large content files (PDFs, office docs, datasets) were read whole and parsed in-process at NumCPU concurrency with no memory bound, so a content-heavy repo could exhaust memory during a bulk index when a cluster of large files materialised simultaneously (whole file plus its parse tree, times every worker). Admit each file into extraction under a weighted bytes-in-flight budget before it is read, so large files serialise instead of piling up; code files are tiny and flow freely. A file larger than the whole budget is admitted alone (weight clamped) so the semaphore can never deadlock; the held weight is released as soon as extraction returns. The budget defaults on (512 MiB, configurable via index.max_parse_bytes_in_flight; 0 disables) and is a no-op for ordinary source repos. Carry walk-time file size on walkedFile so admission needs no extra stat. Mitigates #120 (concurrency-driven OOM); streaming content extraction follows.
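
The admission rule described above can be illustrated with a small weighted semaphore. This is a minimal sketch, not the code from this PR: the `byteBudget` type, its methods, and the demo sizes are hypothetical names chosen for illustration, and only the behaviour (weight clamped to the budget, released after extraction, 0 disables) is taken from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// byteBudget is a hypothetical weighted semaphore measured in bytes.
type byteBudget struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int64 // total bytes allowed in flight; <= 0 disables the gate
	inFlight int64
	peak     int64 // highest inFlight seen, kept only for the demo
}

func newByteBudget(capacity int64) *byteBudget {
	b := &byteBudget{capacity: capacity}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks until the file's weight fits and returns the weight held.
// A file larger than the whole budget is clamped, so it is admitted alone
// instead of waiting forever.
func (b *byteBudget) acquire(size int64) int64 {
	if b.capacity <= 0 || size <= 0 {
		return 0
	}
	w := size
	if w > b.capacity {
		w = b.capacity
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.inFlight+w > b.capacity {
		b.cond.Wait()
	}
	b.inFlight += w
	if b.inFlight > b.peak {
		b.peak = b.inFlight
	}
	return w
}

// release returns the held weight and wakes any waiting workers.
func (b *byteBudget) release(w int64) {
	if w == 0 {
		return
	}
	b.mu.Lock()
	b.inFlight -= w
	b.mu.Unlock()
	b.cond.Broadcast()
}

func main() {
	budget := newByteBudget(512) // tiny budget so the demo is easy to follow
	sizes := []int64{10, 2000, 300, 300}
	var wg sync.WaitGroup
	for _, size := range sizes {
		wg.Add(1)
		go func(size int64) {
			defer wg.Done()
			held := budget.acquire(size)
			defer budget.release(held)
			// Extraction of the file would run here.
		}(size)
	}
	wg.Wait()
	fmt.Println("peak bytes in flight:", budget.peak)
}
```

Because the weight is taken from the walk-time file size, a worker in this sketch never has to open the file before it is admitted.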

…-by-page Content extractors that work a unit at a time (PDF, and the office/text formats to follow) no longer need the whole file resident. Add an optional StreamingExtractor capability — ExtractStream(path, io.ReaderAt, size, emit) — mirroring the existing opt-in PreParser pattern. The bulk indexer prefers it on the in-process route: it hands the extractor a file handle instead of os.ReadFile'ing the whole file, so peak memory is O(one page), not O(file). The crash-isolation subprocess route keeps the byte protocol. PDFExtractor implements it, reading each page through the io.ReaderAt; the page-walk is shared with the byte-path Extract. A malformed document still isolates to just its file node via the per-document recover. Towards #120: a single very large document can no longer be materialised whole.

Add three content extractors that emit KindDoc chunks discriminated by Meta[asset_kind]: pptx -> one "slide" per slide, xlsx -> one "sheet_region" per worksheet (shared strings resolved), txt -> one "section" per windowed chunk. All implement StreamingExtractor, so the office formats are read from the zip central directory one part at a time via archive/zip per-entry readers and the XML is token-streamed — a workbook that unzips to hundreds of MB of XML is never materialised whole. Per-chunk text is capped; the shared-string table is byte-bounded. Registered before the generic forest grammars so they claim their extensions; exact-basename detection (CMakeLists.txt) still wins over the new .txt mapping.

…nodes Tag every content chunk and its file node (pdf/pptx/xlsx/txt) with Meta[data_class]=content so the retrieval profile and why-layer can scope to the content corpus without a separate store. Markdown stays untagged in the code graph. No physical AllNodes() partition is needed: the heavy global passes (reach, clone, dead_code) already key on code kinds, so KindDoc chunks never load them — the partition is logical. Add a metadata-only DataAssetExtractor for pure data/binary assets (parquet/npy/npz/lance/arrow/feather): one KindFile node with size and a size-capped streamed sha256, tagged data_class=data, never parsed. They become listable and linkable without feeding a binary blob to any grammar.

Content chunks (pdf / office / text, tagged data_class=content) are KindDoc nodes, so they were already excluded from the code corpus and lumped in with docs. Add a first-class `corpus: content` that narrows to the data_class=content chunks (excluding Markdown prose), routes through the existing doc-retrieval channel, and engages prose-mode reranking — code-structural signals suppressed, text + semantic channels lifted — so content retrieval is embedding-first while code stays BM25-first. One query surface, two profiles; no second index.

Introduce the causal "why this exists" relations: KindRationale (a node projecting a development-memory decision / incident / constraint), EdgeMotivates (knowledge -> code symbol), and its cross-repo parallel EdgeCrossRepoMotivates, registered in CrossRepoKindFor / BaseKindForCrossRepo / BaseKindsForCrossRepo so DetectCrossRepoEdges materialises the motivates parallel across repo boundaries. New enum constants and Meta only — no struct field — so the wire-contract golden and gcx consumers are unchanged.

…dRationale A store_memory decision / incident / constraint / invariant that is load-bearing (pinned or importance >= 3) and anchored to code now projects into the graph as a KindRationale node with EdgeMotivates edges to its anchored symbols and files, so "why does X exist" is one graph hop from the code it explains. The memory sidecar stays the system of record; the projection is a derived view, rebuilt idempotently by evicting a sentinel virtual file (.gortex/rationale) and re-adding the fresh eligible set. store_memory reconciles on every write; the why query reconciles on read for memories that predate the daemon.
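
The eligibility rule and the evict-then-re-add rebuild can be sketched as follows. The types, field names, and the `rationale:` ID prefix are hypothetical; only the rule (an eligible kind, pinned or importance >= 3, anchored to code) and the `.gortex/rationale` sentinel come from the description above.

```go
package main

import "fmt"

// rationaleFile is the sentinel virtual file that owns every projected edge.
const rationaleFile = ".gortex/rationale"

// Memory is a hypothetical view of one sidecar record.
type Memory struct {
	ID         string
	Kind       string
	Pinned     bool
	Importance int
	Anchors    []string // symbols or files the memory is attached to
}

// Edge is a simplified motivates edge from a rationale node to an anchor.
type Edge struct{ From, To string }

// Graph stores edges grouped by the file that produced them.
type Graph struct{ edgesByFile map[string][]Edge }

// eligible reports whether a memory is load-bearing and anchored to code.
func eligible(m Memory) bool {
	switch m.Kind {
	case "decision", "incident", "constraint", "invariant":
	default:
		return false
	}
	return (m.Pinned || m.Importance >= 3) && len(m.Anchors) > 0
}

// reconcile rebuilds the projection: evict the sentinel file, then re-add the
// fresh eligible set. Running it twice yields the same graph.
func (g *Graph) reconcile(memories []Memory) {
	delete(g.edgesByFile, rationaleFile)
	for _, m := range memories {
		if !eligible(m) {
			continue
		}
		for _, anchor := range m.Anchors {
			e := Edge{From: "rationale:" + m.ID, To: anchor}
			g.edgesByFile[rationaleFile] = append(g.edgesByFile[rationaleFile], e)
		}
	}
}

func main() {
	g := &Graph{edgesByFile: map[string][]Edge{}}
	memories := []Memory{
		{ID: "m1", Kind: "decision", Pinned: true, Anchors: []string{"ParseConfig"}},
		{ID: "m2", Kind: "incident", Importance: 1, Anchors: []string{"Flush"}}, // not load-bearing
		{ID: "m3", Kind: "constraint", Importance: 4},                           // not anchored
	}
	g.reconcile(memories)
	g.reconcile(memories) // idempotent: no duplicate edges
	fmt.Println(g.edgesByFile[rationaleFile])
}
```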

Add a content->code linking pass to the global graph passes: every content KindDoc chunk (data_class=content) is scanned for code symbol names via the artifact reference scanner (whole-token, 4-char floor, 200 refs/chunk, 1 MiB cap), minting EdgeMotivates from the chunk to each named symbol. It runs before DetectCrossRepoEdges, so a chunk naming a symbol in another repo gets its cross_repo_motivates parallel for free (EdgeMotivates is in BaseKindsForCrossRepo). A single budget — max(2000, 10% of live edges) — bounds the edges the why-layer can add, so it can never approach doubling the graph; over budget the pass stops and logs rather than silently truncating. Exposes SymbolNameIndex / ScanSymbolRefs from the artifacts linker so the index is built once and every chunk scanned against it.
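
The budget arithmetic and the whole-token match can be made concrete with a short sketch. The function names and the tokenisation rule (letters, digits, and underscore form a token) are assumptions for illustration; the budget formula, the 4-character floor, and the per-chunk reference cap follow the numbers given above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// linkBudget returns max(2000, 10% of live edges).
func linkBudget(liveEdges int) int {
	b := liveEdges / 10
	if b < 2000 {
		b = 2000
	}
	return b
}

// scanSymbolRefs returns the distinct known symbols that appear in text as
// whole tokens. Tokens shorter than minLen are ignored and at most maxRefs
// references are returned.
func scanSymbolRefs(text string, symbols map[string]bool, minLen, maxRefs int) []string {
	tokens := strings.FieldsFunc(text, func(r rune) bool {
		return !(unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_')
	})
	seen := make(map[string]bool)
	var refs []string
	for _, tok := range tokens {
		if len(refs) >= maxRefs {
			break
		}
		if len(tok) < minLen || !symbols[tok] || seen[tok] {
			continue
		}
		seen[tok] = true
		refs = append(refs, tok)
	}
	return refs
}

func main() {
	symbols := map[string]bool{"ParseConfig": true, "Run": true}
	chunk := "Call ParseConfig, not ParseConfigs; see Run."
	fmt.Println(scanSymbolRefs(chunk, symbols, 4, 200))
	fmt.Println(linkBudget(1000), linkBudget(100000))
}
```

In this sketch `ParseConfigs` does not match `ParseConfig` because the comparison is on whole tokens, and `Run` is dropped by the 4-character floor.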

A new `why` tool answers "why does this code exist": a one-hop walk over the incoming EdgeMotivates edges of a symbol, returning the knowledge that motivates it — projected store_memory decisions / incidents (KindRationale) and the content documents whose text names it (KindDoc). Curated rationale ranks before lexical content matches. It reconciles the memory projection on read so decisions stored before the daemon started are visible without a reindex. When nothing links the symbol it returns a note pointing at corpus:content search and store_memory.
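
The traversal itself is small. The sketch below assumes a simplified in-memory graph with string kinds (`rationale`, `doc`, `motivates`); those names and the `why` function signature are hypothetical and stand in for KindRationale, KindDoc, and EdgeMotivates.

```go
package main

import (
	"fmt"
	"sort"
)

// Node and Edge are simplified stand-ins for the graph types.
type Node struct{ ID, Kind string }
type Edge struct{ From, To, Kind string }

// rank orders curated rationale ahead of lexical content matches.
func rank(n Node) int {
	if n.Kind == "rationale" {
		return 0
	}
	return 1
}

// why walks one hop over the incoming motivates edges of symbol and returns
// the knowledge nodes behind it, curated rationale first.
func why(symbol string, nodes map[string]Node, incoming map[string][]Edge) []Node {
	var out []Node
	for _, e := range incoming[symbol] {
		if e.Kind != "motivates" {
			continue
		}
		if n, ok := nodes[e.From]; ok {
			out = append(out, n)
		}
	}
	// Stable sort keeps the original edge order within each group.
	sort.SliceStable(out, func(i, j int) bool { return rank(out[i]) < rank(out[j]) })
	return out
}

func main() {
	nodes := map[string]Node{
		"doc:design.pdf#p3": {ID: "doc:design.pdf#p3", Kind: "doc"},
		"rationale:m1":      {ID: "rationale:m1", Kind: "rationale"},
	}
	incoming := map[string][]Edge{
		"ParseConfig": {
			{From: "doc:design.pdf#p3", To: "ParseConfig", Kind: "motivates"},
			{From: "rationale:m1", To: "ParseConfig", Kind: "motivates"},
			{From: "main", To: "ParseConfig", Kind: "calls"},
		},
	}
	for _, n := range why("ParseConfig", nodes, incoming) {
		fmt.Println(n.Kind, n.ID)
	}
}
```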

…code Add `analyze kind=doc_staleness`: a deterministic, advisory pass over every EdgeMotivates that flags the knowledge sources (content chunks and projected rationale) whose code references have gone stale — "dangling" when the named symbol is absent from the graph, "pending" when the target is unresolved (e.g. a cross-repo symbol not yet indexed). Grouped by source, ranked dangling-first. Zero false positives by design: it needs no git history and only reports references that genuinely no longer resolve. Signature / timestamp materiality drift (a doc predating a code change) is a future blame-gated enhancement.
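
The classification can be sketched as a single pass over the motivates references. The `Ref` and `Finding` types, the `Unresolved` flag, and the choice to treat an unresolved target as pending before checking for absence are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Ref is a simplified motivates edge from a knowledge source to a symbol.
type Ref struct {
	Source, Target string
	Unresolved     bool // e.g. a cross-repo symbol that is not indexed yet
}

// Finding groups the stale references of one knowledge source.
type Finding struct {
	Source   string
	Dangling []string
	Pending  []string
}

// docStaleness flags references whose target is unresolved (pending) or
// absent from the graph (dangling), grouped by source, dangling-first.
func docStaleness(refs []Ref, present map[string]bool) []Finding {
	bySource := map[string]*Finding{}
	get := func(src string) *Finding {
		f := bySource[src]
		if f == nil {
			f = &Finding{Source: src}
			bySource[src] = f
		}
		return f
	}
	for _, r := range refs {
		switch {
		case r.Unresolved:
			f := get(r.Source)
			f.Pending = append(f.Pending, r.Target)
		case !present[r.Target]:
			f := get(r.Source)
			f.Dangling = append(f.Dangling, r.Target)
		}
	}
	out := make([]Finding, 0, len(bySource))
	for _, f := range bySource {
		out = append(out, *f)
	}
	sort.Slice(out, func(i, j int) bool {
		if len(out[i].Dangling) != len(out[j].Dangling) {
			return len(out[i].Dangling) > len(out[j].Dangling)
		}
		return out[i].Source < out[j].Source
	})
	return out
}

func main() {
	present := map[string]bool{"ParseConfig": true}
	refs := []Ref{
		{Source: "design.pdf#p3", Target: "ParseConfig"},
		{Source: "design.pdf#p3", Target: "OldLoader"},
		{Source: "notes.txt#1", Target: "other/repo.Client", Unresolved: true},
	}
	for _, f := range docStaleness(refs, present) {
		fmt.Printf("%s dangling=%v pending=%v\n", f.Source, f.Dangling, f.Pending)
	}
}
```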

Classify each content chunk's text for structured rationale signals and stamp the strongest onto its EdgeMotivates edges via Meta[signal]: "adr" (a decision record — implemented_by / supersedes / ADR id / Decision heading), "rfc2119" (a MUST / SHALL requirement), "ticket" (a JIRA-style id), or "lexical" (a plain name match). Consumers — the why query and doc_staleness — can trust-weight a curated decision above an incidental mention. Deterministic; runs in the index pass with no LLM. The budget-gated LLM mining pass (default off, never inside init) is a future opt-in that fires only on structured-signal-but-unresolved chunks.
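
A deterministic classifier of this kind can be written with a few regular expressions. The patterns below are illustrative guesses at what each signal might match, and the precedence order (adr, then rfc2119, then ticket, then lexical) is an assumption based on the order the signals are listed in; neither is taken from the implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for each structured signal.
var (
	// An ADR id, an implemented_by / supersedes marker, or a "Decision" heading.
	adrRe = regexp.MustCompile(`(?i)\b(ADR[- ]?\d+|implemented_by|supersedes)\b|(?im)^#+\s*decision\b`)
	// RFC 2119 requirement keywords are upper-case by convention.
	rfc2119Re = regexp.MustCompile(`\b(MUST|SHALL)\b`)
	// A JIRA-style id such as PROJ-431.
	ticketRe = regexp.MustCompile(`\b[A-Z][A-Z0-9]+-\d+\b`)
)

// classifySignal returns the strongest signal found in a chunk's text.
func classifySignal(text string) string {
	switch {
	case adrRe.MatchString(text):
		return "adr"
	case rfc2119Re.MatchString(text):
		return "rfc2119"
	case ticketRe.MatchString(text):
		return "ticket"
	default:
		return "lexical"
	}
}

func main() {
	for _, text := range []string{
		"See ADR-12 for the retry policy.",
		"The cache MUST be flushed before shutdown.",
		"Fixed in PROJ-431.",
		"This page mentions ParseConfig in passing.",
	} {
		fmt.Printf("%-8s %s\n", classifySignal(text), text)
	}
}
```

Checking the ADR pattern first matters in this sketch because an id like `ADR-12` would otherwise also match the ticket pattern.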

zzet added 11 commits June 21, 2026 00:30

zzet merged commit 00fb678 into main Jun 21, 2026
10 checks passed

zzet merged 11 commits into main from feat/content-knowledge-layer
