Content & knowledge layer: stream content, link it to code, and answer "why does this exist" #134
Merged
…on semaphore
Large content files (PDFs, office docs, datasets) were read whole and parsed in-process at NumCPU concurrency with no memory bound, so a content-heavy repo could exhaust memory during a bulk index when a cluster of large files materialised simultaneously (whole file plus its parse tree, times every worker). Admit each file into extraction under a weighted bytes-in-flight budget before it is read, so large files serialise instead of piling up; code files are tiny and flow freely. A file larger than the whole budget is admitted alone (weight clamped) so the semaphore can never deadlock; the held weight is released as soon as extraction returns. The budget is on by default (512 MiB, configurable via index.max_parse_bytes_in_flight; 0 disables) and is a no-op for ordinary source repos. Carry walk-time file size on walkedFile so admission needs no extra stat. Mitigates #120 (concurrency-driven OOM); streaming content extraction follows.
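The admission rule above can be illustrated with a small weighted gate. This is a minimal sketch, not the PR's code: `byteGate`, `newByteGate`, `clamp`, and `acquire` are hypothetical names, and the gate is built on `sync.Cond` purely to stay within the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// byteGate is a hypothetical weighted semaphore over bytes in flight.
// A budget of 0 disables it, mirroring the "0 disables" setting.
type byteGate struct {
	mu     sync.Mutex
	cond   *sync.Cond
	budget int64 // maximum bytes admitted at once
	held   int64 // bytes currently admitted
}

func newByteGate(budget int64) *byteGate {
	g := &byteGate{budget: budget}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// clamp caps a file's weight at the whole budget, so an oversized file is
// admitted alone instead of waiting forever.
func (g *byteGate) clamp(size int64) int64 {
	if size < 0 {
		return 0
	}
	if size > g.budget {
		return g.budget
	}
	return size
}

// acquire blocks until the (clamped) size fits, then returns the function
// that releases the held weight.
func (g *byteGate) acquire(size int64) (release func()) {
	if g.budget <= 0 {
		return func() {} // gate disabled
	}
	w := g.clamp(size)
	g.mu.Lock()
	for g.held+w > g.budget {
		g.cond.Wait()
	}
	g.held += w
	g.mu.Unlock()
	return func() {
		g.mu.Lock()
		g.held -= w
		g.mu.Unlock()
		g.cond.Broadcast()
	}
}

func main() {
	gate := newByteGate(512 << 20) // 512 MiB
	sizes := []int64{1 << 10, 300 << 20, 400 << 20, 2 << 30}
	var wg sync.WaitGroup
	for _, size := range sizes {
		wg.Add(1)
		go func(size int64) {
			defer wg.Done()
			release := gate.acquire(size) // admitted before the file is read
			defer release()               // released when extraction returns
			fmt.Printf("extracting %d bytes\n", size)
		}(size)
	}
	wg.Wait()
}
```

The sketch makes no fairness guarantee: a large file can wait behind a stream of small ones. Whether the real gate orders waiters is not stated in the commit message.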
…-by-page
Content extractors that work a unit at a time (PDF, and the office/text formats to follow) no longer need the whole file resident. Add an optional StreamingExtractor capability, ExtractStream(path, io.ReaderAt, size, emit), mirroring the existing opt-in PreParser pattern. The bulk indexer prefers it on the in-process route: it hands the extractor a file handle instead of reading the whole file with os.ReadFile, so peak memory is O(one page), not O(file). The crash-isolation subprocess route keeps the byte protocol. PDFExtractor implements it, reading each page through the io.ReaderAt; the page-walk is shared with the byte-path Extract. A malformed document still isolates to just its file node via the per-document recover. Towards #120: a single very large document can no longer be materialised whole.
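The opt-in capability pattern can be sketched as an interface plus a type assertion in the dispatcher. The parameter order of `ExtractStream` follows the commit message; the concrete types, the `Chunk` struct, the `Extractor` interface shape, and the toy `lineExtractor` are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// Chunk is a hypothetical unit of extracted content.
type Chunk struct {
	Index int
	Text  string
}

// Extractor is the byte path: the whole file is resident.
type Extractor interface {
	Extract(path string, data []byte) ([]Chunk, error)
}

// StreamingExtractor is the optional capability (types are assumed).
type StreamingExtractor interface {
	ExtractStream(path string, r io.ReaderAt, size int64, emit func(Chunk) error) error
}

// lineExtractor is a toy extractor that treats each line as one unit.
type lineExtractor struct{}

func (lineExtractor) ExtractStream(path string, r io.ReaderAt, size int64, emit func(Chunk) error) error {
	sc := bufio.NewScanner(io.NewSectionReader(r, 0, size))
	for i := 0; sc.Scan(); i++ {
		if err := emit(Chunk{Index: i, Text: sc.Text()}); err != nil {
			return err
		}
	}
	return sc.Err()
}

// Extract shares the unit walk with the streaming path.
func (e lineExtractor) Extract(path string, data []byte) ([]Chunk, error) {
	var out []Chunk
	err := e.ExtractStream(path, bytes.NewReader(data), int64(len(data)), func(c Chunk) error {
		out = append(out, c)
		return nil
	})
	return out, err
}

// extract prefers the streaming capability and falls back to reading whole.
func extract(e Extractor, path string, r io.ReaderAt, size int64, emit func(Chunk) error) error {
	if se, ok := e.(StreamingExtractor); ok {
		return se.ExtractStream(path, r, size, emit)
	}
	data := make([]byte, size)
	if _, err := io.ReadFull(io.NewSectionReader(r, 0, size), data); err != nil {
		return err
	}
	chunks, err := e.Extract(path, data)
	if err != nil {
		return err
	}
	for _, c := range chunks {
		if err := emit(c); err != nil {
			return err
		}
	}
	return nil
}

// countUnits runs the dispatcher over an in-memory document.
func countUnits(text string) (int, error) {
	n := 0
	err := extract(lineExtractor{}, "notes.txt", strings.NewReader(text), int64(len(text)), func(Chunk) error {
		n++
		return nil
	})
	return n, err
}

func main() {
	n, err := countUnits("page one\npage two\npage three\n")
	fmt.Println(n, err)
}
```

In a real indexer the `io.ReaderAt` would typically be an `*os.File`, so only the unit currently being emitted needs to be in memory.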
Add three content extractors that emit KindDoc chunks discriminated by Meta[asset_kind]: pptx -> one "slide" per slide, xlsx -> one "sheet_region" per worksheet (shared strings resolved), txt -> one "section" per windowed chunk. All implement StreamingExtractor, so the office formats are read from the zip central directory one part at a time via archive/zip per-entry readers and the XML is token-streamed — a workbook that unzips to hundreds of MB of XML is never materialised whole. Per-chunk text is capped; the shared-string table is byte-bounded. Registered before the generic forest grammars so they claim their extensions; exact-basename detection (CMakeLists.txt) still wins over the new .txt mapping.
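Reading an office file one zip part at a time with token-streamed XML might look like the sketch below. It uses only `archive/zip` and `encoding/xml` from the standard library. The function names, the 64 KiB cap, and the simplified slide XML in `deckTexts` are assumptions, not the PR's extractor.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

const maxChunkText = 64 << 10 // hypothetical per-chunk cap in bytes

// slideText token-streams one part and collects text inside <a:t> elements.
// The cap is applied in bytes, so it may cut a multi-byte rune.
func slideText(r io.Reader, limit int) (string, error) {
	dec := xml.NewDecoder(r)
	var sb strings.Builder
	inText := false
	for sb.Len() < limit {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			inText = t.Name.Local == "t"
		case xml.EndElement:
			inText = false
		case xml.CharData:
			if !inText {
				continue
			}
			if sb.Len() > 0 {
				sb.WriteByte(' ')
			}
			if room := limit - sb.Len(); len(t) > room {
				t = t[:room]
			}
			sb.Write(t)
		}
	}
	return sb.String(), nil
}

// extractSlides walks the zip central directory and opens one part at a time.
func extractSlides(ra io.ReaderAt, size int64, emit func(name, text string) error) error {
	zr, err := zip.NewReader(ra, size)
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		if !strings.HasPrefix(f.Name, "ppt/slides/slide") {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return err
		}
		text, err := slideText(rc, maxChunkText)
		rc.Close()
		if err != nil {
			return err
		}
		if err := emit(f.Name, text); err != nil {
			return err
		}
	}
	return nil
}

// deckTexts builds a tiny in-memory deck (simplified XML) and extracts it.
func deckTexts(slides ...string) ([]string, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for i, s := range slides {
		w, err := zw.Create(fmt.Sprintf("ppt/slides/slide%d.xml", i+1))
		if err != nil {
			return nil, err
		}
		fmt.Fprintf(w, `<p:sld xmlns:p="urn:p" xmlns:a="urn:a"><a:t>%s</a:t></p:sld>`, s)
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	var out []string
	err := extractSlides(bytes.NewReader(buf.Bytes()), int64(buf.Len()), func(_, text string) error {
		out = append(out, text)
		return nil
	})
	return out, err
}

func main() {
	texts, err := deckTexts("Quarterly plan", "Risks")
	fmt.Println(texts, err)
}
```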
…nodes
Tag every content chunk and its file node (pdf/pptx/xlsx/txt) with Meta[data_class]=content so the retrieval profile and why-layer can scope to the content corpus without a separate store. Markdown stays untagged in the code graph. No physical AllNodes() partition is needed: the heavy global passes (reach, clone, dead_code) already key on code kinds, so KindDoc chunks never load them; the partition is logical. Add a metadata-only DataAssetExtractor for pure data/binary assets (parquet/npy/npz/lance/arrow/feather): one KindFile node with size and a size-capped streamed sha256, tagged data_class=data, never parsed. They become listable and linkable without feeding a binary blob to any grammar.
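A size-capped streamed hash can be written with `io.LimitReader` feeding a `sha256` hasher, so a data asset is never held in memory. This is a sketch under assumptions: `hashCapped`, `hashBytesCapped`, the metadata keys other than `data_class`, and the cap value are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// hashCapped streams at most limit bytes of r through SHA-256 and reports
// how many bytes were actually hashed.
func hashCapped(r io.Reader, limit int64) (sum string, n int64, err error) {
	h := sha256.New()
	n, err = io.Copy(h, io.LimitReader(r, limit))
	if err != nil {
		return "", n, err
	}
	return hex.EncodeToString(h.Sum(nil)), n, nil
}

// hashBytesCapped is a convenience wrapper for in-memory data.
func hashBytesCapped(b []byte, limit int64) (string, int64) {
	sum, n, _ := hashCapped(bytes.NewReader(b), limit) // bytes.Reader does not fail
	return sum, n
}

func main() {
	data := []byte("hello world")
	sum, n := hashBytesCapped(data, 5) // only the first 5 bytes are hashed
	meta := map[string]string{
		"data_class": "data",
		"size":       fmt.Sprint(len(data)),
		"sha256":     sum, // hypothetical key name
	}
	fmt.Println(n, meta)
}
```

Because the hash covers only the capped prefix, two large files that share a prefix would share a digest; recording the file size alongside it narrows that ambiguity.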
Content chunks (pdf / office / text, tagged data_class=content) are KindDoc nodes, so they were already excluded from the code corpus and lumped in with docs. Add a first-class `corpus: content` that narrows to the data_class=content chunks (excluding Markdown prose), routes through the existing doc-retrieval channel, and engages prose-mode reranking — code-structural signals suppressed, text + semantic channels lifted — so content retrieval is embedding-first while code stays BM25-first. One query surface, two profiles; no second index.
Introduce the causal "why this exists" relations: KindRationale (a node projecting a development-memory decision / incident / constraint), EdgeMotivates (knowledge -> code symbol), and its cross-repo parallel EdgeCrossRepoMotivates, registered in CrossRepoKindFor / BaseKindForCrossRepo / BaseKindsForCrossRepo so DetectCrossRepoEdges materialises the motivates parallel across repo boundaries. New enum constants and Meta only — no struct field — so the wire-contract golden and gcx consumers are unchanged.
…dRationale
A store_memory decision / incident / constraint / invariant that is load-bearing (pinned or importance >= 3) and anchored to code now projects into the graph as a KindRationale node with EdgeMotivates edges to its anchored symbols and files, so "why does X exist" is one graph hop from the code it explains. The memory sidecar stays the system of record; the projection is a derived view, rebuilt idempotently by evicting a sentinel virtual file (.gortex/rationale) and re-adding the fresh eligible set. store_memory reconciles on every write; the why query reconciles on read for memories that predate the daemon.
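The evict-and-re-add projection can be sketched with a toy graph. Only the eligibility rule and the sentinel path come from the commit message; the `Memory`, `Edge`, and `Graph` types, the node ID scheme, and the method names are assumptions.

```go
package main

import "fmt"

// Memory is a hypothetical view of one development-memory entry.
type Memory struct {
	ID         string
	Kind       string // "decision", "incident", "constraint", "invariant"
	Pinned     bool
	Importance int
	Anchors    []string // IDs of anchored symbols or files
}

type Edge struct{ From, To, Kind string }

// Graph is a toy graph: each node records the file that owns it.
type Graph struct {
	nodeFile map[string]string
	edges    []Edge
}

const rationaleFile = ".gortex/rationale" // sentinel virtual file

func newGraph() *Graph { return &Graph{nodeFile: map[string]string{}} }

// eligible: load-bearing (pinned or importance >= 3) and anchored to code.
func eligible(m Memory) bool {
	switch m.Kind {
	case "decision", "incident", "constraint", "invariant":
	default:
		return false
	}
	return (m.Pinned || m.Importance >= 3) && len(m.Anchors) > 0
}

// evictFile drops every node owned by file and every edge touching them.
func (g *Graph) evictFile(file string) {
	gone := map[string]bool{}
	for id, f := range g.nodeFile {
		if f == file {
			gone[id] = true
			delete(g.nodeFile, id)
		}
	}
	kept := g.edges[:0]
	for _, e := range g.edges {
		if !gone[e.From] && !gone[e.To] {
			kept = append(kept, e)
		}
	}
	g.edges = kept
}

// reconcile rebuilds the projection; running it twice changes nothing.
func (g *Graph) reconcile(mems []Memory) {
	g.evictFile(rationaleFile)
	for _, m := range mems {
		if !eligible(m) {
			continue
		}
		id := "rationale:" + m.ID
		g.nodeFile[id] = rationaleFile
		for _, a := range m.Anchors {
			g.edges = append(g.edges, Edge{From: id, To: a, Kind: "motivates"})
		}
	}
}

func main() {
	g := newGraph()
	mems := []Memory{
		{ID: "m1", Kind: "decision", Pinned: true, Anchors: []string{"pkg.Foo"}},
		{ID: "m2", Kind: "incident", Importance: 1, Anchors: []string{"pkg.Foo"}},
	}
	g.reconcile(mems)
	g.reconcile(mems)
	fmt.Println(len(g.nodeFile), len(g.edges))
}
```

Rebuilding from the sidecar on every reconcile keeps the graph a pure derived view: deleting or demoting a memory removes its node on the next pass with no separate delete path.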
Add a content->code linking pass to the global graph passes: every content KindDoc chunk (data_class=content) is scanned for code symbol names via the artifact reference scanner (whole-token, 4-char floor, 200 refs/chunk, 1 MiB cap), minting EdgeMotivates from the chunk to each named symbol. It runs before DetectCrossRepoEdges, so a chunk naming a symbol in another repo gets its cross_repo_motivates parallel for free (EdgeMotivates is in BaseKindsForCrossRepo). A single budget — max(2000, 10% of live edges) — bounds the edges the why-layer can add, so it can never approach doubling the graph; over budget the pass stops and logs rather than silently truncating. Exposes SymbolNameIndex / ScanSymbolRefs from the artifacts linker so the index is built once and every chunk scanned against it.
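The budget formula and the whole-token scan can be sketched as follows. The constants come from the commit message, but how the 1 MiB cap is applied (here, by truncating the chunk text), the identifier character set, and the function names are assumptions; this is not the artifact reference scanner itself.

```go
package main

import (
	"fmt"
	"unicode"
)

const (
	minNameLen  = 4       // 4-char floor
	maxRefs     = 200     // refs per chunk
	maxScanSize = 1 << 20 // 1 MiB cap (assumed to apply to the chunk text)
)

// edgeBudget is max(2000, 10% of live edges).
func edgeBudget(liveEdges int) int {
	if b := liveEdges / 10; b > 2000 {
		return b
	}
	return 2000
}

// scanSymbolRefs returns the distinct known names that occur as whole
// tokens in text, in order of first appearance.
func scanSymbolRefs(text string, known map[string]bool) []string {
	if len(text) > maxScanSize {
		text = text[:maxScanSize]
	}
	isIdent := func(r rune) bool {
		return r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r)
	}
	seen := map[string]bool{}
	var refs []string
	start := -1
	flush := func(end int) {
		if start < 0 {
			return
		}
		tok := text[start:end]
		start = -1
		if len(tok) >= minNameLen && known[tok] && !seen[tok] && len(refs) < maxRefs {
			seen[tok] = true
			refs = append(refs, tok)
		}
	}
	for i, r := range text {
		if isIdent(r) {
			if start < 0 {
				start = i
			}
		} else {
			flush(i)
		}
	}
	flush(len(text))
	return refs
}

func main() {
	known := map[string]bool{"ParseFile": true, "Run": true}
	chunks := []string{"ParseFile streams pages; Run is too short to link."}
	budget, minted := edgeBudget(50000), 0
	for _, c := range chunks {
		for _, name := range scanSymbolRefs(c, known) {
			if minted >= budget {
				fmt.Println("why-layer edge budget reached; stopping")
				return
			}
			minted++
			fmt.Println("motivates ->", name)
		}
	}
}
```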
A new `why` tool answers "why does this code exist": a one-hop walk over the incoming EdgeMotivates edges of a symbol, returning the knowledge that motivates it — projected store_memory decisions / incidents (KindRationale) and the content documents whose text names it (KindDoc). Curated rationale ranks before lexical content matches. It reconciles the memory projection on read so decisions stored before the daemon started are visible without a reindex. When nothing links the symbol it returns a note pointing at corpus:content search and store_memory.
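The one-hop walk and its ranking can be sketched over a toy graph. The node kinds are written as plain strings and the `Graph` shape, `why` method, and sample data are hypothetical; only the traversal direction and the rationale-before-content ordering come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

type Node struct{ ID, Kind, Title string }
type Edge struct{ From, To, Kind string }

// Graph is a toy stand-in for the real graph store.
type Graph struct {
	Nodes map[string]Node
	Edges []Edge
}

// rank puts curated rationale ahead of lexical content matches.
func rank(kind string) int {
	if kind == "rationale" {
		return 0
	}
	return 1
}

// why walks the incoming motivates edges of symbol and returns their sources.
func (g *Graph) why(symbol string) []Node {
	var out []Node
	for _, e := range g.Edges {
		if e.Kind != "motivates" || e.To != symbol {
			continue
		}
		if n, ok := g.Nodes[e.From]; ok {
			out = append(out, n)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].Kind) < rank(out[j].Kind)
	})
	return out
}

// sampleGraph links one content chunk and one decision to pkg.Foo.
func sampleGraph() *Graph {
	return &Graph{
		Nodes: map[string]Node{
			"doc:design.pdf#3": {ID: "doc:design.pdf#3", Kind: "doc", Title: "Design, page 3"},
			"rationale:m1":     {ID: "rationale:m1", Kind: "rationale", Title: "Stream pages to bound memory"},
		},
		Edges: []Edge{
			{From: "doc:design.pdf#3", To: "pkg.Foo", Kind: "motivates"},
			{From: "rationale:m1", To: "pkg.Foo", Kind: "motivates"},
		},
	}
}

func main() {
	g := sampleGraph()
	hits := g.why("pkg.Foo")
	if len(hits) == 0 {
		fmt.Println("no links; try corpus:content search or store_memory")
		return
	}
	for _, n := range hits {
		fmt.Println(n.Kind, n.Title)
	}
}
```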
…code
Add `analyze kind=doc_staleness`: a deterministic, advisory pass over every EdgeMotivates that flags the knowledge sources (content chunks and projected rationale) whose code references have gone stale. A reference is "dangling" when the named symbol is absent from the graph and "pending" when the target is unresolved (e.g. a cross-repo symbol not yet indexed). Findings are grouped by source and ranked dangling-first. Zero false positives by design: the pass needs no git history and only reports references that genuinely no longer resolve. Signature / timestamp materiality drift (a doc predating a code change) is a future blame-gated enhancement.
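The classification and ordering can be sketched as a pure function over the motivates edges. The `Ref` and `Finding` types, the `Unresolved` flag, and the tie-break by source name are assumptions; the commit message only specifies the two states and the dangling-first ranking.

```go
package main

import (
	"fmt"
	"sort"
)

// Ref is a hypothetical flattened view of one motivates edge.
type Ref struct {
	Source     string // content chunk or rationale node
	Target     string // code symbol the source names
	Unresolved bool   // target not yet bound, e.g. cross-repo and not indexed
}

// Finding groups stale references by source.
type Finding struct {
	Source   string
	Dangling []string
	Pending  []string
}

// docStaleness reports only references that do not resolve right now.
func docStaleness(refs []Ref, present map[string]bool) []Finding {
	bySource := map[string]*Finding{}
	get := func(src string) *Finding {
		f := bySource[src]
		if f == nil {
			f = &Finding{Source: src}
			bySource[src] = f
		}
		return f
	}
	for _, r := range refs {
		switch {
		case r.Unresolved:
			f := get(r.Source)
			f.Pending = append(f.Pending, r.Target)
		case !present[r.Target]:
			f := get(r.Source)
			f.Dangling = append(f.Dangling, r.Target)
		}
	}
	out := make([]Finding, 0, len(bySource))
	for _, f := range bySource {
		out = append(out, *f)
	}
	// Dangling-first; ties broken by source name to keep output deterministic.
	sort.Slice(out, func(i, j int) bool {
		if len(out[i].Dangling) != len(out[j].Dangling) {
			return len(out[i].Dangling) > len(out[j].Dangling)
		}
		return out[i].Source < out[j].Source
	})
	return out
}

func main() {
	refs := []Ref{
		{Source: "doc:b", Target: "other/repo.Sym", Unresolved: true},
		{Source: "doc:a", Target: "pkg.Gone"},
		{Source: "doc:a", Target: "pkg.Live"},
	}
	for _, f := range docStaleness(refs, map[string]bool{"pkg.Live": true}) {
		fmt.Printf("%s dangling=%v pending=%v\n", f.Source, f.Dangling, f.Pending)
	}
}
```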
Classify each content chunk's text for structured rationale signals and stamp the strongest onto its EdgeMotivates edges via Meta[signal]: "adr" (a decision record — implemented_by / supersedes / ADR id / Decision heading), "rfc2119" (a MUST / SHALL requirement), "ticket" (a JIRA-style id), or "lexical" (a plain name match). Consumers — the why query and doc_staleness — can trust-weight a curated decision above an incidental mention. Deterministic; runs in the index pass with no LLM. The budget-gated LLM mining pass (default off, never inside init) is a future opt-in that fires only on structured-signal-but-unresolved chunks.
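A deterministic classifier of this kind can be approximated with a few regular expressions checked in priority order. The patterns below are illustrative guesses at what "ADR id", "Decision heading", and "JIRA-style id" might match, and the order adr, rfc2119, ticket, lexical is assumed to be the strength order; none of this is taken from the PR's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns; the real classifier's rules are not shown in the PR text.
var (
	adrIDRe     = regexp.MustCompile(`\bADR-\d+\b`)
	adrFieldRe  = regexp.MustCompile(`\b(implemented_by|supersedes)\b`)
	adrHeaderRe = regexp.MustCompile(`(?m)^\s*#*\s*Decision\b`)
	rfc2119Re   = regexp.MustCompile(`\b(MUST|SHALL)\b`)
	ticketRe    = regexp.MustCompile(`\b[A-Z][A-Z0-9]+-\d+\b`)
)

// classify returns the strongest signal found in a chunk's text.
func classify(text string) string {
	switch {
	case adrIDRe.MatchString(text), adrFieldRe.MatchString(text), adrHeaderRe.MatchString(text):
		return "adr"
	case rfc2119Re.MatchString(text):
		return "rfc2119"
	case ticketRe.MatchString(text):
		return "ticket"
	}
	return "lexical"
}

func main() {
	samples := []string{
		"ADR-12: stream pages instead of reading whole files",
		"Callers MUST close the handle after extraction",
		"Fixed in PROJ-482 by clamping the weight",
		"ParseFile is called from the indexer",
	}
	for _, s := range samples {
		// The signal would be stamped on each edge, e.g. Meta["signal"].
		fmt.Printf("%-8s %s\n", classify(s), s)
	}
}
```

An ADR id such as `ADR-12` also matches the ticket pattern, which is why the checks run strongest-first.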
Summary
Models documents and content (PDF, office formats, plain text) as a first-class data class alongside code, links that content to the code it explains, and adds a graph-first why query that recovers the reasoning behind a symbol — the decisions and documents that motivated it.
Motivated by the `gortex init` OOM on a content-heavy repo (#120): the content lane streams large documents instead of materialising them. The bytes-in-flight admission semaphore here composes with main's recent large-read gate: both the per-file byte budget and the large-read concurrency cap are active.

What's included

Content lane
- `StreamingExtractor` capability: PDFs stream page-by-page through an `io.ReaderAt`; office formats (pptx/xlsx) stream one zip entry at a time with token-streamed XML; text is windowed. A large document is never read whole.
- Content chunks discriminated by `Meta[asset_kind]` (slide / sheet / page / section); metadata-only nodes for binary/data files (parquet / npy / lance / …) that are never parsed.
- `corpus: content` search profile (embedding-first, prose-tuned ranking, distinct from code's BM25-first) on one query surface, no second index.

Why-layer
- `KindRationale` / `EdgeMotivates` graph vocabulary plus its cross-repo parallel.
- Projects `store_memory` decisions / incidents / constraints into the graph as traversable rationale (sidecar stays system-of-record; reconcile on read).
- Content-to-code linking (budgeted at `max(2000, 10% of live edges)` so the layer can never approach doubling the graph), with structured-signal mining (ADR / RFC-2119 / ticket) upgrading link confidence above an incidental name match.
- `why <symbol>` query: a one-hop traversal over incoming `EdgeMotivates`, curated rationale first, content links second.

Knowledge health
- `analyze kind=doc_staleness`: deterministically flags documents that reference symbols which no longer exist (dangling) or are not yet indexed (pending). Needs no git history and never false-positives.

Notable decisions
- A logical partition (`data_class`), not a second datastore.

Testing
- `go build ./...` clean; `go vet` and `golangci-lint` clean on touched packages.
- … `-race` tests; full `internal/indexer` suite green (624); wire-contract golden unchanged (enum constants + `Meta` only, no struct fields, so `gcx` consumers are unaffected).

Relates to #120.