Build semantic knowledge tree with LLM guardrails #27
Description
Summary
Build and maintain a semantic knowledge tree that automatically organizes ingested assets into a hierarchical taxonomy (e.g., Books → Authors → Genres) using LLM-driven reasoning with guardrail enforcement.
Scope
Complete Implementation (Phase 8 - Critical)
Implement the full pipeline described in docs/technical_specification.md § Semantic Knowledge Organization:
Stage 1: Similarity Graph Construction (src/knowledge/organizer.py)
Objective: Build unified similarity graphs from embeddings across modalities
Tasks:
- Construct modality-specific kNN graphs from embeddings:
- Media embeddings (512-d CLIP vectors)
- Document chunk embeddings (768-d text vectors)
- Audio transcript embeddings (768-d text vectors)
- Merge graphs into unified similarity matrix with configurable modality weighting
- Apply community detection algorithms (Louvain or Leiden) to identify candidate clusters
- Compute cluster cohesion scores and cross-cluster similarity metrics
- Store intermediate graph structures for incremental updates
Technology:
- networkx: Graph construction and community detection
- scikit-learn: Similarity metrics and clustering
- numpy: Matrix operations
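The graph-merge step above can be sketched in a few lines. The spec calls for networkx and scikit-learn; the sketch below uses only the standard library so it is self-contained, skips community detection, and helper names like `knn_edges` and `merge_graphs` are illustrative, not from the spec.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_edges(embeddings, k=2):
    """For each asset id, keep undirected edges to its k most similar neighbors."""
    edges = {}
    ids = list(embeddings)
    for i in ids:
        top = sorted(
            ((cosine(embeddings[i], embeddings[j]), j) for j in ids if j != i),
            reverse=True,
        )[:k]
        for sim, j in top:
            key = tuple(sorted((i, j)))
            edges[key] = max(edges.get(key, 0.0), sim)
    return edges

def merge_graphs(per_modality_edges, weights):
    """Merge modality-specific kNN edge sets into one weighted similarity graph."""
    merged = {}
    for modality, edges in per_modality_edges.items():
        w = weights.get(modality, 1.0)
        for edge, sim in edges.items():
            merged[edge] = merged.get(edge, 0.0) + w * sim
    return merged
```

A real implementation would feed the merged edge weights into `networkx` and run Louvain/Leiden on the resulting graph; the merge logic itself is the same.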
Stage 2: LLM-Based Node Labeling (src/knowledge/organizer.py)
Objective: Generate human-readable, semantically meaningful node names
Tasks:
- For each candidate cluster, select representative assets:
- Choose centroids and high-degree nodes
- Sample diverse examples across modalities
- Build structured prompts with:
- Asset metadata (titles, tags, descriptions)
- Cluster statistics (size, modalities, coherence)
- Existing tree context (parent nodes, sibling nodes)
- Call LLM API with prompt:
prompt = '''
Analyze this cluster of assets and propose a descriptive node name for our knowledge tree.

Cluster contains {count} assets:
- {sample_titles}

Tags: {common_tags}
Categories: {categories}

Existing tree structure: {parent_path}

Propose:
1. Node name (2-5 words, descriptive)
2. Hierarchy level (1-5, where 1 is root)
3. Confidence score (0.0-1.0)
4. Rationale (1-2 sentences explaining why this grouping makes sense)

Return ONLY valid JSON:
{{
  "node_name": "string",
  "hierarchy_level": integer,
  "confidence": float,
  "rationale": "string"
}}
'''
- Parse structured JSON response
- Implement fallback heuristics for low-confidence or failed LLM calls:
- Use most common tags
- Combine primary categories
- Generate from cluster centroid nearest neighbors
Configuration:
LLM_PROVIDER: gemini, openai, anthropic
LLM_MODEL_NAME: gemini-2.5-flash (default)
LLM_CONFIDENCE_THRESHOLD: 0.7 (minimum confidence to use LLM result)
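The parse-and-fallback step above can be sketched as follows; a minimal version, assuming the JSON schema from the prompt, with `parse_llm_label` as an illustrative helper name and the common-tags heuristic as the fallback.

```python
import json
from collections import Counter

CONFIDENCE_THRESHOLD = 0.7  # mirrors LLM_CONFIDENCE_THRESHOLD

def parse_llm_label(raw_response, cluster_tags):
    """Parse the LLM's JSON label; fall back to common-tag heuristics
    when the response is malformed or below the confidence threshold."""
    try:
        result = json.loads(raw_response)
        if (isinstance(result.get("node_name"), str)
                and result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD):
            return {"node_name": result["node_name"],
                    "confidence": result["confidence"],
                    "source": "llm"}
    except (json.JSONDecodeError, AttributeError):
        pass  # malformed or non-object response: use heuristics
    # Fallback: name the node after the most common tags in the cluster.
    common = [tag for tag, _ in Counter(cluster_tags).most_common(2)]
    return {"node_name": " / ".join(common) or "Unlabeled cluster",
            "confidence": 0.0,
            "source": "heuristic"}
```

Tagging the result with a `source` field makes it easy for the admin UI to flag heuristic names for closer review.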
Stage 3: Guardrails & Validation (src/knowledge/guardrails.py)
Objective: Enforce policies to prevent unsafe or invalid tree structures
Tasks:
- Implement policy validators:
- PII Detection: Block nodes containing email addresses, phone numbers, SSNs
- Profanity Filter: Reject offensive or inappropriate terms
- Sensitive Content: Flag medical, financial, or legal content for review
- Duplicate Prevention: Check for semantic similarity to existing nodes (threshold: 0.85)
- Cycle Detection: Ensure no circular parent-child relationships
- Validate hierarchy constraints:
- Maximum depth: 5 levels
- Maximum children per node: 100
- Minimum cluster size: 3 assets
- Store validation results in knowledge_node.guardrail_status:
  - pending: Awaiting admin review
  - approved: Passed all checks or approved by admin
  - rejected: Failed validation or rejected by admin
- Generate approval requests for admin UI with:
- Proposed node details
- Validation warnings/failures
- Representative asset previews
- LLM rationale
Technology:
- presidio-analyzer: PII detection
- Custom regex patterns for profanity/sensitive terms
- Graph algorithms for cycle detection
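Two of the validators above can be sketched with the standard library alone: a regex-based PII check (the real pipeline would use presidio-analyzer for broader coverage) and a DFS cycle check for proposed parent-child edges. Function names and regex patterns here are illustrative, not from the spec.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_pii(text):
    """Block node names/summaries containing emails or SSN-like patterns."""
    return bool(EMAIL_RE.search(text) or SSN_RE.search(text))

def creates_cycle(edges, new_parent, new_child):
    """Reject a proposed parent->child edge if the parent is already
    reachable from the child (adding the edge would close a cycle)."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    stack, seen = [new_child], set()
    while stack:
        node = stack.pop()
        if node == new_parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return False
```

Running the cycle check before persisting each `knowledge_edge` row keeps the tree a DAG without needing a full-graph validation pass.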
Stage 4: Tree Assembly & Prioritization (src/knowledge/tree_models.py)
Objective: Build parent-child relationships and compute priority scores
Tasks:
- Assemble relationships based on:
- LLM-recommended hierarchy levels
- Semantic similarity between node embeddings
- Guardrail validation results
- Compute priority scores using weighted formula:
priority = (
    0.4 * recency_score +       # Recent assets ranked higher
    0.3 * access_frequency +    # User interaction history
    0.2 * semantic_coherence +  # Cluster tightness
    0.1 * size_normalization    # Penalize very small/large clusters
)
- Update cluster centroids by aggregating embeddings for each node:
node_embedding = mean([asset.embedding for asset in node.assets])
- Persist relationships in knowledge_edge table with rationale
- Publish diff events for downstream systems (UI, search re-ranker)
- Generate changelog entries in lineage table
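The priority formula and centroid update above reduce to a few lines; a minimal sketch, assuming all four component scores are pre-normalized to [0, 1] (the function names are illustrative).

```python
def priority_score(recency, frequency, coherence, size_norm):
    """Weighted priority per the Stage 4 formula; inputs in [0, 1]."""
    return (0.4 * recency
            + 0.3 * frequency
            + 0.2 * coherence
            + 0.1 * size_norm)

def node_centroid(asset_embeddings):
    """Element-wise mean of member asset embeddings (the node embedding)."""
    n = len(asset_embeddings)
    return [sum(dim) / n for dim in zip(*asset_embeddings)]
```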
Stage 5: Continuous Learning (src/knowledge/organizer.py)
Objective: Maintain tree quality as new assets arrive
Tasks:
- Monitor new assets for taxonomy drift:
- Compute distance to nearest existing nodes
- Trigger re-balancing when threshold exceeded (>20% of assets orphaned)
- Implement partial re-balancing:
- Only recompute affected subtrees
- Preserve stable nodes to avoid UI churn
- Admin review workflow:
- Display pending node proposals in admin UI
- Allow approval/rejection with rationale
- Support manual override of LLM decisions
- Enable drag-and-drop tree restructuring
- Version control for tree structure:
- Store snapshots before major changes
- Maintain rollback capability
- Track change history with timestamps and user IDs
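The drift trigger above can be sketched as a single check over newly ingested assets; a minimal version, assuming each new asset's distance to its nearest node centroid is already computed, with `max_distance=0.5` as an illustrative cutoff (only the 20% orphan threshold comes from the spec).

```python
def needs_rebalance(nearest_node_distances, max_distance=0.5,
                    orphan_threshold=0.2):
    """Trigger re-balancing when more than orphan_threshold (20%) of
    newly ingested assets sit farther than max_distance from every
    existing node centroid, i.e. are 'orphaned' by the current tree."""
    if not nearest_node_distances:
        return False
    orphans = sum(1 for d in nearest_node_distances if d > max_distance)
    return orphans / len(nearest_node_distances) > orphan_threshold
```

When the check fires, only the subtrees nearest the orphaned assets need recomputing, which is what keeps partial re-balancing cheap.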
Database Schema
Implement tables from docs/technical_specification.md § Data Models & Database Schema:
-- Hierarchy nodes
CREATE TABLE knowledge_node (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
path TEXT[] NOT NULL,
depth INTEGER NOT NULL,
summary TEXT,
guardrail_status TEXT CHECK (guardrail_status IN ('pending', 'approved', 'rejected')) DEFAULT 'pending',
embedding vector(768),
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(path)
);
CREATE INDEX idx_knowledge_node_depth ON knowledge_node(depth);
CREATE INDEX idx_knowledge_node_path_gin ON knowledge_node USING GIN(path);
CREATE INDEX idx_knowledge_node_embedding_hnsw ON knowledge_node USING hnsw (embedding vector_cosine_ops);
-- Parent-child relationships
CREATE TABLE knowledge_edge (
parent_id UUID REFERENCES knowledge_node(id) ON DELETE CASCADE,
child_id UUID REFERENCES knowledge_node(id) ON DELETE CASCADE,
priority_score REAL,
rationale TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (parent_id, child_id)
);
CREATE INDEX idx_knowledge_edge_parent ON knowledge_edge(parent_id);
CREATE INDEX idx_knowledge_edge_child ON knowledge_edge(child_id);
Search Integration
Extend search pipeline to use knowledge tree context (docs/technical_specification.md § Search & Retrieval Pipeline):
- Add knowledge tree path to search results:
{
  "id": "asset-uuid-1",
  "similarity": 0.92,
  "knowledge_path": ["Books", "Ursula K. Le Guin", "Sci-Fi"],
  "knowledge_node_id": "node-uuid-1"
}
- Boost results that match the user's current tree navigation context
- Allow filtering by knowledge node (?knowledge_node_id=uuid)
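The context-boost above can be sketched as a re-ranking pass over results shaped like the JSON example; a minimal version where `boost_by_context` and the additive `boost` value are illustrative assumptions, not spec-defined names.

```python
def boost_by_context(results, current_path, boost=0.1):
    """Re-rank search results: results whose knowledge_path starts with
    the user's current tree navigation path get a similarity boost."""
    def boosted(r):
        path = r.get("knowledge_path", [])
        if current_path and path[:len(current_path)] == current_path:
            return r["similarity"] + boost
        return r["similarity"]
    return sorted(results, key=boosted, reverse=True)
```

An additive boost keeps the base vector-similarity ordering within each subtree while letting in-context results leapfrog slightly stronger out-of-context ones.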
Acceptance Criteria
- Similarity graphs constructed from multi-modal embeddings
- LLM generates node names with confidence scores and rationales
- Guardrails validate nodes against policy rules (PII, profanity, duplicates, cycles)
- knowledge_node and knowledge_edge tables populated
- Admin UI displays pending nodes for review
- Approved nodes appear in search results with breadcrumbs
- Priority scores computed using recency, frequency, coherence, and size
- Continuous learning monitors for drift and triggers re-balancing
- Lineage tracks all tree changes with timestamps and rationale
- E2E test: upload 50 related assets → nodes proposed → admin approves → search shows hierarchy
- Performance: tree refresh completes in < 5 minutes for 10K assets
Configuration
Add to settings.py:
# Knowledge Organization
KNOWLEDGE_TREE_ENABLED = True
KNOWLEDGE_TREE_REFRESH_INTERVAL_MINUTES = 10
KNOWLEDGE_GUARDRAIL_THRESHOLD = 0.72
LLM_PROVIDER = "gemini" # or openai, anthropic
LLM_MODEL_NAME = "gemini-2.5-flash"
LLM_CONFIDENCE_THRESHOLD = 0.7
MAX_TREE_DEPTH = 5
MAX_CHILDREN_PER_NODE = 100
MIN_CLUSTER_SIZE = 3
DUPLICATE_NODE_SIMILARITY_THRESHOLD = 0.85
Testing Requirements
- Unit tests for similarity graph construction
- Unit tests for LLM prompt generation and response parsing
- Unit tests for each guardrail policy (PII, profanity, duplicates, cycles)
- Unit tests for priority score calculation
- Integration test: full pipeline from graph to approved nodes
- Integration test: search with knowledge tree context
- Admin workflow test: pending → approved → search
References
- docs/technical_specification.md § Semantic Knowledge Organization (lines 1711-1767)
- docs/technical_specification.md § Search & Retrieval Pipeline (lines 1588-1707)
- docs/technical_specification.md § Data Models & Database Schema (lines 2034-2072)
- docs/technical_specification.md § Configuration Reference (lines 2292-2298)
- docs/technical_specification.md § Implementation Checklist Phase 8 (lines 2368-2374)