This repository was archived by the owner on Nov 15, 2025. It is now read-only.

Build semantic knowledge tree with LLM guardrails #27

@thewildofficial

Summary

Build and maintain a semantic knowledge tree that automatically organizes ingested assets into a hierarchical taxonomy (e.g., Books → Authors → Genres) using LLM-driven reasoning with guardrail enforcement.

Scope

Complete Implementation (Phase 8 - Critical)

Implement the full pipeline described in docs/technical_specification.md § Semantic Knowledge Organization:

Stage 1: Similarity Graph Construction (src/knowledge/organizer.py)

Objective: Build unified similarity graphs from embeddings across modalities

Tasks:

  • Construct modality-specific kNN graphs from embeddings:
    • Media embeddings (512-d CLIP vectors)
    • Document chunk embeddings (768-d text vectors)
    • Audio transcript embeddings (768-d text vectors)
  • Merge graphs into unified similarity matrix with configurable modality weighting
  • Apply community detection algorithms (Louvain or Leiden) to identify candidate clusters
  • Compute cluster cohesion scores and cross-cluster similarity metrics
  • Store intermediate graph structures for incremental updates

Technology:

  • networkx: Graph construction and community detection
  • scikit-learn: Similarity metrics and clustering
  • numpy: Matrix operations
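The graph construction and merge steps above can be sketched as follows. This is a minimal standard-library illustration, not the project's implementation: the function names `knn_graph` and `merge_graphs`, the edge-dictionary representation, and the example modality weights are all assumptions. Because media vectors (512-d) and text vectors (768-d) live in different spaces, each modality gets its own kNN graph and the merge happens at the edge level.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_graph(embeddings, k):
    """Build an undirected kNN graph for one modality.

    embeddings: {asset_id: vector}; returns {(u, v): similarity} with u < v.
    """
    edges = {}
    for u, vec_u in embeddings.items():
        scored = sorted(
            ((cosine(vec_u, vec_v), v) for v, vec_v in embeddings.items() if v != u),
            reverse=True,
        )
        for sim, v in scored[:k]:
            edges[tuple(sorted((u, v)))] = sim
    return edges


def merge_graphs(graphs, weights):
    """Combine per-modality edge weights into one weighted edge set.

    graphs: {modality: {(u, v): similarity}}; weights: {modality: float} (hypothetical config).
    """
    merged = defaultdict(float)
    for modality, edges in graphs.items():
        for edge, sim in edges.items():
            merged[edge] += weights[modality] * sim
    return dict(merged)
```

The merged edge set could then be loaded into a `networkx.Graph` with `add_weighted_edges_from` and clustered with `networkx.community.louvain_communities`, assuming a networkx version that ships it (2.8 or later).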

Stage 2: LLM-Based Node Labeling (src/knowledge/organizer.py)

Objective: Generate human-readable, semantically meaningful node names

Tasks:

  • For each candidate cluster, select representative assets:
    • Choose centroids and high-degree nodes
    • Sample diverse examples across modalities
  • Build structured prompts with:
    • Asset metadata (titles, tags, descriptions)
    • Cluster statistics (size, modalities, coherence)
    • Existing tree context (parent nodes, sibling nodes)
  • Call LLM API with prompt:
    prompt = '''
    Analyze this cluster of assets and propose a descriptive node name for our knowledge tree.
    
    Cluster contains {count} assets:
    - {sample_titles}
    
    Tags: {common_tags}
    Categories: {categories}
    
    Existing tree structure: {parent_path}
    
    Propose:
    1. Node name (2-5 words, descriptive)
    2. Hierarchy level (1-5, where 1 is root)
    3. Confidence score (0.0-1.0)
    4. Rationale (1-2 sentences explaining why this grouping makes sense)
    
    Return ONLY valid JSON:
    {{
      "node_name": "string",
      "hierarchy_level": integer,
      "confidence": float,
      "rationale": "string"
    }}
    '''
  • Parse structured JSON response
  • Implement fallback heuristics for low-confidence or failed LLM calls:
    • Use most common tags
    • Combine primary categories
    • Generate from cluster centroid nearest neighbors
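Response parsing and the tag-based fallback could look roughly like the sketch below. The function names, the `"Uncategorized"` default, and the choice to treat any malformed field as a failed call are assumptions for illustration. Only the JSON keys and the 0.7 threshold come from this issue.

```python
import json
from collections import Counter

CONFIDENCE_THRESHOLD = 0.7  # LLM_CONFIDENCE_THRESHOLD from this issue


def parse_proposal(raw):
    """Return a validated proposal dict, or None if the LLM reply is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    name = data.get("node_name")
    level = data.get("hierarchy_level")
    conf = data.get("confidence")
    if not isinstance(name, str) or not name.strip():
        return None
    if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 5:
        return None
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        return None
    return {
        "node_name": name.strip(),
        "hierarchy_level": level,
        "confidence": float(conf),
        "rationale": str(data.get("rationale", "")),
    }


def fallback_name(asset_tags, max_tags=2):
    """Heuristic label built from the most common tags (one of the listed fallbacks)."""
    counts = Counter(tag for tags in asset_tags for tag in tags)
    top = [tag for tag, _ in counts.most_common(max_tags)]
    return " ".join(tag.title() for tag in top) or "Uncategorized"  # hypothetical default


def label_cluster(raw_reply, asset_tags):
    """Use the LLM name when it parses and clears the threshold, else fall back."""
    proposal = parse_proposal(raw_reply)
    if proposal and proposal["confidence"] >= CONFIDENCE_THRESHOLD:
        return proposal["node_name"], "llm"
    return fallback_name(asset_tags), "fallback"
```

Returning the source (`"llm"` or `"fallback"`) alongside the name lets the admin UI show reviewers how a label was produced.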

Configuration:

  • LLM_PROVIDER: gemini, openai, anthropic
  • LLM_MODEL_NAME: gemini-2.5-flash (default)
  • LLM_CONFIDENCE_THRESHOLD: 0.7 (minimum confidence to use LLM result)

Stage 3: Guardrails & Validation (src/knowledge/guardrails.py)

Objective: Enforce policies to prevent unsafe or invalid tree structures

Tasks:

  • Implement policy validators:
    • PII Detection: Block nodes containing email addresses, phone numbers, SSNs
    • Profanity Filter: Reject offensive or inappropriate terms
    • Sensitive Content: Flag medical, financial, or legal content for review
    • Duplicate Prevention: Check for semantic similarity to existing nodes (threshold: 0.85)
    • Cycle Detection: Ensure no circular parent-child relationships
  • Validate hierarchy constraints:
    • Maximum depth: 5 levels
    • Maximum children per node: 100
    • Minimum cluster size: 3 assets
  • Store validation results in knowledge_node.guardrail_status:
    • pending: Awaiting admin review
    • approved: Passed all checks or approved by admin
    • rejected: Failed validation or rejected by admin
  • Generate approval requests for admin UI with:
    • Proposed node details
    • Validation warnings/failures
    • Representative asset previews
    • LLM rationale
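A simplified sketch of how the validators might map onto guardrail_status values. The regex patterns are a rough stand-in for presidio-analyzer, the `SENSITIVE_TERMS` list is hypothetical, and the status mapping (hard failures reject, warnings go to review, clean nodes are approved) is one reading of the rules above rather than a confirmed design. The profanity filter is omitted.

```python
import re

MAX_TREE_DEPTH = 5
MAX_CHILDREN_PER_NODE = 100
MIN_CLUSTER_SIZE = 3

# Illustrative patterns only; real PII detection would go through presidio-analyzer.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
SENSITIVE_TERMS = {"medical", "diagnosis", "bank", "lawsuit"}  # hypothetical list


def validate_node(name, depth, sibling_count, cluster_size):
    """Return (guardrail_status, issues) for a proposed node.

    sibling_count is the number of children the intended parent already has.
    """
    failures, warnings = [], []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(name):
            failures.append(f"pii:{label}")
    if depth > MAX_TREE_DEPTH:
        failures.append("max_depth")
    if sibling_count >= MAX_CHILDREN_PER_NODE:
        failures.append("max_children")
    if cluster_size < MIN_CLUSTER_SIZE:
        failures.append("min_cluster_size")
    words = set(re.findall(r"[a-z]+", name.lower()))
    if words & SENSITIVE_TERMS:
        warnings.append("sensitive_content")
    if failures:
        status = "rejected"
    elif warnings:
        status = "pending"  # flagged for admin review
    else:
        status = "approved"
    return status, failures + warnings
```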

Technology:

  • presidio-analyzer: PII detection
  • Custom regex patterns for profanity/sensitive terms
  • Graph algorithms for cycle detection
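For cycle detection, one straightforward approach is to check, before inserting an edge, whether the proposed parent is already reachable from the proposed child. The sketch below assumes edges are available as `(parent_id, child_id)` pairs, mirroring the knowledge_edge table. The function name is hypothetical.

```python
from collections import defaultdict


def would_create_cycle(edges, parent_id, child_id):
    """True if adding parent_id -> child_id would introduce a cycle.

    edges: iterable of existing (parent_id, child_id) pairs.
    """
    if parent_id == child_id:
        return True  # self-loop
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    # Depth-first search downward from the child; reaching the parent means a cycle.
    stack, seen = [child_id], set()
    while stack:
        node = stack.pop()
        if node == parent_id:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return False
```

This only guards against cycles. Whether a node may have more than one parent is a separate policy question that the knowledge_edge primary key does not settle.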

Stage 4: Tree Assembly & Prioritization (src/knowledge/tree_models.py)

Objective: Build parent-child relationships and compute priority scores

Tasks:

  • Assemble relationships based on:
    • LLM-recommended hierarchy levels
    • Semantic similarity between node embeddings
    • Guardrail validation results
  • Compute priority scores using weighted formula:
    priority = (
        0.4 * recency_score +        # Recent assets ranked higher
        0.3 * access_frequency +      # User interaction history
        0.2 * semantic_coherence +    # Cluster tightness
        0.1 * size_normalization      # Penalize very small/large clusters
    )
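  • Update cluster centroids by taking the element-wise mean of the asset embeddings in each node. This assumes all embeddings in a node share one dimension; the 512-d CLIP vectors do not fit the 768-d knowledge_node.embedding column as-is:
    import numpy as np
    node_embedding = np.mean([asset.embedding for asset in node.assets], axis=0)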
  • Persist relationships in knowledge_edge table with rationale
  • Publish diff events for downstream systems (UI, search re-ranker)
  • Generate changelog entries in lineage table
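The priority formula and centroid update can be written as plain functions for unit testing. This sketch assumes each input score is already normalized to [0, 1], which the issue does not state explicitly; the function names are illustrative.

```python
PRIORITY_WEIGHTS = {
    "recency_score": 0.4,
    "access_frequency": 0.3,
    "semantic_coherence": 0.2,
    "size_normalization": 0.1,
}


def priority_score(recency_score, access_frequency, semantic_coherence, size_normalization):
    """Weighted sum from the formula above; inputs are assumed to lie in [0, 1]."""
    scores = {
        "recency_score": recency_score,
        "access_frequency": access_frequency,
        "semantic_coherence": semantic_coherence,
        "size_normalization": size_normalization,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(PRIORITY_WEIGHTS[name] * value for name, value in scores.items())


def node_embedding(asset_embeddings):
    """Element-wise mean of equal-length asset embeddings."""
    if not asset_embeddings:
        raise ValueError("node has no assets")
    dim = len(asset_embeddings[0])
    if any(len(vec) != dim for vec in asset_embeddings):
        raise ValueError("mixed embedding dimensions")
    count = len(asset_embeddings)
    return [sum(vec[i] for vec in asset_embeddings) / count for i in range(dim)]
```

Since the weights sum to 1.0, the resulting priority also stays in [0, 1], which keeps it comparable across nodes.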

Stage 5: Continuous Learning (src/knowledge/organizer.py)

Objective: Maintain tree quality as new assets arrive

Tasks:

  • Monitor new assets for taxonomy drift:
    • Compute distance to nearest existing nodes
    • Trigger re-balancing when the orphan threshold is exceeded (more than 20% of assets orphaned)
  • Implement partial re-balancing:
    • Only recompute affected subtrees
    • Preserve stable nodes to avoid UI churn
  • Admin review workflow:
    • Display pending node proposals in admin UI
    • Allow approval/rejection with rationale
    • Support manual override of LLM decisions
    • Enable drag-and-drop tree restructuring
  • Version control for tree structure:
    • Store snapshots before major changes
    • Maintain rollback capability
    • Track change history with timestamps and user IDs
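The drift check could be sketched as below. The issue does not define when an asset counts as orphaned, so this example assumes an asset is orphaned when its cosine similarity to the nearest node embedding falls below a hypothetical `ORPHAN_SIMILARITY_CUTOFF`. Only the 20% trigger comes from the task list.

```python
import math

ORPHAN_SIMILARITY_CUTOFF = 0.5    # hypothetical; not specified in this issue
REBALANCE_ORPHAN_FRACTION = 0.20  # ">20% of assets orphaned"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drift_report(new_assets, node_embeddings):
    """Summarize how well new assets fit the existing nodes.

    new_assets: {asset_id: vector}; node_embeddings: {node_id: vector}.
    """
    orphaned = []
    for asset_id, vec in new_assets.items():
        best = max(
            (cosine(vec, node_vec) for node_vec in node_embeddings.values()),
            default=-1.0,  # no nodes yet: every asset is an orphan
        )
        if best < ORPHAN_SIMILARITY_CUTOFF:
            orphaned.append(asset_id)
    fraction = len(orphaned) / len(new_assets) if new_assets else 0.0
    return {
        "orphaned": orphaned,
        "orphan_fraction": fraction,
        "rebalance": fraction > REBALANCE_ORPHAN_FRACTION,
    }
```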

Database Schema

Implement tables from docs/technical_specification.md § Data Models & Database Schema:

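-- pgvector provides the vector type and HNSW index used below (skip if already enabled)
CREATE EXTENSION IF NOT EXISTS vector;

-- Hierarchy nodes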
CREATE TABLE knowledge_node (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    path TEXT[] NOT NULL,
    depth INTEGER NOT NULL,
    summary TEXT,
    guardrail_status TEXT CHECK (guardrail_status IN ('pending', 'approved', 'rejected')) DEFAULT 'pending',
    embedding vector(768),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(path)
);

CREATE INDEX idx_knowledge_node_depth ON knowledge_node(depth);
CREATE INDEX idx_knowledge_node_path_gin ON knowledge_node USING GIN(path);
CREATE INDEX idx_knowledge_node_embedding_hnsw ON knowledge_node USING hnsw (embedding vector_cosine_ops);

-- Parent-child relationships
CREATE TABLE knowledge_edge (
    parent_id UUID REFERENCES knowledge_node(id) ON DELETE CASCADE,
    child_id UUID REFERENCES knowledge_node(id) ON DELETE CASCADE,
    priority_score REAL,
    rationale TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (parent_id, child_id)
);

CREATE INDEX idx_knowledge_edge_parent ON knowledge_edge(parent_id);
CREATE INDEX idx_knowledge_edge_child ON knowledge_edge(child_id);

Search Integration

Extend search pipeline to use knowledge tree context (docs/technical_specification.md § Search & Retrieval Pipeline):

  • Add knowledge tree path to search results:
    {
      "id": "asset-uuid-1",
      "similarity": 0.92,
      "knowledge_path": ["Books", "Ursula K. Le Guin", "Sci-Fi"],
      "knowledge_node_id": "node-uuid-1"
    }
  • Boost results that match user's current tree navigation context
  • Allow filtering by knowledge node (?knowledge_node_id=uuid)
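One possible shape for the boost and filter steps, operating on result dictionaries like the one shown above. The boost size (`NAV_CONTEXT_BOOST`), the prefix-overlap scoring, and the function names are assumptions; the issue only says that results matching the navigation context should be boosted.

```python
NAV_CONTEXT_BOOST = 0.1  # hypothetical maximum boost for a full path match


def shared_prefix_len(path_a, path_b):
    """Number of leading path segments two knowledge paths have in common."""
    count = 0
    for seg_a, seg_b in zip(path_a, path_b):
        if seg_a != seg_b:
            break
        count += 1
    return count


def rerank(results, nav_path, boost=NAV_CONTEXT_BOOST):
    """Return results sorted by similarity plus a boost proportional to path overlap."""
    scored = []
    for result in results:
        overlap = shared_prefix_len(result.get("knowledge_path", []), nav_path)
        bonus = boost * overlap / len(nav_path) if nav_path else 0.0
        scored.append({**result, "score": result["similarity"] + bonus})
    return sorted(scored, key=lambda r: r["score"], reverse=True)


def filter_by_node(results, knowledge_node_id):
    """Keep only results attached to the given node (the ?knowledge_node_id=uuid filter)."""
    return [r for r in results if r.get("knowledge_node_id") == knowledge_node_id]
```

Scaling the boost by the fraction of the navigation path matched means a result deep inside the user's current subtree outranks one that only shares the root.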

Acceptance Criteria

  • Similarity graphs constructed from multi-modal embeddings
  • LLM generates node names with confidence scores and rationales
  • Guardrails validate nodes against policy rules (PII, profanity, duplicates, cycles)
  • knowledge_node and knowledge_edge tables populated
  • Admin UI displays pending nodes for review
  • Approved nodes appear in search results with breadcrumbs
  • Priority scores computed using recency, frequency, coherence, and size
  • Continuous learning monitors for drift and triggers re-balancing
  • Lineage tracks all tree changes with timestamps and rationale
  • E2E test: upload 50 related assets → nodes proposed → admin approves → search shows hierarchy
  • Performance: tree refresh completes in < 5 minutes for 10K assets

Configuration

Add to settings.py:

# Knowledge Organization
KNOWLEDGE_TREE_ENABLED = True
KNOWLEDGE_TREE_REFRESH_INTERVAL_MINUTES = 10
KNOWLEDGE_GUARDRAIL_THRESHOLD = 0.72
LLM_PROVIDER = "gemini"  # or openai, anthropic
LLM_MODEL_NAME = "gemini-2.5-flash"
LLM_CONFIDENCE_THRESHOLD = 0.7
MAX_TREE_DEPTH = 5
MAX_CHILDREN_PER_NODE = 100
MIN_CLUSTER_SIZE = 3
DUPLICATE_NODE_SIMILARITY_THRESHOLD = 0.85

Testing Requirements

  • Unit tests for similarity graph construction
  • Unit tests for LLM prompt generation and response parsing
  • Unit tests for each guardrail policy (PII, profanity, duplicates, cycles)
  • Unit tests for priority score calculation
  • Integration test: full pipeline from graph to approved nodes
  • Integration test: search with knowledge tree context
  • Admin workflow test: pending → approved → search

References

  • docs/technical_specification.md § Semantic Knowledge Organization (lines 1711-1767)
  • docs/technical_specification.md § Search & Retrieval Pipeline (lines 1588-1707)
  • docs/technical_specification.md § Data Models & Database Schema (lines 2034-2072)
  • docs/technical_specification.md § Configuration Reference (lines 2292-2298)
  • docs/technical_specification.md § Implementation Checklist Phase 8 (lines 2368-2374)

Metadata

  • Assignees: none
  • Labels: enhancement (New feature or request)
  • Projects: none
  • Milestone: none