Build semantic knowledge tree with LLM guardrails

## Summary
Build and maintain a semantic knowledge tree that automatically organizes ingested assets into a hierarchical taxonomy (e.g., Books → Authors → Genres) using LLM-driven reasoning with guardrail enforcement.

## Scope

### Complete Implementation (Phase 8 - Critical)
Implement the full pipeline described in **docs/technical_specification.md § Semantic Knowledge Organization**:

#### Stage 1: Similarity Graph Construction (`src/knowledge/organizer.py`)
**Objective:** Build unified similarity graphs from embeddings across modalities

**Tasks:**
- Construct modality-specific kNN graphs from embeddings:
  - Media embeddings (512-d CLIP vectors)
  - Document chunk embeddings (768-d text vectors)
  - Audio transcript embeddings (768-d text vectors)
- Merge the graphs into a unified similarity matrix with configurable per-modality weights
- Apply community detection algorithms (Louvain or Leiden) to identify candidate clusters
- Compute cluster cohesion scores and cross-cluster similarity metrics
- Store intermediate graph structures for incremental updates

**Technology:**
- `networkx`: Graph construction and Louvain community detection (Leiden is not included in `networkx` and needs a separate package such as `leidenalg`)
- `scikit-learn`: Similarity metrics and clustering
- `numpy`: Matrix operations
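
The Stage 1 tasks could be sketched roughly as follows. This is a minimal illustration, not the specified implementation: the function names (`build_knn_graph`, `merge_graphs`, `cluster_cohesion`, `detect_clusters`) are hypothetical, cohesion is assumed to mean the average intra-cluster edge weight, and the merge assumes an asset keeps the same ID in every modality graph it appears in.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import louvain_communities
from sklearn.neighbors import NearestNeighbors


def build_knn_graph(ids, embeddings, k=5, weight=1.0):
    """Build an undirected kNN graph; edge weight = modality weight * cosine similarity."""
    embeddings = np.asarray(embeddings, dtype=float)
    graph = nx.Graph()
    graph.add_nodes_from(ids)
    k = min(k, len(ids) - 1)
    if k < 1:
        return graph
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)
    for row in range(len(ids)):
        for dist, col in zip(distances[row], indices[row]):
            if col == row:
                continue  # skip the self-match
            similarity = 1.0 - float(dist)
            if similarity > 0.0:
                graph.add_edge(ids[row], ids[col], weight=weight * similarity)
    return graph


def merge_graphs(graphs):
    """Sum edge weights across modality graphs that share asset IDs."""
    merged = nx.Graph()
    for graph in graphs:
        merged.add_nodes_from(graph.nodes)
        for u, v, w in graph.edges(data="weight"):
            previous = merged[u][v]["weight"] if merged.has_edge(u, v) else 0.0
            merged.add_edge(u, v, weight=previous + w)
    return merged


def cluster_cohesion(graph, members):
    """Mean weight of the edges that stay inside the cluster (0.0 if none)."""
    weights = [w for _, _, w in graph.subgraph(members).edges(data="weight")]
    return float(np.mean(weights)) if weights else 0.0


def detect_clusters(graph, seed=0):
    """Candidate clusters via Louvain; returns a list of sets of asset IDs."""
    return louvain_communities(graph, weight="weight", seed=seed)
```

Each modality graph is built with its own `weight`, so the modality weighting is applied before the merge. Fixing `seed` keeps cluster assignments reproducible between runs, which may help with incremental updates.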

#### Stage 2: LLM-Based Node Labeling (`src/knowledge/organizer.py`)
**Objective:** Generate human-readable, semantically meaningful node names

**Tasks:**
- For each candidate cluster, select representative assets:
  - Choose centroids and high-degree nodes
  - Sample diverse examples across modalities
- Build structured prompts with:
  - Asset metadata (titles, tags, descriptions)
  - Cluster statistics (size, modalities, cohesion)
  - Existing tree context (parent nodes, sibling nodes)
- Call LLM API with prompt:
  ```python
  prompt = '''
  Analyze this cluster of assets and propose a descriptive node name for our knowledge tree.
  
  Cluster contains {count} assets:
  - {sample_titles}
  
  Tags: {common_tags}
  Categories: {categories}
  
  Existing tree structure: {parent_path}
  
  Propose:
  1. Node name (2-5 words, descriptive)
  2. Hierarchy level (1-5, where 1 is root)
  3. Confidence score (0.0-1.0)
  4. Rationale (1-2 sentences explaining why this grouping makes sense)
  
  Return ONLY valid JSON:
  {{
    "node_name": "string",
    "hierarchy_level": integer,
    "confidence": float,
    "rationale": "string"
  }}
  '''
  ```
- Parse structured JSON response
- Implement fallback heuristics for low-confidence or failed LLM calls:
  - Use most common tags
  - Combine primary categories
  - Generate from cluster centroid nearest neighbors

**Configuration:**
- `LLM_PROVIDER`: gemini, openai, anthropic
- `LLM_MODEL_NAME`: gemini-2.5-flash (default)
- `LLM_CONFIDENCE_THRESHOLD`: 0.7 (minimum confidence to use LLM result)
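
Response parsing and the tag-based fallback might look something like the sketch below. It only validates the four JSON fields requested in the prompt. The helper names (`parse_label_response`, `fallback_label`, `label_cluster`) and the `"Uncategorized"` default are assumptions made for illustration, and only the "most common tags" fallback is shown.

```python
import json
from collections import Counter

LLM_CONFIDENCE_THRESHOLD = 0.7


def parse_label_response(raw):
    """Return a validated proposal dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    name = data.get("node_name")
    level = data.get("hierarchy_level")
    confidence = data.get("confidence")
    if not isinstance(name, str) or not name.strip():
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 5:
        return None
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {
        "node_name": name.strip(),
        "hierarchy_level": level,
        "confidence": float(confidence),
        "rationale": str(data.get("rationale", "")),
    }


def fallback_label(asset_tags, max_tags=2):
    """Heuristic name built from the most common tags in the cluster."""
    counts = Counter(tag for tags in asset_tags for tag in tags)
    top = [tag for tag, _ in counts.most_common(max_tags)]
    return " ".join(tag.title() for tag in top) if top else "Uncategorized"


def label_cluster(raw_response, asset_tags):
    """Use the LLM name when it is valid and confident, else the fallback."""
    proposal = parse_label_response(raw_response)
    if proposal and proposal["confidence"] >= LLM_CONFIDENCE_THRESHOLD:
        return proposal["node_name"], "llm"
    return fallback_label(asset_tags), "fallback"
```

Returning the source (`"llm"` or `"fallback"`) alongside the name lets later stages record how each label was produced.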

#### Stage 3: Guardrails & Validation (`src/knowledge/guardrails.py`)
**Objective:** Enforce policies to prevent unsafe or invalid tree structures

**Tasks:**
- Implement policy validators:
  - **PII Detection:** Block nodes containing email addresses, phone numbers, SSNs
  - **Profanity Filter:** Reject offensive or inappropriate terms
  - **Sensitive Content:** Flag medical, financial, or legal content for review
  - **Duplicate Prevention:** Check for semantic similarity to existing nodes (threshold: 0.85)
  - **Cycle Detection:** Ensure no circular parent-child relationships
- Validate hierarchy constraints:
  - Maximum depth: 5 levels
  - Maximum children per node: 100
  - Minimum cluster size: 3 assets
- Store validation results in `knowledge_node.guardrail_status`:
  - `pending`: Awaiting admin review
  - `approved`: Passed all checks or approved by admin
  - `rejected`: Failed validation or rejected by admin
- Generate approval requests for admin UI with:
  - Proposed node details
  - Validation warnings/failures
  - Representative asset previews
  - LLM rationale
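
As a rough sketch, the hierarchy constraints and the duplicate check could be folded into one validator that returns a `guardrail_status`. The mapping used here is an assumption, not something the spec states: hard failures become `rejected`, a sensitive-content flag becomes `pending`, and a clean pass becomes `approved`. Cosine similarity is also assumed as the duplicate metric, and `validate_proposal` is a hypothetical name.

```python
import numpy as np

MAX_TREE_DEPTH = 5
MAX_CHILDREN_PER_NODE = 100
MIN_CLUSTER_SIZE = 3
DUPLICATE_NODE_SIMILARITY_THRESHOLD = 0.85


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def validate_proposal(depth, cluster_size, sibling_count, embedding,
                      existing_embeddings, flagged_sensitive=False):
    """Return (guardrail_status, reasons) for a proposed node."""
    failures = []
    if depth > MAX_TREE_DEPTH:
        failures.append("max_depth")
    if cluster_size < MIN_CLUSTER_SIZE:
        failures.append("min_cluster_size")
    if sibling_count >= MAX_CHILDREN_PER_NODE:
        failures.append("max_children")
    if any(
        cosine_similarity(embedding, other) >= DUPLICATE_NODE_SIMILARITY_THRESHOLD
        for other in existing_embeddings
    ):
        failures.append("duplicate")
    if failures:
        return "rejected", failures
    if flagged_sensitive:
        return "pending", ["sensitive_content"]
    return "approved", []
```

The returned reasons can feed the validation warnings and failures shown in the approval request.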

**Technology:**
- `presidio-analyzer`: PII detection
- Custom regex patterns for profanity/sensitive terms
- Graph algorithms for cycle detection
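
One way to approach cycle detection, shown here as a dependency-free sketch: before inserting a `parent → child` edge, check whether the parent is already reachable from the child. The function name and the `(parent_id, child_id)` tuple format are assumptions chosen to mirror the `knowledge_edge` columns.

```python
from collections import defaultdict, deque


def would_create_cycle(edges, parent_id, child_id):
    """True if adding parent_id -> child_id would close a cycle.

    edges: iterable of existing (parent_id, child_id) pairs.
    """
    if parent_id == child_id:
        return True
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    # Breadth-first search downward from the proposed child.
    queue = deque([child_id])
    seen = {child_id}
    while queue:
        node = queue.popleft()
        for descendant in children.get(node, ()):
            if descendant == parent_id:
                return True  # the parent is already below the child
            if descendant not in seen:
                seen.add(descendant)
                queue.append(descendant)
    return False
```

For example, with edges `Books → Authors → Genres`, proposing `Genres → Books` is rejected because `Genres` is already a descendant of `Books`.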

#### Stage 4: Tree Assembly & Prioritization (`src/knowledge/tree_models.py`)
**Objective:** Build parent-child relationships and compute priority scores

**Tasks:**
- Assemble relationships based on:
  - LLM-recommended hierarchy levels
  - Semantic similarity between node embeddings
  - Guardrail validation results
- Compute priority scores using weighted formula:
  ```python
  priority = (
      0.4 * recency_score +        # Recent assets ranked higher
      0.3 * access_frequency +      # User interaction history
      0.2 * semantic_coherence +    # Cluster tightness
      0.1 * size_normalization      # Penalize very small/large clusters
  )
  ```
- Update cluster centroids by aggregating embeddings for each node:
  ```python
  import numpy as np
  node_embedding = np.mean(np.stack([asset.embedding for asset in node.assets]), axis=0)  # element-wise mean; embeddings must share one dimension
  ```
- Persist relationships in `knowledge_edge` table with rationale
- Publish diff events for downstream systems (UI, search re-ranker)
- Generate changelog entries in `lineage` table
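
The priority formula above leaves the four input signals undefined. The sketch below assumes each one is already normalized to the range 0 to 1 and applies the stated weights. The exponential-decay `recency_score` with a 30-day half-life is purely a hypothetical way to produce such a signal, not something the spec defines.

```python
PRIORITY_WEIGHTS = {
    "recency_score": 0.4,
    "access_frequency": 0.3,
    "semantic_coherence": 0.2,
    "size_normalization": 0.1,
}


def recency_score(age_days, half_life_days=30.0):
    """Hypothetical recency signal: 1.0 when new, halving every half_life_days."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def compute_priority(signals):
    """Weighted sum of the four signals, each expected in [0, 1]."""
    for name in PRIORITY_WEIGHTS:
        value = signals[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(weight * signals[name] for name, weight in PRIORITY_WEIGHTS.items())
```

Because the weights sum to 1.0, the resulting priority also stays in the 0 to 1 range, which keeps `priority_score` values comparable across nodes.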

#### Stage 5: Continuous Learning (`src/knowledge/organizer.py`)
**Objective:** Maintain tree quality as new assets arrive

**Tasks:**
- Monitor new assets for taxonomy drift:
  - Compute distance to nearest existing nodes
  - Trigger re-balancing when the orphan rate exceeds the threshold (more than 20% of assets orphaned)
- Implement partial re-balancing:
  - Only recompute affected subtrees
  - Preserve stable nodes to avoid UI churn
- Admin review workflow:
  - Display pending node proposals in admin UI
  - Allow approval/rejection with rationale
  - Support manual override of LLM decisions
  - Enable drag-and-drop tree restructuring
- Version control for tree structure:
  - Store snapshots before major changes
  - Maintain rollback capability
  - Track change history with timestamps and user IDs
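
The drift monitor could be approximated as below. The issue does not define when an asset counts as orphaned, so this sketch assumes an asset is orphaned when its best cosine similarity to any node centroid falls below a hypothetical `min_similarity` of 0.5. It also assumes all embeddings are non-zero and share one dimension.

```python
import numpy as np


def orphan_fraction(asset_embeddings, node_centroids, min_similarity=0.5):
    """Share of assets whose nearest node centroid is below min_similarity."""
    assets = np.asarray(asset_embeddings, dtype=float)
    nodes = np.asarray(node_centroids, dtype=float)
    if assets.size == 0:
        return 0.0
    if nodes.size == 0:
        return 1.0  # no nodes yet, so every asset is orphaned
    # Normalize rows so the dot product equals cosine similarity.
    assets = assets / np.linalg.norm(assets, axis=1, keepdims=True)
    nodes = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    best = (assets @ nodes.T).max(axis=1)
    return float((best < min_similarity).mean())


def needs_rebalance(asset_embeddings, node_centroids, orphan_limit=0.20):
    """True when more than orphan_limit of the assets are orphaned."""
    return orphan_fraction(asset_embeddings, node_centroids) > orphan_limit
```

A partial re-balance would then restrict the recomputation to the subtrees nearest the orphaned assets rather than rebuilding the whole tree.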

### Database Schema
Implement tables from **docs/technical_specification.md § Data Models & Database Schema**:

```sql
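-- Requires the pgvector extension for the vector type and HNSW index
CREATE EXTENSION IF NOT EXISTS vector;

-- Hierarchy nodes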
CREATE TABLE knowledge_node (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    path TEXT[] NOT NULL,
    depth INTEGER NOT NULL,
    summary TEXT,
    guardrail_status TEXT CHECK (guardrail_status IN ('pending', 'approved', 'rejected')) DEFAULT 'pending',
    embedding vector(768),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(path)
);

CREATE INDEX idx_knowledge_node_depth ON knowledge_node(depth);
CREATE INDEX idx_knowledge_node_path_gin ON knowledge_node USING GIN(path);
CREATE INDEX idx_knowledge_node_embedding_hnsw ON knowledge_node USING hnsw (embedding vector_cosine_ops);

-- Parent-child relationships
CREATE TABLE knowledge_edge (
    parent_id UUID REFERENCES knowledge_node(id) ON DELETE CASCADE,
    child_id UUID REFERENCES knowledge_node(id) ON DELETE CASCADE,
    priority_score REAL,
    rationale TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (parent_id, child_id)
);

CREATE INDEX idx_knowledge_edge_parent ON knowledge_edge(parent_id);
CREATE INDEX idx_knowledge_edge_child ON knowledge_edge(child_id);
```

### Search Integration
Extend search pipeline to use knowledge tree context (**docs/technical_specification.md § Search & Retrieval Pipeline**):

- Add knowledge tree path to search results:
  ```json
  {
    "id": "asset-uuid-1",
    "similarity": 0.92,
    "knowledge_path": ["Books", "Ursula K. Le Guin", "Sci-Fi"],
    "knowledge_node_id": "node-uuid-1"
  }
  ```
- Boost results that match user's current tree navigation context
- Allow filtering by knowledge node (`?knowledge_node_id=uuid`)
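
Context boosting might work along these lines: results whose `knowledge_path` shares a longer prefix with the user's current navigation path get a small score bump. The `boost_per_level` value of 0.05 and the added `score` field are illustrative assumptions; the spec does not define the boosting formula.

```python
def shared_prefix_length(path_a, path_b):
    """Number of leading path segments the two paths have in common."""
    length = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        length += 1
    return length


def boost_results(results, current_path, boost_per_level=0.05):
    """Re-rank results, adding boost_per_level per shared path segment."""
    rescored = []
    for result in results:
        overlap = shared_prefix_length(result.get("knowledge_path", []), current_path)
        score = result["similarity"] + boost_per_level * overlap
        rescored.append({**result, "score": score})
    return sorted(rescored, key=lambda r: r["score"], reverse=True)
```

With `current_path = ["Books", "Ursula K. Le Guin"]`, a result filed under that author gains two levels of boost and can overtake a slightly more similar result from an unrelated branch.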

## Acceptance Criteria
- [ ] Similarity graphs constructed from multi-modal embeddings
- [ ] LLM generates node names with confidence scores and rationales
- [ ] Guardrails validate nodes against policy rules (PII, profanity, duplicates, cycles)
- [ ] `knowledge_node` and `knowledge_edge` tables populated
- [ ] Admin UI displays pending nodes for review
- [ ] Approved nodes appear in search results with breadcrumbs
- [ ] Priority scores computed using recency, frequency, coherence, and size
- [ ] Continuous learning monitors for drift and triggers re-balancing
- [ ] Lineage tracks all tree changes with timestamps and rationale
- [ ] E2E test: upload 50 related assets → nodes proposed → admin approves → search shows hierarchy
- [ ] Performance: tree refresh completes in < 5 minutes for 10K assets

## Configuration
Add to `settings.py`:
```python
# Knowledge Organization
KNOWLEDGE_TREE_ENABLED = True
KNOWLEDGE_TREE_REFRESH_INTERVAL_MINUTES = 10
KNOWLEDGE_GUARDRAIL_THRESHOLD = 0.72
LLM_PROVIDER = "gemini"  # or openai, anthropic
LLM_MODEL_NAME = "gemini-2.5-flash"
LLM_CONFIDENCE_THRESHOLD = 0.7
MAX_TREE_DEPTH = 5
MAX_CHILDREN_PER_NODE = 100
MIN_CLUSTER_SIZE = 3
DUPLICATE_NODE_SIMILARITY_THRESHOLD = 0.85
```

## Testing Requirements
- Unit tests for similarity graph construction
- Unit tests for LLM prompt generation and response parsing
- Unit tests for each guardrail policy (PII, profanity, duplicates, cycles)
- Unit tests for priority score calculation
- Integration test: full pipeline from graph to approved nodes
- Integration test: search with knowledge tree context
- Admin workflow test: pending → approved → search

## References
- docs/technical_specification.md § Semantic Knowledge Organization (lines 1711-1767)
- docs/technical_specification.md § Search & Retrieval Pipeline (lines 1588-1707)
- docs/technical_specification.md § Data Models & Database Schema (lines 2034-2072)
- docs/technical_specification.md § Configuration Reference (lines 2292-2298)
- docs/technical_specification.md § Implementation Checklist Phase 8 (lines 2368-2374)
