Retrospective Consortium DB Linking Analysis

# LLM Consortium Database Analysis - Phase 1 Report (Final - Rev 2)

## 1. Problem Statement & Goal

The `responses` table in `/home/thomas/.config/io.datasette.llm/logs.db` needs retrospective population of two columns:
*   `consortium_id`: A unique identifier to group related requests (members, arbiters, final output) within a single consortium run.
*   `is_arbiter_response`: A flag (0, 1, or 2) to identify the specific role of a response within a run (0=member/output, 1=final arbiter, 2=intermediate arbiter).

The goal of Phase 1 is to investigate the data, devise a reliable strategy for grouping and identification based on user guidance, and plan the implementation.

## 2. Investigation Findings

### 2.1. Database Schema & Structure
*   Confirmed schema: `id`, `model`, `prompt`, `response`, `datetime_utc`, `consortium_id`, `is_arbiter_response`, etc.
*   Total records: **20,778**.
*   Crucially, each record has a **unique** `conversation_id`. **No existing field groups related consortium run records.**
*   Record order by `id` is the key sequencing factor.

### 2.2. Key Patterns and Resources
*   **Consortium Models List:** Located at `/home/thomas/Projects/llm/plugins/Utilities/llm-consortium/llm_consortium/consortium_models.txt`, containing 100+ specific model names indicating the final output of a run.
*   **Deterministic Link (Final Output <-> Final Arbiter):** The record *immediately preceding* (by `id`) a consortium model record is reliably the **final arbiter response** for that run. This is the primary anchor for grouping.
*   **Arbiter Prompts:** Records acting as arbiters (both final and intermediate) typically contain an `<arbiter_prompt>` tag structure in their `prompt` field, wrapping `<original_prompt>` and `<model_responses>` which contain snippets of member responses.
*   **Member Linking:** Member responses can be linked to a final arbiter by finding records whose `response` content exactly matches snippets found within the final arbiter's `<model_responses>`.
*   **Intermediate Arbiters:** Other records containing `<arbiter_prompt>` exist and belong to the same run.

### 2.3. Policy Decisions (User Confirmation)
*   **Intermediate Arbiters:** Should receive the **same `consortium_id`** as the final arbiter/output. They will be flagged `is_arbiter_response = 2` to distinguish them from the final arbiter (`is_arbiter_response = 1`).
*   **Missing/Malformed Final Arbiter:** If the record preceding a consortium output is invalid (e.g., missing, or another consortium output), the assignment for that run should be **skipped** and **logged** for manual review.
*   **Time Window (Search Optimization):** A time window (e.g., up to 1 hour before the final arbiter) should be used *only* to limit the search space for candidate member/intermediate records for performance. The definitive linking mechanism remains content/pattern matching based on the final arbiter's prompt.

## 3. Final Linking & Identification Strategy

A two-phase approach leveraging the deterministic link and user policies:

### Phase 1: Identify Final Arbiter-Consortium Output Pairs & Assign Base IDs

1.  **Load Data:** Load consortium model names. Read all records from `responses` table, ordered by `id`.
2.  **Iterate & Identify Pairs:** Process records sequentially. When a record `R_consortium` has a model name matching the consortium list:
    *   Look at the immediately preceding record `R_arbiter` (by `id`).
    *   **Validation:** Check if `R_arbiter` exists and is *not* also a consortium model.
    *   **If Valid Pair:**
        *   Generate a unique KSUID-style `consortium_id`.
        *   Prepare updates:
            *   `R_consortium`: `consortium_id` = new ID, `is_arbiter_response` = 0
            *   `R_arbiter`: `consortium_id` = new ID, `is_arbiter_response` = 1 (Final Arbiter)
        *   Store the `consortium_id`, `R_arbiter`'s `id`, `datetime_utc`, and `prompt` for Phase 2.
    *   **If Invalid Pair:**
        *   Log the `id` of `R_consortium` and the issue (missing/invalid preceding record) for manual review.
        *   Do *not* assign a `consortium_id`.

### Phase 2: Link Members & Intermediate Arbiters to Groups

1.  **Iterate Through Identified Groups:** For each `consortium_id` successfully assigned in Phase 1:
    *   Retrieve the associated final arbiter's details (`id`, `datetime_utc`, `prompt`).
    *   **Parse Final Arbiter Prompt:** Extract member response snippets from the `<model_responses>` section. Handle potential parsing errors.
    *   **Define Search Window (Optimization):** Calculate the start time (e.g., final arbiter `datetime_utc` - 1 hour) and end time (final arbiter `datetime_utc`). This window *only* limits the candidate search space for performance.
    *   **Query Candidates:** Find all records `R_candidate` such that:
        *   `R_candidate.datetime_utc` is within the search window.
        *   `R_candidate.id` is less than the final arbiter's `id`.
        *   `R_candidate.consortium_id` is currently NULL (to avoid reprocessing).
    *   **Process Candidates (Content/Pattern Matching is Key):** For each `R_candidate`:
        *   **Match Members:** If `R_candidate.response` exactly matches one of the extracted member snippets:
            *   Prepare update: `R_candidate`: `consortium_id` = current ID, `is_arbiter_response` = 0
        *   **Match Intermediate Arbiters:** If `R_candidate.prompt` contains `<arbiter_prompt>`:
            *   Prepare update: `R_candidate`: `consortium_id` = current ID, `is_arbiter_response` = 2 (Intermediate Arbiter)
            *   *(Ensure this doesn't overwrite a member match if an intermediate arbiter's response happens to match a snippet - prioritize the arbiter=2 flag).*

### Phase 3: Database Update

1.  **Consolidate Updates:** Collect all prepared update statements from Phase 1 and Phase 2.
2.  **Execute Transaction:** Apply all updates within a single database transaction for atomicity.

## 4. Implementation Plan (Python Script: `analyze_consortium_logs.py`)

***Corrected Mermaid Flowchart:***
```mermaid
flowchart TD
    A[Start] --> B[Backup Database]
    B --> C[Load Consortium Models]
    C --> D[Read All Records]
    D --> E[Process Records]
    E --> F[Identify Consortium-Arbiter Pairs]
    F --> G[Generate IDs]
    G --> H[Process Member Records]
    H --> I[Update Database]
    I --> J[End]
```

### 4.1. Key Implementation Points
*   **Database Interaction:** Use `sqlite3` library. Fetch records efficiently. Use transactions for updates.
*   **KSUID Generation:** Use the `ksuid` library (`pip install ksuid`).
*   **Arbiter Prompt Parsing:** Use robust XML parsing (e.g., `xml.etree.ElementTree`) to handle potential variations or errors in the `<arbiter_prompt>` structure.
*   **Exact String Matching:** Implement precise comparison of `response` content to extracted snippets.
*   **Date/Time Handling:** Use Python's `datetime` module for reliable window calculations (e.g., `- timedelta(hours=1)`). Ensure timezone awareness if needed (UTC seems standard).
*   **Logging:** Implement comprehensive logging for progress, assigned groups, skipped records, parsing errors, and database updates.

## 5. Technical Implementation Snippets (Conceptual - Updated Time Window)

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta
from ksuid import ksuid # Requires: pip install ksuid

# --- Phase 1 (remains the same) ---
def process_records_phase1(conn, consortium_models):
    # ... (as defined previously) ...
    pass 

# --- Phase 2 (Updated Time Window) ---
def extract_snippets_from_arbiter(prompt_text):
     # ... (as defined previously) ...
     pass

def process_group_phase2(conn, consortium_id, group_info):
    updates_phase2 = []
    final_arbiter_dt = datetime.fromisoformat(group_info['final_arbiter_dt_utc'].replace('+00:00', '')) # Ensure compatible format
    # Use potentially larger window (e.g., 1 hour) for search optimization
    window_start_dt = final_arbiter_dt - timedelta(hours=1) 
    
    member_snippets = extract_snippets_from_arbiter(group_info['final_arbiter_prompt'])
    if not member_snippets:
        print(f"Warning: No member snippets found for consortium_id {consortium_id}")

    # Query candidates within the time window, before the final arbiter, and not yet assigned
    query = """
        SELECT id, prompt, response 
        FROM responses 
        WHERE datetime(datetime_utc) BETWEEN datetime(?) AND datetime(?) 
          AND id < ? 
          AND consortium_id IS NULL 
        ORDER BY id DESC 
    """ 
    
    candidates_cursor = conn.execute(query, (
        window_start_dt.isoformat(), 
        final_arbiter_dt.isoformat(), 
        group_info['final_arbiter_id']
    ))

    matched_ids_in_phase2 = set() # Track IDs updated in this phase to prioritize arbiter=2

    for candidate in candidates_cursor:
        candidate_id = candidate['id']
        
        # Check for Intermediate Arbiter first (higher priority for flagging)
        is_intermediate_arbiter = '<arbiter_prompt>' in candidate['prompt']
        
        if is_intermediate_arbiter:
            if candidate_id not in matched_ids_in_phase2:
                updates_phase2.append({
                    'id': candidate_id, 
                    'consortium_id': consortium_id, 
                    'is_arbiter_response': 2 # Intermediate Arbiter
                })
                matched_ids_in_phase2.add(candidate_id)
        
        # Check for Member match only if not already flagged as intermediate arbiter
        elif member_snippets and candidate_id not in matched_ids_in_phase2:
            # Perform EXACT match - careful with whitespace/newlines
            if candidate['response'] in member_snippets:
                 updates_phase2.append({
                     'id': candidate_id, 
                     'consortium_id': consortium_id, 
                     'is_arbiter_response': 0 # Member
                 })
                 matched_ids_in_phase2.add(candidate_id)

    # Filter out potential duplicates if logic allows (though matched_ids_in_phase2 should prevent it)
    # final_updates_phase2 = list({update['id']: update for update in updates_phase2}.values())


    return updates_phase2

# --- Main Execution Flow ---
# 1. Backup DB
# 2. Load consortium models
# 3. Connect to DB
# 4. updates1, groups2, errors = process_records_phase1(conn, models)
# 5. all_updates = list(updates1) # Use a list directly
# 6. for cid, info in groups2.items():
# 7.    updates2 = process_group_phase2(conn, cid, info)
# 8.    all_updates.extend(updates2)
# 9. Apply all_updates in a transaction using executemany more effectively
#    (Separate lists for consortium_id updates and is_arbiter_response updates)
# 10. Log summary and errors
```

## 6. Final Check Before Proceeding (Phase 2)

*   Mermaid diagram syntax corrected.
*   Time window clarification incorporated (search optimization, potentially 1 hour).
*   Strategy remains focused on the reliable `id`-based link for final pairs and content/pattern matching for members/intermediate arbiters within the search window.
*   Flagging scheme (0, 1, 2) and error handling are defined.

This plan seems complete and addresses the latest feedback. Requesting final approval to proceed to Phase 2: Script Development & Execution.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retrospective Consortium DB Linking Analysis #7

LLM Consortium Database Analysis - Phase 1 Report (Final - Rev 2)

1. Problem Statement & Goal

2. Investigation Findings

2.1. Database Schema & Structure

2.2. Key Patterns and Resources

2.3. Policy Decisions (User Confirmation)

3. Final Linking & Identification Strategy

Phase 1: Identify Final Arbiter-Consortium Output Pairs & Assign Base IDs

Phase 2: Link Members & Intermediate Arbiters to Groups

Phase 3: Database Update

4. Implementation Plan (Python Script: `analyze_consortium_logs.py`)

4.1. Key Implementation Points

5. Technical Implementation Snippets (Conceptual - Updated Time Window)

6. Final Check Before Proceeding (Phase 2)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Retrospective Consortium DB Linking Analysis #7

Description

LLM Consortium Database Analysis - Phase 1 Report (Final - Rev 2)

1. Problem Statement & Goal

2. Investigation Findings

2.1. Database Schema & Structure

2.2. Key Patterns and Resources

2.3. Policy Decisions (User Confirmation)

3. Final Linking & Identification Strategy

Phase 1: Identify Final Arbiter-Consortium Output Pairs & Assign Base IDs

Phase 2: Link Members & Intermediate Arbiters to Groups

Phase 3: Database Update

4. Implementation Plan (Python Script: analyze_consortium_logs.py)

4.1. Key Implementation Points

5. Technical Implementation Snippets (Conceptual - Updated Time Window)

6. Final Check Before Proceeding (Phase 2)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

4. Implementation Plan (Python Script: `analyze_consortium_logs.py`)