Proposal for V2 of the Multi-Agent Observability System

Based on my analysis and the official documentation, here is a proposal for **V2 of the Multi-Agent Observability System**.

This version re-architects the project around Claude Code's native OpenTelemetry support for metrics and logs, and repurposes the existing frontend and backend for high-value qualitative events. Because hooks no longer make per-event LLM calls, the system should be substantially cheaper and faster to run, while capturing richer data through an industry-standard observability stack.

Here is the new `README.md` followed by the necessary code and configuration files.

---

### **`README.md` (V2)**

# Multi-Agent Observability System V2

This revised system provides a professional-grade, real-time monitoring solution for Claude Code agents. It leverages Claude Code's **native OpenTelemetry (OTel) integration** for quantitative metrics and logs, while reserving hooks for powerful, qualitative event-driven enhancements.

This V2 architecture corrects the critical flaws of the original approach by eliminating inefficient, per-event LLM calls and capturing far richer data directly from the agent's core.

## 🏗️ V2 Architecture

The system is now split into two complementary data pipelines:

**1. Quantitative Observability (The Core):** For metrics and logs.
`Claude Code → OpenTelemetry Collector → Prometheus (Metrics) & Loki (Logs) → Grafana`

**2. Qualitative Events (The Enhancement):** For rich, context-aware events.
`Claude Code Hooks → Python Scripts → Bun Server → SQLite → WebSocket → Vue Client`



## ✨ Key Improvements in V2

*   **⚡ Extreme Efficiency:** By removing the LLM summarizer from the hooks, the system avoids the latency and cost of per-event API calls. A simple `ls` command no longer triggers two expensive API calls.
*   **📊 Richer Data:** The native OTel pipeline captures critical data unavailable to hooks, including **token counts, API costs, request latencies, and cache usage.**
*   **🛠️ Correct Use of Hooks:** Hooks are now used for their intended purpose: providing deterministic control, capturing high-value qualitative data (like full session transcripts), and triggering real-time notifications (e.g., TTS).
*   **📈 Industry-Standard Tooling:** V2 is built on a standard, robust observability stack (OTel, Prometheus, Grafana, Loki) that is scalable and widely used in production environments.
*   **🚀 One-Command Setup:** The entire observability stack, including the frontend and backend, is now orchestrated with a single `docker-compose up` command.

## 🚀 Quick Start

**Prerequisites:**
*   [Claude Code](https://docs.anthropic.com/en/docs/claude-code)
*   [Docker and Docker Compose](https://docs.docker.com/get-docker/)

**1. Configure Environment Variables**

Create a `.env` file in the project root and add your Anthropic API key. This will be used by both Claude Code and the hook scripts.

```bash
# .env
ANTHROPIC_API_KEY="sk-ant-..."
CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
OTEL_LOG_USER_PROMPTS=1 # Set to 1 to log full prompt text
```
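Before launching Claude Code, it can help to confirm that the telemetry variables are actually present in the environment. The following is a minimal sketch, assuming the variable names from the `.env` example above; the helper name `missing_telemetry_vars` is hypothetical and not part of the project.

```python
import os

# Variable names are taken from the .env example; the helper is illustrative.
REQUIRED_VARS = (
    "CLAUDE_CODE_ENABLE_TELEMETRY",
    "OTEL_METRICS_EXPORTER",
    "OTEL_LOGS_EXPORTER",
    "OTEL_EXPORTER_OTLP_PROTOCOL",
    "OTEL_EXPORTER_OTLP_ENDPOINT",
)


def missing_telemetry_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_telemetry_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All telemetry variables are set.")
```

Run it from the same shell that will start Claude Code, since that shell's environment is what the agent inherits.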

**2. Launch the System**

Start the entire observability stack, including the Vue frontend and Bun backend:

```bash
docker-compose up --build
```

**3. Set Up Claude Code Hooks**

Copy the improved `.claude` directory to any project you want to monitor:

```bash
cp -R .claude /path/to/your/project/
```
*Note: The hook scripts have been updated to remove the inefficient LLM summarizer.*

**4. Start Coding!**

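Run Claude Code in the configured project directory. Your terminal must have the environment variables from Step 1 exported. A plain `source .env` sets them only in the current shell without exporting them to child processes, so use `set -a; source .env; set +a`, or add `export` lines to your shell profile.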

*   **View Metrics & Logs:** Open Grafana at [**http://localhost:3000**](http://localhost:3000) (user: `admin`, pass: `admin`). The Claude Code dashboard will be pre-installed.
*   **View Qualitative Events:** Open the Vue app at [**http://localhost:5173**](http://localhost:5173).
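Once the stack is running, a quick reachability check can save time before opening the dashboards. This is a rough sketch, assuming the host ports from this proposal's `docker-compose.yml` and the standard health paths of Grafana (`/api/health`), Prometheus (`/-/ready`) and Loki (`/ready`); the names `http_ok` and `check_endpoints` are hypothetical.

```python
import urllib.error
import urllib.request

# Ports come from the proposed docker-compose.yml; adjust if you remap them.
ENDPOINTS = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
    "vue-client": "http://localhost:5173/",
}


def http_ok(url, timeout=3.0):
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts and non-2xx HTTP errors.
        return False


def check_endpoints(endpoints, probe=http_ok):
    """Map each service name to True/False using the given probe function."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_endpoints(ENDPOINTS)` returns a dictionary such as `{"grafana": True, ...}`. The `probe` parameter exists so the check can be exercised without a running stack.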

## 🔧 Component Details

### Observability Core (Docker Compose)

*   **OTel Collector:** Receives OTel data from Claude Code and exports it to Prometheus and Loki.
*   **Prometheus:** Stores all quantitative metrics (costs, token counts, etc.).
*   **Loki:** Stores all logs and event data (API requests, tool usage, etc.).
*   **Grafana:** Visualizes all data from Prometheus and Loki in a pre-built dashboard.

### Qualitative Event System (Hooks + Vue App)

The original application now serves a more focused, powerful purpose.

*   **Hooks (`.claude/hooks`):**
    *   No longer call an LLM on every event.
    *   The `Stop` hook (via `send_event.py --add-chat`) now captures the **entire chat transcript** at the end of a session, providing qualitative context that metrics alone cannot.
    *   The `notification.py` hook remains for real-time TTS alerts.
*   **Bun Server & Vue Client:**
    *   The Vue app now visualizes a stream of high-signal events like session completions (with full transcripts) and user notifications, complementing the quantitative data in Grafana.
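To illustrate the transcript capture described above, here is a hedged sketch of how a hook could attach a transcript to an event. It assumes the file at `transcript_path` is in JSON Lines format and uses a hypothetical `chat` key. It is not the project's existing transcript logic, which stays unchanged.

```python
import json


def load_transcript(path):
    """Read a JSON Lines transcript, skipping blank or malformed lines."""
    messages = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                messages.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # tolerate a partially written line
    return messages


def attach_chat(event_data, input_data):
    """Return a copy of event_data with a 'chat' list when a transcript exists."""
    enriched = dict(event_data)
    path = input_data.get("transcript_path")
    if path:
        try:
            enriched["chat"] = load_transcript(path)  # 'chat' is an assumed key
        except OSError:
            pass  # missing or unreadable transcript: send the event without it
    return enriched
```

Returning a copy rather than mutating `event_data` keeps the base event intact if the transcript cannot be read.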

---

### **Implementation Files for V2**

Here are the new and modified files required to implement this improved system.

#### 1. `docker-compose.yml` (New)

This file orchestrates the entire system. Place it in the project root.

```yaml
version: '3.8'

services:
  # Observability Stack
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    volumes:
      - ./observability/otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    command: ["--config.file=/etc/prometheus/prometheus.yml"]
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring

  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus
      - loki

  # Original Application
  server:
    build:
      context: ./apps/server
    container_name: multi-agent-server
    ports:
      - "4000:4000"
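    volumes:
      # Create an empty file first (e.g. `touch apps/server/events.db`);
      # otherwise Docker creates a directory at this host path.
      - ./apps/server/events.db:/app/events.db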
    networks:
      - monitoring

  client:
    build:
      context: ./apps/client
    container_name: multi-agent-client
    ports:
      - "5173:5173"
    depends_on:
      - server
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
```

#### 2. OpenTelemetry & Grafana Configuration

Create a new directory `observability` in the project root to hold these files.

**`observability/otel-collector-config.yml`:**
```yaml
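receivers:
  otlp:
    protocols:
      grpc:
        # Bind to all interfaces so the published container ports are reachable;
        # recent Collector releases default to localhost.
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318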

processors:
  batch:

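exporters:
  # The `logging` exporter was replaced by `debug` in recent Collector releases.
  debug:
    verbosity: detailed
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Note: the `loki` exporter is deprecated in recent collector-contrib releases
  # and may be missing from `latest`. Pin an image version that still ships it,
  # or switch to `otlphttp` pointed at Loki's OTLP endpoint.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, loki]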
```

**`observability/prometheus.yml`:**
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```

**`observability/grafana/provisioning/datasources/datasources.yml`:**
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: Loki # Matches the datasource uid referenced by the dashboard's logs panel
    access: proxy
    url: http://loki:3100
```

**`observability/grafana/provisioning/dashboards/dashboards.yml`:**
```yaml
apiVersion: 1

providers:
- name: 'default'
  orgId: 1
  folder: ''
  type: file
  disableDeletion: false
  editable: true
  options:
    path: /etc/grafana/provisioning/dashboards
```

**`observability/grafana/provisioning/dashboards/claude-code-dashboard.json`:**
(A minimal dashboard definition to get started. You can build this out in the Grafana UI.)
```json
{
  "__inputs": [],
  "__requires": [],
  "annotations": { "list": [] },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "panels": [
    {
      "type": "stat",
      "title": "Total Cost (USD)",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
      "targets": [{ "expr": "sum(claude_code_cost_usage_total)", "legendFormat": "Total Cost" }],
      "options": { "reduceOptions": { "calcs": ["last"], "fields": "" }, "textMode": "auto", "colorMode": "value", "graphMode": "area", "unit": "currencyUSD" }
    },
    {
      "type": "stat",
      "title": "Total Sessions",
      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
      "targets": [{ "expr": "sum(claude_code_session_count_total)" }],
      "options": { "reduceOptions": { "calcs": ["last"], "fields": "" }, "textMode": "auto", "colorMode": "value", "graphMode": "area" }
    },
    {
      "type": "piechart",
      "title": "Token Usage by Type",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
      "targets": [{ "expr": "sum by (type) (claude_code_token_usage_total)" }],
      "options": { "displayLabels": ["name", "percent"], "pieType": "donut" }
    },
    {
      "type": "logs",
      "title": "Latest Tool Decisions & API Errors",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [{ "datasource": { "type": "loki", "uid": "Loki" }, "expr": "{job=\"otel-collector\"} | json | event_name=~\"claude_code.tool_decision|claude_code.api_error\"" }],
      "options": { "showTime": true, "showLabels": true, "wrapLines": true, "prettifyLogMessage": true }
    }
  ],
  "refresh": "10s",
  "time": { "from": "now-1h", "to": "now" },
  "title": "Claude Code Observability V2"
}
```
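The same metrics can be read outside Grafana through the Prometheus HTTP API, which is handy for checking that data is arriving before building panels. This is a minimal sketch, assuming Prometheus is reachable on `localhost:9090` as in the Compose file and that the metric names match those used in the dashboard above. Exported names may differ depending on how the Collector translates units, so treat the expression as a placeholder. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base_url="http://localhost:9090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def first_value(response):
    """Extract the first sample of an instant-vector response as a float.

    Returns None when the query failed or matched no series.
    """
    if response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def query_prometheus(expr, base_url="http://localhost:9090"):
    """Run expr against Prometheus and return the first value, if any."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        return first_value(json.load(resp))
```

For example, `query_prometheus("sum(claude_code_cost_usage_total)")` returns `None` until at least one matching series has been scraped.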

#### 3. Modified Hook Scripts

The only change required is removing the inefficient summarizer.

**`.claude/hooks/send_event.py` (Modified)**
The `--summarize` flag and its logic should be removed.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.8"
# dependencies = ["python-dotenv"]
# ///
import json
import sys
import os
import argparse
import urllib.request
import urllib.error
from datetime import datetime

# ... (keep send_event_to_server function as is) ...

def main():
    parser = argparse.ArgumentParser(description='Send Claude Code hook events')
    parser.add_argument('--source-app', required=True, help='Source application name')
    parser.add_argument('--event-type', required=True, help='Hook event type')
    parser.add_argument('--server-url', default='http://localhost:4000/events', help='Server URL')
    # The --add-chat flag is now primarily used by the Stop hook.
    parser.add_argument('--add-chat', action='store_true', help='Include chat transcript if available')
    
    args = parser.parse_args()
    
    try:
        input_data = json.load(sys.stdin)
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON input: {e}", file=sys.stderr)
        sys.exit(1)
    
    event_data = {
        'source_app': args.source_app,
        'session_id': input_data.get('session_id', 'unknown'),
        'hook_event_type': args.event_type,
        'payload': input_data,
        'timestamp': int(datetime.now().timestamp() * 1000)
    }
    
    if args.add_chat and 'transcript_path' in input_data:
        # ... (keep existing chat transcript logic) ...
        pass  # placeholder so this excerpt parses; the existing logic goes here
    
    # Send to server (the summarizer call is now gone)
    send_event_to_server(event_data, args.server_url)
    
    sys.exit(0)

if __name__ == '__main__':
    main()
```

**`.claude/settings.json` (Modified)**
Update the commands to remove the `--summarize` flag.

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/pre_tool_use.py" },
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type PreToolUse" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type PostToolUse" }
        ]
      }
    ],
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/notification.py --notify" },
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type Notification" }
        ]
      }
    ],
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type Stop --add-chat" }
        ]
      }
    ]
  }
}

```
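When migrating an existing project, a short check can confirm that `settings.json` parses as strict JSON and that no hook command still passes `--summarize`. This is a sketch under the assumption that the file follows the `hooks` layout shown above; the function names are hypothetical.

```python
import json


def hook_commands(settings):
    """Yield (event_name, command) for every command hook in settings."""
    for event_name, groups in settings.get("hooks", {}).items():
        for group in groups:
            for hook in group.get("hooks", []):
                if hook.get("type") == "command":
                    yield event_name, hook.get("command", "")


def find_summarize_flags(settings_text):
    """Parse settings_text as strict JSON and list commands using --summarize."""
    # json.loads raises JSONDecodeError on comments or trailing commas.
    settings = json.loads(settings_text)
    return [
        (event, command)
        for event, command in hook_commands(settings)
        if "--summarize" in command.split()
    ]
```

Pass it the file contents, for example `find_summarize_flags(open(".claude/settings.json").read())`. An empty list means the migration is complete.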
