Proposal for V2 of the Multi-Agent Observability System #4

@coygeek

Based on my analysis and the official documentation, here is a proposal for V2 of the Multi-Agent Observability System.

This version re-architects the project to align with Claude Code's best practices: it uses Claude Code's native OpenTelemetry support for metrics and logs and repurposes the existing frontend/backend for high-value qualitative events. The result removes the per-event LLM calls, captures data that hooks cannot see (token counts, API costs, latencies), and builds on industry-standard observability tooling.

Here is the new README.md followed by the necessary code and configuration files.


README.md (V2)

Multi-Agent Observability System V2

This revised system provides a professional-grade, real-time monitoring solution for Claude Code agents. It leverages Claude Code's native OpenTelemetry (OTel) integration for quantitative metrics and logs, while reserving hooks for powerful, qualitative event-driven enhancements.

This V2 architecture corrects the critical flaws of the original approach by eliminating inefficient, per-event LLM calls and capturing far richer data directly from the agent's core.

🏗️ V2 Architecture

The system is now split into two complementary data pipelines:

1. Quantitative Observability (The Core): For metrics and logs.
Claude Code → OpenTelemetry Collector → Prometheus (Metrics) & Loki (Logs) → Grafana

2. Qualitative Events (The Enhancement): For rich, context-aware events.
Claude Code Hooks → Python Scripts → Bun Server → SQLite → WebSocket → Vue Client

✨ Key Improvements in V2

  • ⚡ Extreme Efficiency: By removing the LLM summarizer from the hooks, the system is now orders of magnitude faster and cheaper. A simple ls command no longer triggers two expensive API calls.
  • 📊 Richer Data: The native OTel pipeline captures critical data unavailable to hooks, including token counts, API costs, request latencies, and cache usage.
  • 🛠️ Correct Use of Hooks: Hooks are now used for their intended purpose: providing deterministic control, capturing high-value qualitative data (like full session transcripts), and triggering real-time notifications (e.g., TTS).
  • 📈 Industry-Standard Tooling: V2 is built on a standard, robust observability stack (OTel, Prometheus, Grafana, Loki) that is scalable and widely used in production environments.
  • 🚀 One-Command Setup: The entire observability stack, including the frontend and backend, is now orchestrated with a single docker-compose up command.

🚀 Quick Start

Prerequisites:

1. Configure Environment Variables

Create a .env file in the project root containing your Anthropic API key and the telemetry settings below. Both Claude Code and the hook scripts read these variables.

# .env
ANTHROPIC_API_KEY="sk-ant-..."
CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
OTEL_LOG_USER_PROMPTS=1 # Set to 1 to log full prompt text

2. Launch the System

Start the entire observability stack, including the Vue frontend and Bun backend:

docker-compose up --build

3. Set Up Claude Code Hooks

Copy the improved .claude directory to any project you want to monitor:

cp -R .claude /path/to/your/project/

Note: The hook scripts have been updated to remove the inefficient LLM summarizer.

4. Start Coding!

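Run Claude Code in the configured project directory. Your terminal must have the environment variables from Step 1 exported. A plain source .env sets the variables without exporting them to child processes, so either run set -a; source .env; set +a, or add export lines to your shell profile.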
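As a quick sanity check before starting a session, something like the following could confirm that the telemetry variables are visible to child processes. This is a minimal sketch: the function name telemetry_problems is hypothetical, and the expected values simply mirror the .env example in Step 1 (other exporter or protocol values may be equally valid for your setup).

```python
import os

# Names and values copied from the .env example in Step 1 (assumed defaults,
# not the only valid settings).
REQUIRED_VARS = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
}


def telemetry_problems(env):
    """Return human-readable problems; an empty list means the env matches Step 1."""
    problems = []
    for name, expected in REQUIRED_VARS.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set")
        elif actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
    return problems


if __name__ == "__main__":
    # os.environ only contains exported variables, which is exactly what
    # Claude Code will see when launched from this shell.
    for problem in telemetry_problems(os.environ):
        print(problem)
```

Running it from the same shell you will launch Claude Code from prints nothing when everything is exported as in Step 1.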

🔧 Component Details

Observability Core (Docker Compose)

  • OTel Collector: Receives OTLP data from Claude Code, exposes metrics on a Prometheus endpoint for scraping, and pushes logs to Loki.
  • Prometheus: Stores all quantitative metrics (costs, token counts, etc.).
  • Loki: Stores all logs and event data (API requests, tool usage, etc.).
  • Grafana: Visualizes all data from Prometheus and Loki in a pre-built dashboard.

Qualitative Event System (Hooks + Vue App)

The original application now serves a more focused, powerful purpose.

  • Hooks (.claude/hooks):
    • No longer call an LLM on every event.
    • The Stop hook (send_event.py with --add-chat) now captures the entire chat transcript at the end of a session, providing invaluable qualitative context.
    • The notification.py hook remains for real-time TTS alerts.
  • Bun Server & Vue Client:
    • The Vue app now visualizes a stream of high-signal events like session completions (with full transcripts) and user notifications, complementing the quantitative data in Grafana.
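To make the transcript capture concrete, here is a rough sketch of how a hook might load the transcript before attaching it to an event. It assumes the transcript at transcript_path is a JSONL file (one JSON object per line); load_transcript is a hypothetical helper name, not necessarily what the existing hook uses.

```python
import json
from pathlib import Path


def load_transcript(transcript_path):
    """Read a JSONL transcript into a list of parsed entries.

    Blank and malformed lines are skipped so that a partially written
    final line cannot break the hook.
    """
    path = Path(transcript_path)
    if not path.is_file():
        return []
    entries = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return entries
```

A hook could then set event_data['chat'] = load_transcript(input_data['transcript_path']) before sending the event.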

Implementation Files for V2

Here are the new and modified files required to implement this improved system.

1. docker-compose.yml (New)

This file orchestrates the entire system. Place it in the project root.

version: '3.8'

services:
  # Observability Stack
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    volumes:
      - ./observability/otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    command: ["--config.file=/etc/prometheus/prometheus.yml"]
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring

  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus
      - loki

  # Original Application
  server:
    build:
      context: ./apps/server
    container_name: multi-agent-server
    ports:
      - "4000:4000"
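    volumes:
      # Create this file before the first run (touch apps/server/events.db);
      # if it is missing, Docker creates a directory at this path instead.
      - ./apps/server/events.db:/app/events.db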
    networks:
      - monitoring

  client:
    build:
      context: ./apps/client
    container_name: multi-agent-client
    ports:
      - "5173:5173"
    depends_on:
      - server
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
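Once the stack is up, a small script can confirm that each published port is accepting connections. The sketch below is only an illustration: the port numbers are the host ports from the compose file above, and port_open and check_services are hypothetical helper names.

```python
import socket

# Host ports published by the compose file above.
SERVICE_PORTS = {
    "otel-collector (gRPC)": 4317,
    "otel-collector (HTTP)": 4318,
    "prometheus": 9090,
    "loki": 3100,
    "grafana": 3000,
    "server": 4000,
    "client": 5173,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", ports=SERVICE_PORTS):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

An open port only shows that something is listening; it does not prove the service is healthy.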

2. OpenTelemetry & Grafana Configuration

Create a new directory observability in the project root to hold these files.

observability/otel-collector-config.yml:

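receivers:
  otlp:
    protocols:
      # Bind to all interfaces; recent collector releases default to
      # localhost, which is unreachable from outside the container.
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  # The debug exporter replaces the older logging exporter in recent releases.
  debug:
    verbosity: detailed
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Note: the loki exporter is deprecated in recent collector-contrib releases.
  # If it is unavailable in your image, Loki 3.x can ingest OTLP directly via
  # an otlphttp exporter pointed at http://loki:3100/otlp.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, loki]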

observability/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

observability/grafana/provisioning/datasources/datasources.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

observability/grafana/provisioning/dashboards/dashboards.yml:

apiVersion: 1

providers:
- name: 'default'
  orgId: 1
  folder: ''
  type: file
  disableDeletion: false
  editable: true
  options:
    path: /etc/grafana/provisioning/dashboards

observability/grafana/provisioning/dashboards/claude-code-dashboard.json:
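(A minimal dashboard definition to get started. You can build this out in the Grafana UI. The metric names, the Loki label selector, and the Loki datasource uid used below are assumptions about how the collector and Grafana name things; confirm them in the Prometheus UI and in Grafana's datasource settings.)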

{
  "__inputs": [],
  "__requires": [],
  "annotations": { "list": [] },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "panels": [
    {
      "type": "stat",
      "title": "Total Cost (USD)",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
      "targets": [{ "expr": "sum(claude_code_cost_usage_total)", "legendFormat": "Total Cost" }],
      "options": { "reduceOptions": { "calcs": ["last"], "fields": "" }, "textMode": "auto", "colorMode": "value", "graphMode": "area", "unit": "currencyUSD" }
    },
    {
      "type": "stat",
      "title": "Total Sessions",
      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
      "targets": [{ "expr": "sum(claude_code_session_count_total)" }],
      "options": { "reduceOptions": { "calcs": ["last"], "fields": "" }, "textMode": "auto", "colorMode": "value", "graphMode": "area" }
    },
    {
      "type": "piechart",
      "title": "Token Usage by Type",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
      "targets": [{ "expr": "sum by (type) (claude_code_token_usage_total)" }],
      "options": { "displayLabels": ["name", "percent"], "pieType": "donut" }
    },
    {
      "type": "logs",
      "title": "Latest Tool Decisions & API Errors",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [{ "datasource": { "type": "loki", "uid": "Loki" }, "expr": "{job=\"otel-collector\"} | json | event_name=~\"claude_code.tool_decision|claude_code.api_error\"" }],
      "options": { "showTime": true, "showLabels": true, "wrapLines": true, "prettifyLogMessage": true }
    }
  ],
  "refresh": "10s",
  "time": { "from": "now-1h", "to": "now" },
  "title": "Claude Code Observability V2"
}
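The same PromQL expressions used in the panels can be checked outside Grafana through the Prometheus HTTP API (GET /api/v1/query). The sketch below assumes Prometheus is reachable at http://localhost:9090 as in the compose file; the helper names are hypothetical, and the metric name is the one used in the dashboard above, which may differ in your deployment.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, expr):
    """Build a Prometheus instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def query_sum(base_url, expr, timeout=5.0):
    """Run an instant query and add up the returned sample values."""
    with urllib.request.urlopen(instant_query_url(base_url, expr), timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "1.23"]}.
    return sum(float(sample["value"][1]) for sample in body["data"]["result"])


if __name__ == "__main__":
    print(query_sum("http://localhost:9090", "sum(claude_code_cost_usage_total)"))
```

If the query returns an empty result, the metric name is the first thing to check in the Prometheus UI.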

3. Modified Hook Scripts

The only change required is removing the inefficient summarizer.

.claude/hooks/send_event.py (Modified)
The --summarize flag and its logic should be removed.

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.8"
# dependencies = ["python-dotenv"]
# ///
import json
import sys
import os
import argparse
import urllib.request
import urllib.error
from datetime import datetime

# ... (keep send_event_to_server function as is) ...

def main():
    parser = argparse.ArgumentParser(description='Send Claude Code hook events')
    parser.add_argument('--source-app', required=True, help='Source application name')
    parser.add_argument('--event-type', required=True, help='Hook event type')
    parser.add_argument('--server-url', default='http://localhost:4000/events', help='Server URL')
    # The --add-chat flag is now primarily used by the Stop hook.
    parser.add_argument('--add-chat', action='store_true', help='Include chat transcript if available')
    
    args = parser.parse_args()
    
    try:
        input_data = json.load(sys.stdin)
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON input: {e}", file=sys.stderr)
        sys.exit(1)
    
    event_data = {
        'source_app': args.source_app,
        'session_id': input_data.get('session_id', 'unknown'),
        'hook_event_type': args.event_type,
        'payload': input_data,
        'timestamp': int(datetime.now().timestamp() * 1000)
    }
    
    if args.add_chat and 'transcript_path' in input_data:
        # ... (keep existing chat transcript logic) ...
        pass  # placeholder so this excerpt parses; the real logic is elided here
    
    # Send to server (the summarizer call is now gone)
    send_event_to_server(event_data, args.server_url)
    
    sys.exit(0)

if __name__ == '__main__':
    main()
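For readers without the original file at hand, the elided send_event_to_server function could look roughly like the sketch below. This is not the existing implementation: build_event_request is a hypothetical helper, and the True-on-HTTP-200 return convention is an assumption. Only the default URL comes from the script above.

```python
import json
import sys
import urllib.error
import urllib.request


def build_event_request(event_data, server_url):
    """Build a JSON POST request for the event server."""
    return urllib.request.Request(
        server_url,
        data=json.dumps(event_data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event_to_server(event_data, server_url="http://localhost:4000/events", timeout=5):
    """POST the event; return True on HTTP 200 and False on any failure."""
    try:
        request = build_event_request(event_data, server_url)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError) as e:
        # An observability failure should never block the agent,
        # so report the error and carry on.
        print(f"Failed to send event: {e}", file=sys.stderr)
        return False
```

Swallowing the error keeps the hook from failing when the Bun server is down, at the cost of silently dropped events.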

.claude/settings.json (Modified)
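Update the commands to remove the --summarize flag. The Stop hook keeps --add-chat so that the full transcript, the most valuable qualitative event, is captured at the end of the session.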

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/pre_tool_use.py" },
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type PreToolUse" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type PostToolUse" }
        ]
      }
    ],
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/notification.py --notify" },
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type Notification" }
        ]
      }
    ],
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "uv run .claude/hooks/send_event.py --source-app cc-hooks-v2 --event-type Stop --add-chat" }
        ]
      }
    ]
  }
}
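After editing settings.json, a short check can confirm that the file is strict JSON (no comments or trailing commas) and that no command still passes --summarize. This is a sketch only; hook_commands and leftover_summarize are hypothetical helper names, and the structure walked here is the one shown in the settings above.

```python
import json


def hook_commands(settings):
    """Yield (event_name, command) for every command hook in a settings dict."""
    for event_name, entries in settings.get("hooks", {}).items():
        for entry in entries:
            for hook in entry.get("hooks", []):
                if hook.get("type") == "command":
                    yield event_name, hook.get("command", "")


def leftover_summarize(settings_text):
    """Parse settings as strict JSON and list commands still using --summarize."""
    settings = json.loads(settings_text)  # raises on comments or trailing commas
    return [
        (event, command)
        for event, command in hook_commands(settings)
        if "--summarize" in command.split()
    ]


if __name__ == "__main__":
    with open(".claude/settings.json", encoding="utf-8") as f:
        for event, command in leftover_summarize(f.read()):
            print(f"{event}: {command}")
```

No output means every hook command is free of the flag.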
