160 changes: 160 additions & 0 deletions .github/workflows/e2e.yml
@@ -0,0 +1,160 @@
name: E2E Smoke Tests

on:
  pull_request:
    branches: [master]
  push:
    branches: [master]
  workflow_dispatch:
    inputs:
      backend:
        description: 'Backend to test against'
        required: false
        default: 'thanos'
        type: choice
        options:
          - thanos
          - prometheus

jobs:
  e2e:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    strategy:
      fail-fast: false
      matrix:
        backend: [thanos, prometheus]
    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: 17
          cache: maven

      - name: Install sshpass
        run: sudo apt-get install -y sshpass

      - name: Build plugin
        run: mvn clean install -DskipTests

      - name: Deploy KAR
        run: cp assembly/kar/target/opennms-cortex-tss-plugin.kar e2e/opennms-overlay/deploy/

      - name: Start ${{ matrix.backend }} stack
        working-directory: e2e
        run: |
          docker compose --profile ${{ matrix.backend }} up -d
          echo "Waiting for OpenNMS..."
          for i in $(seq 1 60); do
            STATUS=$(curl -s -o /dev/null -w '%{http_code}' -u admin:admin http://localhost:8980/opennms/rest/info 2>/dev/null || echo "000")
            if [ "$STATUS" = "200" ]; then
              echo "OpenNMS is up after ~${i}0s"
              exit 0
            fi
            sleep 10
          done
          # Fail in this step rather than letting a later step report a misleading error
          echo "ERROR: OpenNMS did not come up within 10 minutes"
          exit 1

      - name: Verify all containers are running
        run: |
          echo "=== Container status ==="
          docker ps -a --format 'table {{.Names}}\t{{.Status}}'
          echo ""
          # Fail fast if any e2e container exited
          EXITED=$(docker ps -a --filter "status=exited" --format '{{.Names}}' | grep e2e || true)
          if [ -n "$EXITED" ]; then
            echo "ERROR: Containers have exited: $EXITED"
            for C in $EXITED; do
              echo "=== Logs: $C ==="
              docker logs "$C" 2>&1 | tail -20
            done
            exit 1
          fi
          echo "All containers running"

      - name: Install Cortex plugin feature
        run: |
          ssh-keygen -R "[localhost]:8101" 2>/dev/null || true
          # Wait for Karaf SSH to be ready, then install the feature
          for attempt in $(seq 1 12); do
            sshpass -p admin ssh -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 8101 admin@localhost \
              "feature:install opennms-plugins-cortex-tss" 2>&1 && break
            sleep 5
          done
          # Wait for the feature to fully start (KAR extraction + bundle activation)
          echo "Waiting for health check to pass..."
          for attempt in $(seq 1 24); do
            HEALTH=$(sshpass -p admin ssh -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 8101 admin@localhost \
              "opennms:health-check" 2>&1 || true)
            echo "$HEALTH"
            if echo "$HEALTH" | grep -q "Everything is awesome"; then
              echo "Health check passed on attempt $attempt"
              exit 0
            fi
            sleep 5
          done
          echo "Health check did not pass within 2 minutes"
          exit 1

      - name: Wait for metrics
        run: |
          echo "Waiting for metrics to flow (up to 5 minutes)..."
          for i in $(seq 1 60); do
            COUNT=$(curl -s http://localhost:9090/api/v1/label/__name__/values 2>/dev/null \
              | python3 -c "import sys,json; print(len(json.load(sys.stdin)['data']))" 2>/dev/null || echo "0")
            # Require more than 10 distinct metric names before running the tests
            if [ "$COUNT" -gt "10" ]; then
              echo "Got $COUNT metric names after ~${i}x5s"
              exit 0
            fi
            sleep 5
          done
          echo "ERROR: Fewer than 11 metric names after 5 minutes"
          exit 1

      - name: Run smoke tests
        working-directory: e2e
        run: bash smoke-test.sh --backend ${{ matrix.backend }}

      - name: Collect logs on failure
        if: failure()
        working-directory: e2e
        run: |
          echo "=== All containers (including exited) ==="
          docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
          echo ""

          # Show logs for ALL e2e containers, not just opennms
          for CONTAINER in $(docker ps -a --format '{{.Names}}' | grep -E "e2e-"); do
            echo "=== Logs: $CONTAINER ==="
            docker logs "$CONTAINER" 2>&1 | tail -50
            echo ""
          done

          OPENNMS=$(docker ps -a --format '{{.Names}}' | grep opennms | head -1)
          if [ -n "$OPENNMS" ]; then
            echo "=== Karaf log ==="
            docker exec "$OPENNMS" cat /opt/opennms/logs/karaf.log 2>/dev/null | tail -100
          fi

      - name: Tear down
        if: always()
        working-directory: e2e
        run: docker compose --profile ${{ matrix.backend }} down -v 2>/dev/null || true

  unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: 17
          cache: maven

      - name: Run unit tests
        run: mvn clean install
6 changes: 6 additions & 0 deletions .gitignore
@@ -30,3 +30,9 @@ java/bin/
*.swp
*.iml

# E2E test artifacts
*.kar
e2e/opennms-overlay/etc/org.opennms.plugins.tss.cortex.cfg

# Tool workspace artifacts
docs/superpowers/
74 changes: 74 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,74 @@
# CLAUDE.md

## Project Overview

Cortex TSS Plugin for OpenNMS: a `TimeSeriesStorage` implementation that stores metrics in Cortex/Prometheus-compatible backends via the Prometheus remote write protocol. It is deployed as a KAR file into OpenNMS Karaf.

- **Package**: `org.opennms.timeseries.cortex`
- **Core classes**: `CortexTSS` (implements `TimeSeriesStorage`), `CortexTSSConfig`, `ResultMapper`
- **Build**: `mvn clean install` (standard Maven, Java 17)
- **Unit tests**: `mvn test` — 14 tests (CortexTSSTest + ResultMapperTest)
- **KAR output**: `assembly/kar/target/opennms-cortex-tss-plugin.kar`
- **License**: AGPL v3 — all Java files MUST have the license header

## E2E Test Harness

Located in `e2e/`. Uses docker-compose with Prometheus or Thanos backends.

### One-command E2E (preferred)
```bash
# Builds plugin, deploys KAR, starts stack, installs feature, waits for data, runs 45 tests, tears down:
./e2e/run-e2e.sh --backend thanos

# Skip rebuild if KAR already exists:
./e2e/run-e2e.sh --backend thanos --no-build

# Keep stack running after tests (for debugging):
./e2e/run-e2e.sh --backend thanos --no-teardown
```

### Manual steps (if needed)
```bash
cd e2e
docker-compose --profile thanos up -d
# (wait for OpenNMS + install plugin feature + wait for data)
./smoke-test.sh --backend thanos
```
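
The waits in the manual steps can be scripted with a small polling helper. The following is a minimal sketch, not part of the harness: `wait_until` and `opennms_is_up` are hypothetical names, while the REST URL, credentials, and Karaf command are the ones the CI workflow in this PR uses.

```bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND until it succeeds. Returns 0 on the first success and
# 1 once ATTEMPTS tries have all failed.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between tries, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Succeeds only when the OpenNMS REST API answers with HTTP 200.
opennms_is_up() {
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' -u admin:admin \
    http://localhost:8980/opennms/rest/info 2>/dev/null || true)
  [ "$status" = "200" ]
}

# Example calls (commented out so that sourcing this file has no side effects):
# wait_until 60 10 opennms_is_up || { echo "OpenNMS never came up" >&2; exit 1; }
# wait_until 12 5 sshpass -p admin ssh -o StrictHostKeyChecking=no -p 8101 \
#   admin@localhost "feature:install opennms-plugins-cortex-tss"
```

Returning a non-zero status when the attempts run out lets the caller stop at the step that actually timed out instead of failing later with a less specific error.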

### CI
GitHub Actions runs E2E on every PR and push to `master`, against both Prometheus and Thanos backends, and can also be triggered manually. See `.github/workflows/e2e.yml`.

### Critical Rules for E2E Infrastructure

1. **Pin all image versions explicitly.** Never use `:latest` tags. Never reference `localhost/` images. Every image in docker-compose.yml must use a specific released version tag (e.g., `opennms/horizon:35.0.4`, `thanosio/thanos:v0.35.1`).

2. **Develop and test against released OpenNMS versions only.** Never develop against SNAPSHOT builds. The E2E harness must pass against the current stable release. If a feature requires unreleased OpenNMS behavior, the test must be marked as `skip` with a comment noting the required version.

3. **Tests must assert observable behavior.** Test assertions must check API responses, config file contents, metric data, HTTP status codes, or feature status. Never grep log files for specific messages — log output is an implementation detail that varies by version and log configuration.

4. **Collection intervals must be 30 seconds for testing.** Override the default 300s interval in overlay configs. Waiting 5 minutes per collection cycle during E2E is unacceptable.

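5. **KAR must be pre-built and placed in `e2e/opennms-overlay/deploy/`.** The docker-compose stack does not build the plugin; it deploys whatever KAR is already in that directory. `run-e2e.sh` builds and copies it unless `--no-build` is passed. When running the manual steps, copy the KAR after `mvn install`: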
```bash
cp assembly/kar/target/opennms-cortex-tss-plugin.kar e2e/opennms-overlay/deploy/
```

6. **The Cortex plugin feature must be explicitly installed** after OpenNMS starts. The KAR auto-deploys and registers the feature repo, but the feature itself needs `feature:install opennms-plugins-cortex-tss` via Karaf SSH (port 8101, admin/admin).
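
As an illustration of rule 3, an assertion can check the metric names returned by the Prometheus-compatible label API instead of grepping logs. This is a sketch under assumptions: `count_metric_names` and `assert_min_metrics` are hypothetical helpers, and the endpoint in the usage comment is the one the CI workflow in this PR polls.

```bash
# count_metric_names (hypothetical helper): reads a
# /api/v1/label/__name__/values response on stdin and prints how many
# metric names it lists. Prints 0 for invalid JSON or a missing "data" list.
count_metric_names() {
  python3 -c '
import json, sys
try:
    data = json.load(sys.stdin).get("data", [])
    print(len(data) if isinstance(data, list) else 0)
except (ValueError, AttributeError):
    print(0)
'
}

# assert_min_metrics MIN (hypothetical helper): same input on stdin;
# returns non-zero unless at least MIN metric names are present.
assert_min_metrics() {
  local min="$1" count
  count=$(count_metric_names)
  if [ "$count" -ge "$min" ]; then
    echo "ok: $count metric names (minimum $min)"
  else
    echo "FAIL: $count metric names, expected at least $min" >&2
    return 1
  fi
}

# Example call (commented out; needs a running stack):
# curl -s http://localhost:9090/api/v1/label/__name__/values | assert_min_metrics 11
```

Because the check reads an API response, it does not depend on log wording, which rule 3 says varies by version and log configuration.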

### Docker Compose Gotchas (CI vs Local)

The E2E harness runs on both podman (macOS local) and Docker (GitHub Actions CI). Key differences:

- **Volume permissions**: Docker named volumes are root-owned. Images running as non-root (e.g., Thanos runs as uid 1001) will get `permission denied`. Fix: `user: "0:0"` in docker-compose.yml for affected services.
- **Container naming**: podman-compose uses `_` separators (`e2e_opennms_1`), Docker Compose v2 uses `-` separators (`e2e-opennms-1`). Scripts must handle both: `grep -E "[_-]opennms"`.
- **`set -e` + `grep -c`**: `grep -c` returns exit code 1 on zero matches, which kills `set -e` scripts. Use `|| true` on grep commands, or avoid `set -e`.
- **Matrix `fail-fast`**: GitHub Actions matrix defaults to `fail-fast: true`, canceling sibling jobs on first failure. Always set `fail-fast: false` for independent E2E profiles.
- **Always collect ALL container logs on failure** — not just OpenNMS. A crashed sidecar (thanos-receive, postgres) can cause misleading symptoms (DNS failures, connection refused).
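
The container-naming and `grep -c` points above can be sketched as two small filters. The function names `find_container` and `count_matching` are hypothetical; both read container names from stdin so they work with either `docker ps` or `podman ps` output.

```bash
# find_container SERVICE (hypothetical helper): prints the first container
# name on stdin that matches either naming scheme:
#   podman-compose:    e2e_opennms_1
#   Docker Compose v2: e2e-opennms-1
# Prints nothing, and still returns 0, when there is no match.
find_container() {
  grep -E "[_-]$1" | head -n 1 || true
}

# count_matching PATTERN (hypothetical helper): prints how many names match.
# grep -c prints "0" but exits 1 on zero matches; "|| true" keeps a
# "set -e" script alive in that case.
count_matching() {
  grep -c -E "$1" || true
}

# Example calls (commented out; need a container runtime):
# OPENNMS=$(docker ps -a --format '{{.Names}}' | find_container opennms)
# EXITED=$(docker ps -a --filter "status=exited" --format '{{.Names}}' | count_matching 'e2e')
```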

## Key Conventions

- Config PID: `org.opennms.plugins.tss.cortex` — properties go in `.cfg` files
- Blueprint XML wires the OSGi service — constructor args must match `CortexTSSConfig`
- Wire protocol: Prometheus remote write (protobuf + Snappy) for writes; Prometheus HTTP API for reads
- The pre-existing `FIXME: Data loss` in `CortexTSS.java` is an upstream issue — do not remove or "fix" it without addressing the actual retry/backpressure problem
- Never remove code outside the scope of the current task. If something is brittle, fix it — don't delete it.