Task: Create monitoring dashboards and alerting configuration
Description
Implement comprehensive production monitoring and alerting infrastructure for the Coolify Enterprise platform using Laravel, Grafana, Prometheus, and custom health check systems. This task establishes the observability layer that enables proactive incident detection, performance tracking, and operational insights across the entire multi-tenant enterprise deployment.
The Operational Visibility Challenge:
Operating a multi-tenant enterprise platform presents unique monitoring challenges:
Multi-Tenant Complexity: Track metrics per organization, aggregate globally, detect anomalies
Resource Monitoring: Monitor Terraform deployments, server capacity, queue health, cache performance
Security Events: Track failed authentication, API rate limiting, suspicious activity
Business Metrics: License usage, payment processing, subscription lifecycle events
Performance SLAs: Response times, deployment durations, WebSocket latency
Infrastructure Health: Database connections, Redis memory, disk space, Docker daemon status
Without comprehensive monitoring, production issues remain invisible until customers report them. Silent failures in background jobs, gradual performance degradation, and resource exhaustion can go undetected for hours or days. This task creates the early warning system that transforms reactive firefighting into proactive maintenance.
Solution Architecture:
The monitoring system integrates three complementary layers:
1. Application-Level Metrics (Laravel + Custom Services)
Health check endpoints exposing application state
Database query performance tracking
Job queue monitoring (Horizon integration)
Cache hit rates and Redis memory usage
Custom business metrics (deployments/hour, active licenses, etc.)
2. Infrastructure Monitoring (Prometheus + Node Exporter)
Server CPU, memory, disk, network metrics
Docker container statistics
PostgreSQL connection pool metrics
Redis memory and command statistics
Terraform execution tracking
3. Visualization & Alerting (Grafana + AlertManager)
Real-time dashboards for operations team
Organization-specific dashboards for customers
Alert rules with severity levels (info, warning, critical)
Multi-channel notifications (email, Slack, PagerDuty)
Historical trend analysis and capacity planning
Key Features:
Production Dashboards (Grafana)
System Overview: Health, uptime, request rates, error rates
Resource Dashboard: CPU, memory, disk across all servers
Queue Dashboard: Job throughput, failure rates, queue depth
Terraform Dashboard: Active deployments, success rates, average duration
Organization Dashboard: Per-tenant resource usage and performance
Payment Dashboard: Transaction success rates, revenue metrics
Health Check System (Laravel)
HTTP endpoint /health for load balancer health checks
Detailed diagnostics endpoint /health/detailed (authenticated)
Database connectivity and query performance checks
Redis connectivity and memory checks
Queue worker process verification
Terraform binary availability check
Cloud provider API connectivity check
Disk space and filesystem health check
Alert Configuration (Prometheus AlertManager)
Critical: Database down, queue workers stopped, disk > 90% full
Warning: High error rate (> 1%), slow queries (> 1s), queue depth > 1000
Info: Deployment completed, license expiring soon, payment succeeded
Custom: Organization-specific SLA violations
On-call rotation with PagerDuty integration
Alert deduplication and grouping
Custom Metrics Collection (Laravel Middleware + Jobs)
HTTP request duration histogram
API endpoint hit counts
Deployment success/failure rates
License validation latency
Payment processing success rates
WebSocket connection counts
Organization resource quota usage
Log Aggregation (Optional - Preparation for ELK/Loki)
Structured logging with organization context
Error tracking with stack traces
Audit logging for security events
Performance logging for slow queries
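The structured-logging items above can be sketched with Laravel's log-context helper so that every log line carries the tenant. This is a minimal illustration, not part of this task's file list; the middleware name is hypothetical.

```php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;

// Hypothetical middleware: attaches organization context to every log entry
// written during the request, so aggregated logs (ELK/Loki) can be filtered
// per tenant without changing individual Log::info()/Log::error() call sites.
class AddLogContext
{
    public function handle(Request $request, Closure $next): mixed
    {
        Log::withContext([
            'organization_id' => $request->user()?->current_organization_id,
            'request_id' => $request->header('X-Request-Id'),
        ]);

        return $next($request);
    }
}
```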
Integration Points:
Existing Infrastructure:
Laravel Horizon: Queue monitoring built in; expose metrics via a Prometheus exporter
Laravel Telescope: Development debugging; disable in production but preserve logging patterns
Reverb WebSocket: Add connection count metrics
Existing Jobs: Add duration tracking to TerraformDeploymentJob, ResourceMonitoringJob, etc.
New Components:
HealthCheckService: Centralized health check logic
MetricsCollector: Custom Prometheus metric collection
AlertingService: Business event → alert mapping
GrafanaProvisioner: Automated dashboard deployment
Why This Task is Critical:
Monitoring is not optional for production systems—it's the difference between knowing issues exist and discovering them through customer complaints. For multi-tenant enterprise platforms, monitoring becomes even more critical:
Customer SLA Compliance: Prove uptime and performance commitments with metrics
Capacity Planning: Identify resource bottlenecks before they cause outages
Security Incident Response: Detect and respond to attacks in real time
Performance Optimization: Identify slow queries and inefficient code paths
Business Intelligence: Track platform growth, usage patterns, revenue trends
On-Call Effectiveness: Alert on-call engineers with actionable context
This task establishes the foundation for reliable operations at scale, enabling the team to maintain high availability and performance as the platform grows.
Acceptance Criteria
Prometheus server deployed and collecting metrics from all application nodes
Grafana deployed with data source connected to Prometheus
8+ production dashboards created (System, Resource, Queue, Terraform, Organization, Payment, Security, Business)
Health check endpoint /health returns 200 OK when system healthy
Detailed health check endpoint /health/detailed returns comprehensive diagnostics
HealthCheckService implements 10+ health checks (database, Redis, queue, disk, etc.)
MetricsCollector middleware tracks HTTP request duration and status codes
Custom metrics exported for business events (deployments, licenses, payments)
AlertManager configured with alert rules (critical, warning, info levels)
Alert rules created for critical scenarios (database down, queue stopped, disk full)
Multi-channel alerting configured (email, Slack, PagerDuty)
Alert deduplication and grouping configured
Organization-specific metrics filtered and displayed correctly
Historical data retention configured (30 days detailed, 1 year aggregated)
Dashboard refresh rates optimized (real-time: 5s, historical: 1m)
Grafana authentication integrated with Laravel Sanctum or SSO
API documentation for health check and metrics endpoints
Operational runbook for interpreting alerts and dashboards
Technical Details
File Paths
Health Check System:
/home/topgun/topgun/app/Services/Monitoring/HealthCheckService.php (new)
/home/topgun/topgun/app/Http/Controllers/HealthCheckController.php (new)
/home/topgun/topgun/routes/web.php (modify - add health check routes)
Metrics Collection:
/home/topgun/topgun/app/Services/Monitoring/MetricsCollector.php (new)
/home/topgun/topgun/app/Http/Middleware/CollectMetrics.php (new)
/home/topgun/topgun/app/Console/Commands/ExportMetrics.php (new)
Alert Configuration:
/home/topgun/topgun/app/Services/Monitoring/AlertingService.php (new)
/home/topgun/topgun/config/monitoring.php (new)
Infrastructure (Deployment):
/home/topgun/topgun/docker/prometheus/prometheus.yml (new)
/home/topgun/topgun/docker/prometheus/alerts.yml (new)
/home/topgun/topgun/docker/grafana/provisioning/datasources/prometheus.yml (new)
/home/topgun/topgun/docker/grafana/provisioning/dashboards/ (dashboard JSON files)
/home/topgun/topgun/docker-compose.monitoring.yml (new - monitoring stack)
Documentation:
/home/topgun/topgun/docs/operations/monitoring-guide.md (new)
/home/topgun/topgun/docs/operations/alert-runbook.md (new)
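The docker-compose.monitoring.yml listed above is not shown elsewhere in this task; a minimal sketch of its shape follows. Image tags, volume names, and ports are illustrative assumptions, not pinned choices from this task.

```yaml
# Sketch of docker-compose.monitoring.yml (versions and volumes are placeholders)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./docker/prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - '9090:9090'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - '9093:9093'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./docker/grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    ports:
      - '3000:3000'

  node-exporter:
    image: prom/node-exporter:latest

volumes:
  prometheus-data:
  grafana-data:
```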
Database Schema
No new database tables required. Existing tables used for metrics:
-- Query for organization metrics
SELECT
    organization_id,
    COUNT(DISTINCT server_id) AS server_count,
    COUNT(DISTINCT application_id) AS app_count,
    SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) AS running_apps
FROM applications
WHERE deleted_at IS NULL
GROUP BY organization_id;

-- Query for deployment metrics
SELECT
    DATE_TRUNC('hour', created_at) AS hour,
    COUNT(*) AS total_deployments,
    COUNT(*) FILTER (WHERE status = 'completed') AS successful_deployments,
    AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) AS avg_duration_seconds
FROM terraform_deployments
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;
HealthCheckService Implementation
File: app/Services/Monitoring/HealthCheckService.php
<?php

namespace App\Services\Monitoring;

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\Facades\Redis;
use Symfony\Component\Process\Process;

class HealthCheckService
{
    /**
     * Run all health checks
     *
     * @return array Health check results
     */
    public function runAll(): array
    {
        $results = [
            'status' => 'healthy',
            'timestamp' => now()->toIso8601String(),
            'checks' => [],
            'metadata' => [
                'environment' => config('app.env'),
                'version' => config('app.version', 'unknown'),
            ],
        ];

        // Run all checks
        $results['checks']['database'] = $this->checkDatabase();
        $results['checks']['redis'] = $this->checkRedis();
        $results['checks']['queue'] = $this->checkQueue();
        $results['checks']['disk'] = $this->checkDiskSpace();
        $results['checks']['terraform'] = $this->checkTerraform();
        $results['checks']['docker'] = $this->checkDocker();
        $results['checks']['reverb'] = $this->checkReverb();

        // Determine overall health status
        foreach ($results['checks'] as $check) {
            if ($check['status'] === 'unhealthy') {
                $results['status'] = 'unhealthy';
                break;
            } elseif ($check['status'] === 'degraded' && $results['status'] === 'healthy') {
                $results['status'] = 'degraded';
            }
        }

        return $results;
    }

    /**
     * Check database connectivity and performance
     *
     * @return array
     */
    private function checkDatabase(): array
    {
        try {
            $start = microtime(true);

            // Test connection
            DB::connection()->getPdo();

            // Test query performance
            DB::table('organizations')->limit(1)->get();

            $duration = (microtime(true) - $start) * 1000;

            // Get connection pool stats
            $connections = DB::select('SELECT count(*) as active_connections FROM pg_stat_activity');
            $activeConnections = $connections[0]->active_connections ?? 0;

            $status = 'healthy';
            if ($duration > 1000) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Database connection healthy',
                'latency_ms' => round($duration, 2),
                'active_connections' => $activeConnections,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Database connection failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Redis connectivity and memory usage
     *
     * @return array
     */
    private function checkRedis(): array
    {
        try {
            $start = microtime(true);

            // Test connection
            Cache::store('redis')->get('health-check-test');

            $duration = (microtime(true) - $start) * 1000;

            // Get Redis info
            $redis = Redis::connection();
            $info = $redis->info('memory');

            $usedMemory = $info['used_memory_human'] ?? 'unknown';
            $maxMemory = $info['maxmemory_human'] ?? 'unlimited';

            $status = 'healthy';
            if ($duration > 100) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Redis connection healthy',
                'latency_ms' => round($duration, 2),
                'used_memory' => $usedMemory,
                'max_memory' => $maxMemory,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Redis connection failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check queue worker status
     *
     * @return array
     */
    private function checkQueue(): array
    {
        try {
            // Read the queue restart signal (set by queue:restart / Horizon)
            $masters = Cache::get('illuminate:queue:restart');

            // Get queue size
            $queueSize = Queue::size('default');
            $terraformQueueSize = Queue::size('terraform');

            $status = 'healthy';
            if ($queueSize > 1000 || $terraformQueueSize > 50) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Queue system operational',
                'default_queue_size' => $queueSize,
                'terraform_queue_size' => $terraformQueueSize,
                'horizon_restart' => $masters !== null,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Queue check failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check disk space
     *
     * @return array
     */
    private function checkDiskSpace(): array
    {
        try {
            $path = base_path();
            $freeSpace = disk_free_space($path);
            $totalSpace = disk_total_space($path);
            $percentUsed = 100 - (($freeSpace / $totalSpace) * 100);

            $status = 'healthy';
            if ($percentUsed > 90) {
                $status = 'unhealthy';
            } elseif ($percentUsed > 80) {
                $status = 'degraded';
            }

            return [
                'status' => $status,
                'message' => 'Disk space sufficient',
                'percent_used' => round($percentUsed, 2),
                'free_space_gb' => round($freeSpace / (1024 ** 3), 2),
                'total_space_gb' => round($totalSpace / (1024 ** 3), 2),
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Disk space check failed',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Terraform binary availability
     *
     * @return array
     */
    private function checkTerraform(): array
    {
        try {
            $terraformPath = config('terraform.binary_path', '/usr/local/bin/terraform');

            $process = new Process([$terraformPath, 'version', '-json']);
            $process->run();

            if ($process->isSuccessful()) {
                $output = json_decode($process->getOutput(), true);

                return [
                    'status' => 'healthy',
                    'message' => 'Terraform available',
                    'version' => $output['terraform_version'] ?? 'unknown',
                    'path' => $terraformPath,
                ];
            }

            return [
                'status' => 'degraded',
                'message' => 'Terraform command failed',
                'error' => $process->getErrorOutput(),
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Terraform binary not found',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Docker daemon connectivity
     *
     * @return array
     */
    private function checkDocker(): array
    {
        try {
            $process = new Process(['docker', 'version', '--format', '{{.Server.Version}}']);
            $process->run();

            if ($process->isSuccessful()) {
                return [
                    'status' => 'healthy',
                    'message' => 'Docker daemon accessible',
                    'version' => trim($process->getOutput()),
                ];
            }

            return [
                'status' => 'degraded',
                'message' => 'Docker command failed',
                'error' => $process->getErrorOutput(),
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'unhealthy',
                'message' => 'Docker daemon not accessible',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Check Reverb WebSocket server
     *
     * @return array
     */
    private function checkReverb(): array
    {
        try {
            // Check if the Reverb process is running
            $process = new Process(['pgrep', '-f', 'reverb:start']);
            $process->run();

            $isRunning = $process->isSuccessful();

            return [
                'status' => $isRunning ? 'healthy' : 'degraded',
                'message' => $isRunning ? 'Reverb WebSocket server running' : 'Reverb not detected',
                'running' => $isRunning,
            ];
        } catch (\Exception $e) {
            return [
                'status' => 'degraded',
                'message' => 'Could not check Reverb status',
                'error' => $e->getMessage(),
            ];
        }
    }

    /**
     * Get quick health status (for load balancer)
     *
     * @return bool
     */
    public function isHealthy(): bool
    {
        try {
            // Quick checks only
            DB::connection()->getPdo();
            Cache::store('redis')->get('health-check-test');

            return true;
        } catch (\Exception $e) {
            return false;
        }
    }
}
HealthCheckController Implementation
File: app/Http/Controllers/HealthCheckController.php
<?php

namespace App\Http\Controllers;

use App\Services\Monitoring\HealthCheckService;
use Illuminate\Http\JsonResponse;

class HealthCheckController extends Controller
{
    public function __construct(
        private HealthCheckService $healthCheckService
    ) {}

    /**
     * Simple health check for load balancers
     *
     * @return JsonResponse
     */
    public function index(): JsonResponse
    {
        if ($this->healthCheckService->isHealthy()) {
            return response()->json([
                'status' => 'healthy',
                'timestamp' => now()->toIso8601String(),
            ]);
        }

        return response()->json([
            'status' => 'unhealthy',
            'timestamp' => now()->toIso8601String(),
        ], 503);
    }

    /**
     * Detailed health check (authenticated)
     *
     * @return JsonResponse
     */
    public function detailed(): JsonResponse
    {
        $results = $this->healthCheckService->runAll();

        $statusCode = match ($results['status']) {
            'healthy' => 200,
            'degraded' => 200,
            'unhealthy' => 503,
            default => 500,
        };

        return response()->json($results, $statusCode);
    }
}
MetricsCollector Middleware
File: app/Http/Middleware/CollectMetrics.php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Redis;

class CollectMetrics
{
    /**
     * Handle an incoming request
     *
     * @param Request $request
     * @param Closure $next
     * @return mixed
     */
    public function handle(Request $request, Closure $next): mixed
    {
        $start = microtime(true);

        $response = $next($request);

        $duration = (microtime(true) - $start) * 1000;

        // Collect metrics
        $this->recordMetric([
            'type' => 'http_request',
            'method' => $request->method(),
            'path' => $request->path(),
            'status' => $response->status(),
            'duration_ms' => round($duration, 2),
            'timestamp' => now()->timestamp,
            'organization_id' => $request->user()?->current_organization_id,
        ]);

        return $response;
    }

    /**
     * Record metric to Redis for Prometheus scraping
     *
     * @param array $metric
     * @return void
     */
    private function recordMetric(array $metric): void
    {
        try {
            // Store in a Redis list for the Prometheus exporter to consume.
            // The Cache repository does not expose list commands, so use the
            // Redis facade directly.
            Redis::connection()->rpush('metrics:http_requests', json_encode($metric));

            // Trim to the last 10,000 metrics to prevent unbounded growth
            Redis::connection()->ltrim('metrics:http_requests', -10000, -1);
        } catch (\Exception $e) {
            // Fail silently - don't let metrics collection break requests
            Log::debug('Failed to record metric', ['error' => $e->getMessage()]);
        }
    }
}
Configuration File
File: config/monitoring.php
<?php

return [
    /*
    |--------------------------------------------------------------------------
    | Health Check Configuration
    |--------------------------------------------------------------------------
    */
    'health_checks' => [
        'enabled' => env('HEALTH_CHECKS_ENABLED', true),
        'cache_ttl' => env('HEALTH_CHECK_CACHE_TTL', 30), // Cache results for 30 seconds
    ],

    /*
    |--------------------------------------------------------------------------
    | Metrics Collection
    |--------------------------------------------------------------------------
    */
    'metrics' => [
        'enabled' => env('METRICS_COLLECTION_ENABLED', true),
        'endpoints' => [
            'http_requests' => true,
            'queue_jobs' => true,
            'database_queries' => false, // Too verbose for production
        ],
    ],

    /*
    |--------------------------------------------------------------------------
    | Alerting Configuration
    |--------------------------------------------------------------------------
    */
    'alerting' => [
        'enabled' => env('ALERTING_ENABLED', true),
        'channels' => [
            'email' => [
                'enabled' => env('ALERT_EMAIL_ENABLED', true),
                'to' => env('ALERT_EMAIL_TO', 'ops@example.com'),
            ],
            'slack' => [
                'enabled' => env('ALERT_SLACK_ENABLED', false),
                'webhook_url' => env('ALERT_SLACK_WEBHOOK_URL'),
            ],
            'pagerduty' => [
                'enabled' => env('ALERT_PAGERDUTY_ENABLED', false),
                'integration_key' => env('ALERT_PAGERDUTY_KEY'),
            ],
        ],
        'thresholds' => [
            'error_rate' => env('ALERT_ERROR_RATE_THRESHOLD', 0.01), // 1%
            'response_time_p95' => env('ALERT_RESPONSE_TIME_P95_MS', 1000), // 1 second
            'queue_depth' => env('ALERT_QUEUE_DEPTH_THRESHOLD', 1000),
            'disk_usage_percent' => env('ALERT_DISK_USAGE_PERCENT', 90),
        ],
    ],

    /*
    |--------------------------------------------------------------------------
    | Prometheus Configuration
    |--------------------------------------------------------------------------
    */
    'prometheus' => [
        'enabled' => env('PROMETHEUS_ENABLED', true),
        'scrape_interval' => env('PROMETHEUS_SCRAPE_INTERVAL', '15s'),
        'retention_days' => env('PROMETHEUS_RETENTION_DAYS', 30),
    ],

    /*
    |--------------------------------------------------------------------------
    | Grafana Configuration
    |--------------------------------------------------------------------------
    */
    'grafana' => [
        'enabled' => env('GRAFANA_ENABLED', true),
        'url' => env('GRAFANA_URL', 'http://grafana:3000'),
        'admin_user' => env('GRAFANA_ADMIN_USER', 'admin'),
        'admin_password' => env('GRAFANA_ADMIN_PASSWORD', 'admin'),
    ],
];
Prometheus Configuration
File: docker/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'coolify-enterprise'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alerting rules
rule_files:
  - 'alerts.yml'

# Scrape configurations
scrape_configs:
  # Laravel application metrics
  - job_name: 'laravel-app'
    static_configs:
      - targets: ['app:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # Node Exporter for server metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
    scrape_interval: 30s

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 30s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Alert Rules Configuration
File: docker/prometheus/alerts.yml
groups:
  - name: critical_alerts
    interval: 1m
    rules:
      # Database down
      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database {{ $labels.instance }} has been down for more than 1 minute"

      # Redis down
      - alert: RedisDown
        expr: up{job="redis"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis cache is down"
          description: "Redis instance {{ $labels.instance }} is unreachable"

      # High disk usage
      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low"
          description: "Disk usage on {{ $labels.instance }} is above 90% ({{ $value }}% available)"

      # Queue workers stopped
      - alert: QueueWorkersDown
        expr: horizon_workers_total == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Queue workers are not running"
          description: "No Horizon workers detected for 2 minutes"

  - name: warning_alerts
    interval: 5m
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate detected"
          description: "Error rate is {{ humanizePercentage $value }} over the last 5 minutes"

      # Slow database queries
      - alert: SlowDatabaseQueries
        expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries detected"
          description: "95th percentile query time is {{ humanizeDuration $value }}"

      # High queue depth
      - alert: HighQueueDepth
        expr: horizon_queue_depth > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth is high"
          description: "Queue {{ $labels.queue }} has {{ $value }} pending jobs"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage is high"
          description: "Available memory on {{ $labels.instance }} is below 20%"

  - name: info_alerts
    interval: 15m
    rules:
      # Deployment completed
      - alert: DeploymentCompleted
        expr: increase(terraform_deployments_completed_total[15m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Infrastructure deployment completed"
          description: "{{ $value }} Terraform deployment(s) completed in the last 15 minutes"

      # License expiring soon
      - alert: LicenseExpiringSoon
        expr: (enterprise_license_expiry_timestamp - time()) < 604800
        labels:
          severity: info
        annotations:
          summary: "Enterprise license expiring soon"
          description: "License for organization {{ $labels.organization }} expires in {{ humanizeDuration $value }}"
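The alert rules above route through AlertManager; the alertmanager.yml created in the implementation steps is not shown elsewhere in this task, so here is a minimal routing sketch. Receiver names, grouping keys, and timings are illustrative; webhook URLs and integration keys come from the environment.

```yaml
# Sketch of docker/prometheus/alertmanager.yml
route:
  receiver: 'ops-email'
  group_by: ['alertname', 'severity']   # deduplicate and group related alerts
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'             # page on-call for critical alerts
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'ops-email'
    email_configs:
      - to: 'ops@example.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'REPLACE_ME'
```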
Grafana Dashboard Provisioning
File: docker/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      queryTimeout: '60s'
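Alongside the datasource, the dashboard JSON files under docker/grafana/provisioning/dashboards/ need a provider entry so Grafana loads them. A minimal sketch (the provider name and folder are illustrative):

```yaml
# Sketch of docker/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'coolify-dashboards'
    orgId: 1
    folder: 'Coolify Enterprise'
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```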
Routes Configuration
File: routes/web.php (add these routes)
// Health check endpoints
Route::get('/health', [HealthCheckController::class, 'index'])
    ->name('health');

Route::get('/health/detailed', [HealthCheckController::class, 'detailed'])
    ->middleware('auth:sanctum')
    ->name('health.detailed');

// Metrics endpoint (for Prometheus scraping)
Route::get('/metrics', [MetricsController::class, 'export'])
    ->middleware('throttle:60,1')
    ->name('metrics');
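The MetricsController referenced by the /metrics route is not specified in this task; one way to implement its export method is to drain the Redis list written by CollectMetrics and render the Prometheus text exposition format. This is a sketch under those assumptions, not the task's definitive implementation.

```php
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Response;
use Illuminate\Support\Facades\Redis;

// Hypothetical controller: aggregates the raw request metrics recorded by the
// CollectMetrics middleware into a labelled counter in the Prometheus text
// exposition format (text/plain; version=0.0.4).
class MetricsController extends Controller
{
    public function export(): Response
    {
        $raw = Redis::connection()->lrange('metrics:http_requests', 0, -1);

        // Count requests per (method, status) label pair
        $counts = [];
        foreach ($raw as $json) {
            $metric = json_decode($json, true);
            $key = sprintf('method="%s",status="%d"', $metric['method'], $metric['status']);
            $counts[$key] = ($counts[$key] ?? 0) + 1;
        }

        $lines = [
            '# HELP http_requests_total HTTP requests observed by CollectMetrics',
            '# TYPE http_requests_total counter',
        ];
        foreach ($counts as $labels => $count) {
            $lines[] = "http_requests_total{{$labels}} {$count}";
        }

        return response(implode("\n", $lines) . "\n", 200)
            ->header('Content-Type', 'text/plain; version=0.0.4');
    }
}
```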
Implementation Approach
Step 1: Set Up Health Check System
Create HealthCheckService with all check methods
Create HealthCheckController with simple and detailed endpoints
Register routes in web.php
Test health checks manually
Step 2: Implement Metrics Collection
Create MetricsCollector middleware
Register middleware in Kernel.php
Create MetricsController for Prometheus export
Test metrics collection and export
Step 3: Deploy Prometheus
Create prometheus.yml configuration
Create alerts.yml with alert rules
Add Prometheus to docker-compose.monitoring.yml
Deploy and verify scraping
Step 4: Deploy Grafana
Create datasource provisioning configuration
Create dashboard JSON files (System, Resource, Queue, etc.)
Add Grafana to docker-compose.monitoring.yml
Configure authentication
Step 5: Configure AlertManager
Create alertmanager.yml configuration
Configure notification channels (email, Slack, PagerDuty)
Test alert routing and delivery
Set up alert deduplication
Step 6: Create Dashboards
System Overview Dashboard (general health)
Resource Dashboard (CPU, memory, disk)
Queue Dashboard (Horizon metrics)
Terraform Dashboard (deployment tracking)
Organization Dashboard (per-tenant metrics)
Payment Dashboard (transaction tracking)
Security Dashboard (failed auth, rate limits)
Business Dashboard (KPIs, growth metrics)
Step 7: Integrate with Existing Systems
Add metrics to TerraformDeploymentJob
Add metrics to ResourceMonitoringJob
Add metrics to payment processing
Add metrics to license validation
Step 8: Documentation
Write monitoring guide
Write alert runbook
Document dashboard usage
Create troubleshooting guide
Step 9: Testing
Trigger alerts manually
Verify alert delivery
Test dashboard functionality
Load test metrics collection
Step 10: Deployment and Training
Deploy monitoring stack to production
Train operations team on dashboards
Establish on-call rotation
Document escalation procedures
Test Strategy
Unit Tests
File: tests/Unit/Services/HealthCheckServiceTest.php
<?php

use App\Services\Monitoring\HealthCheckService;
use Illuminate\Support\Facades\DB;

beforeEach(function () {
    $this->healthCheckService = app(HealthCheckService::class);
});

it('returns healthy status when all checks pass', function () {
    $result = $this->healthCheckService->runAll();

    expect($result['status'])->toBe('healthy');
    expect($result['checks'])->toHaveKeys([
        'database', 'redis', 'queue', 'disk', 'terraform', 'docker', 'reverb',
    ]);
});

it('checks database connectivity', function () {
    $result = invade($this->healthCheckService)->checkDatabase();

    expect($result)->toHaveKeys(['status', 'message', 'latency_ms']);
    expect($result['status'])->toBeIn(['healthy', 'degraded']);
});

it('checks Redis connectivity', function () {
    $result = invade($this->healthCheckService)->checkRedis();

    expect($result)->toHaveKeys(['status', 'message', 'latency_ms']);
    expect($result['status'])->toBeIn(['healthy', 'degraded']);
});

it('detects unhealthy state when database is down', function () {
    DB::shouldReceive('connection->getPdo')
        ->andThrow(new \PDOException('Connection failed'));

    $result = invade($this->healthCheckService)->checkDatabase();

    expect($result['status'])->toBe('unhealthy');
    expect($result)->toHaveKey('error');
});

it('provides quick health status for load balancers', function () {
    $isHealthy = $this->healthCheckService->isHealthy();

    expect($isHealthy)->toBeTrue();
});
Integration Tests
File: tests/Feature/Monitoring/HealthCheckEndpointTest.php
<?php

use App\Models\User;
use Illuminate\Support\Facades\DB;

it('returns 200 OK when system is healthy', function () {
    $response = $this->get('/health');

    $response->assertOk();
    $response->assertJson([
        'status' => 'healthy',
    ]);
});

it('requires authentication for detailed health check', function () {
    // Use getJson so the sanctum guard returns 401 instead of redirecting
    $response = $this->getJson('/health/detailed');

    $response->assertUnauthorized();
});

it('returns detailed health information when authenticated', function () {
    $user = User::factory()->create();

    $response = $this->actingAs($user)
        ->getJson('/health/detailed');

    $response->assertOk();
    $response->assertJsonStructure([
        'status',
        'timestamp',
        'checks' => [
            'database',
            'redis',
            'queue',
            'disk',
        ],
        'metadata',
    ]);
});

it('returns 503 when system is unhealthy', function () {
    // Mock database failure
    DB::shouldReceive('connection->getPdo')
        ->andThrow(new \PDOException('Connection failed'));

    $response = $this->get('/health');

    $response->assertStatus(503);
    $response->assertJson([
        'status' => 'unhealthy',
    ]);
});
Metrics Collection Tests
File: tests/Feature/Monitoring/MetricsCollectionTest.php
<?php

use Illuminate\Support\Facades\Redis;

it('collects HTTP request metrics', function () {
    $this->get('/api/organizations');

    // Check metrics were recorded (the middleware writes to a Redis list)
    $metrics = Redis::connection()->lrange('metrics:http_requests', -1, -1);

    expect($metrics)->not->toBeEmpty();

    $metric = json_decode($metrics[0], true);
    expect($metric)->toHaveKeys(['type', 'method', 'path', 'status', 'duration_ms']);
    expect($metric['type'])->toBe('http_request');
});

it('does not break requests when metrics fail', function () {
    // Simulate Redis failure
    Redis::shouldReceive('connection->rpush')
        ->andThrow(new \Exception('Redis down'));

    // Request should still succeed
    $response = $this->get('/api/organizations');
    $response->assertOk();
});
Alert Testing
Manual Test Plan:
Database Down Alert
Stop PostgreSQL container
Verify alert fires within 1 minute
Verify notification delivery
Restart PostgreSQL
Verify alert resolves
High Disk Usage Alert
Fill disk to >90%
Verify alert fires within 5 minutes
Clean up disk space
Verify alert resolves
High Error Rate Alert
Trigger 500 errors (e.g., break database connection)
Generate traffic to hit 1% error threshold
Verify alert fires
Fix error source
Verify alert resolves
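Notification delivery can also be exercised without breaking real infrastructure by posting a synthetic alert straight to AlertManager's v2 API. This assumes AlertManager is reachable on localhost:9093 as in the monitoring stack above.

```
# Send a synthetic alert to Alertmanager to verify routing and delivery
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {"alertname": "SyntheticTest", "severity": "warning"},
        "annotations": {"summary": "Manual alert-delivery test"}
      }]'
```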
Definition of Done
HealthCheckService implemented with 10+ health checks
HealthCheckController created with simple and detailed endpoints
Health check routes registered and tested
MetricsCollector middleware implemented
Metrics export endpoint created for Prometheus
Prometheus deployed and scraping metrics
Grafana deployed with Prometheus datasource
8+ production dashboards created (System, Resource, Queue, Terraform, Organization, Payment, Security, Business)
AlertManager configured with notification channels
Alert rules created for critical, warning, and info levels
Alerts tested and verified delivery
Organization-specific metrics filtering working
Historical data retention configured (30 days detailed, 1 year aggregated)
Grafana authentication configured
Monitoring configuration documented
Alert runbook created with response procedures
Operations team trained on dashboards
Unit tests written for health checks (>90% coverage)
Integration tests written for endpoints
Manual alert testing completed
Production deployment successful
On-call rotation established
Laravel Pint formatting applied
PHPStan level 5 passing
Code reviewed and approved
Related Tasks
Depends on: Task 89 (CI/CD pipeline for deployment automation)
Integrates with: Task 18 (TerraformDeploymentJob metrics)
Integrates with: Task 24 (ResourceMonitoringJob metrics)
Integrates with: Task 46 (PaymentService metrics)
Integrates with: Task 54 (API rate limiting metrics)
Supports: All production operations and incident response