PulseWatch is a next-generation AIOps (Artificial Intelligence for IT Operations) platform that automates the detection of anomalies, forecasts system behavior, and visualizes real-time metrics through intelligent dashboards. By combining machine learning, Prometheus-based metric ingestion, and FastAPI services, it enables proactive IT operations and helps teams maintain reliability, stability, and performance at scale.
PulseWatch bridges the gap between data monitoring and intelligent automation, turning raw infrastructure data into actionable, predictive insights. It's designed to make IT systems smarter, more autonomous, and self-healing.
- Real-time metric monitoring from Prometheus
- Anomaly detection using LSTM Autoencoders
- Predictive forecasting with Facebook Prophet
- Modular backend built on FastAPI
- Live visualization via Streamlit Dashboards
- Automated model retraining and configurable thresholds
- Automate system health monitoring through intelligent models
- Predict anomalies and performance issues before they occur
- Continuously retrain models to adapt to evolving workloads
- Provide unified visibility through live dashboards
- Enable proactive incident prevention and capacity planning
| Component | Technology | Purpose |
|---|---|---|
| Backend | FastAPI | REST API service layer |
| Dashboard | Streamlit | Visualization and interaction |
| Metric Collection | Prometheus | Real-time metric scraping |
| ML Models | PyTorch LSTM Autoencoder, Facebook Prophet | Anomaly detection & forecasting |
| Data Layer | Pandas, NumPy | Processing and feature engineering |
PulseWatch operates as a closed-loop AIOps pipeline:
- Collect metrics from Prometheus exporters
- Aggregate & preprocess data into unified time-series
- Detect anomalies using LSTM Autoencoders
- Forecast future states using Prophet models
- Visualize & alert through Streamlit dashboards
- Retrain models periodically for continual learning
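The aggregation step of this pipeline can be sketched with pandas: per-metric series are aligned on timestamp and merged into one unified time-series frame. This is only an illustrative sketch — the function and column names below are assumptions, not PulseWatch's actual code.

```python
import pandas as pd

def merge_metric_frames(frames: dict) -> pd.DataFrame:
    """Merge per-metric DataFrames (each with 'timestamp' and 'value' columns)
    into one wide time-series frame, aligned on timestamp.
    Illustrative only -- names are assumptions, not the project's API."""
    merged = None
    for name, df in frames.items():
        df = df.rename(columns={"value": name}).set_index("timestamp")
        merged = df if merged is None else merged.join(df, how="outer")
    # Fill gaps left by misaligned scrape times
    return merged.sort_index().interpolate().ffill().bfill()

# Example with two synthetic metric streams
ts = pd.date_range("2024-01-01", periods=4, freq="30s")
cpu = pd.DataFrame({"timestamp": ts, "value": [10.0, 12.0, 11.0, 50.0]})
mem = pd.DataFrame({"timestamp": ts, "value": [40.0, 41.0, 42.0, 43.0]})
unified = merge_metric_frames({"cpu": cpu, "memory": mem})
print(unified.columns.tolist())  # ['cpu', 'memory']
```

An outer join plus interpolation keeps every scrape, even when exporters report at slightly different instants.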
Directory structure:
```
sakshikalunge07-aiops/
├── README.md
└── PulseWatch-backend/
    ├── dashboard.py
    ├── model_handler.py
    ├── prometheus.yml
    ├── requirements.txt
    ├── run.py
    ├── test.py
    ├── user_config.yml
    ├── app/
    │   ├── __init__.py
    │   ├── config.py
    │   ├── main.py
    │   └── prometheus.py
    ├── data/
    │   └── merged_metrics.csv
    ├── model/
    │   ├── anamoly_detection.py
    │   ├── prophet_model.py
    │   └── scaler.gz
    └── sample_data/
        ├── multivariant_data.csv
        └── univariant_data.csv
```
- Python 3.9+
- Prometheus installed locally or remotely
- Node Exporter / Windows Exporter running
- (Optional) Docker & Kubernetes for deployment
```bash
git clone https://github.com/SakshiKalunge07/AIOps.git
cd sakshikalunge07-aiops/PulseWatch-backend
pip install -r requirements.txt
```

Edit prometheus.yml as per your system:
```yaml
scrape_configs:
  - job_name: "linux_exporter"
    static_configs:
      - targets: ["localhost:9100"]
```

Start the backend:

```bash
python -m app.main
```

Your API runs at: http://127.0.0.1:8000
Launch the dashboard:

```bash
streamlit run dashboard.py
```

Access at: http://localhost:8501
Once running, PulseWatch provides three operational modes through the Streamlit dashboard:
| Mode | Description |
|---|---|
| Live Metrics | Displays real-time metrics fetched from Prometheus |
| Long-Term Prediction | Shows forecasted behavior and historical anomalies |
| Test Mode | Runs local anomaly detection using sample data with synthetic spikes |
- Metrics Type Selector: Choose between live, long-term, or test datasets
- Refresh Interval: Adjust real-time polling frequency
- Train Model Button: Trigger backend model retraining
- Anomaly Count Display: View total anomalies detected per session
- Launch backend and Prometheus
- Start Streamlit dashboard
- Select "Live Metrics"
- Observe anomaly detection results in real time
- Switch to "Test on Sample Data" to simulate spikes
- Trigger retraining when performance drifts
All runtime behavior for PulseWatch is controlled via the user_config.yml file located at the project root. This file defines endpoints, model parameters, metric queries, and other operational settings used across both the backend and dashboard.
| Section | Description |
|---|---|
| `app` | General application metadata like name and version |
| `prometheus` | Base endpoint for Prometheus queries (usually `/api/v1/query_range`) |
| `metrics` | PromQL queries for CPU, Memory, and Latency metrics |
| `lstm_model` | Configuration for the LSTM Autoencoder used in anomaly detection |
| `prophet_model` | Forecasting parameters for the Prophet model |
| `dashboard` | Dashboard behavior, refresh intervals, and display settings |
Defines global identifiers for the project. Used primarily for logging and metadata display.
```yaml
app:
  name: "PulseWatch"
  version: "1.0.0"
  mode: "production"
```

Specifies the endpoint for querying system metrics. This must point to your active Prometheus server.
```yaml
prometheus:
  url: "http://localhost:9090/api/v1/query_range"
  step: "30s"    # Query step size (granularity)
  window: "1h"   # Time range for each data pull
```

Each metric is defined as a PromQL query. You can extend or modify these queries based on your system exporters.
```yaml
metrics:
  cpu: "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode='idle'}[1m])) * 100)"
  memory: "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
  latency: "rate(node_network_transmit_errs_total[1m])"
```

Controls how the LSTM Autoencoder behaves during anomaly detection and retraining. Tuning these parameters directly affects sensitivity and performance.
```yaml
lstm_model:
  enabled: true
  sequence_length: 50                  # Number of past data points per training sample
  hidden_size: 32                      # Size of hidden layer in the LSTM
  epochs: 10                           # Number of training iterations
  retrain_interval: 24                 # Retrain model every 24 hours
  reconstruction_error_limit: 0.025    # Threshold for marking anomalies
  save_path: "models/lstm_model.pt"
```

Defines settings for the forecasting module. Used for trend prediction and anomaly correlation with future patterns.
```yaml
prophet_model:
  enabled: true
  forecast_horizon: 60      # Minutes into the future to forecast
  retrain_interval: 12      # Retrain every 12 hours
  changepoint_prior_scale: 0.1
  seasonality_mode: "additive"
```

Defines parameters controlling refresh intervals, display limits, and user-triggered actions.
```yaml
dashboard:
  refresh_interval: 5    # Seconds between metric updates
  show_predictions: true
  max_points: 500
  retrain_button_enabled: true
```

- Threshold tuning (`reconstruction_error_limit`) directly impacts false positives.
- Retraining intervals should be based on the data drift rate and metric volatility.
- Use shorter Prometheus query steps (15s-30s) for more granular anomaly detection.
- Always keep `save_path` on a mounted or persistent volume when running in containers, so retrained models survive restarts.
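Loading and validating this configuration is straightforward with PyYAML. The snippet below is a minimal sketch using an inline excerpt of `user_config.yml` to stay self-contained; the keys mirror the sections documented above.

```python
import yaml

# Inline excerpt of user_config.yml (same keys as documented above)
CONFIG = """
lstm_model:
  enabled: true
  sequence_length: 50
  reconstruction_error_limit: 0.025
dashboard:
  refresh_interval: 5
"""

cfg = yaml.safe_load(CONFIG)
threshold = cfg["lstm_model"]["reconstruction_error_limit"]
print(threshold)  # 0.025
```

Using `yaml.safe_load` (rather than `yaml.load`) avoids executing arbitrary tags from an untrusted config file.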
dashboard.py visualizes live metrics and anomalies in real time.
- Features: choose between Live, Long-Term, and Test modes; start or stop model training directly from the UI; auto-refresh charts every N seconds.
- Endpoints used: `/metrics/live-short`, `/metrics/live-long`, `/metrics/test`
Core Routes:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Welcome message |
| `/metrics/live-short` | GET | Fetch short-term live metrics |
| `/metrics/live-long` | GET | Long-term predictions |
| `/metrics/test` | GET | Test data with artificial spikes |
| `/train` | POST | Retrain LSTM model |
Implementation in app/main.py:
- `get_short_live_data()`: fetch the 20 most recent data points
- `get_long_live_data()`: full-window analysis
- `get_test_data()`: inject anomalies and evaluate the model
File: model_handler.py
The core inference and orchestration layer. Connects data ingestion, ML models (LSTM + Prophet), and backend API.
- Functions: `predict_lstm_from_df()`, `train_lstm_from_df()`, `predict_prophet_from_df()`
- Handles model retraining, persistence, and failure recovery.
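The persistence side of this layer follows a common save/load pattern; the repository ships a `scaler.gz`, which suggests a compressed pickled artifact. Below is a hedged, stdlib-only sketch of that pattern — the `MinMaxScaler1D` class and helper names are hypothetical stand-ins, not the project's actual code (which likely uses a scikit-learn scaler).

```python
import gzip
import os
import pickle
import tempfile

class MinMaxScaler1D:
    """Tiny stand-in scaler (hypothetical -- the project likely uses scikit-learn)."""
    def fit(self, xs):
        self.lo, self.hi = min(xs), max(xs)
        return self
    def transform(self, xs):
        span = (self.hi - self.lo) or 1.0
        return [(x - self.lo) / span for x in xs]

def save_artifact(obj, path):
    """Persist any fitted object as a gzip-compressed pickle (like scaler.gz)."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)

def load_artifact(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), "scaler.gz")
save_artifact(MinMaxScaler1D().fit([10, 20, 30]), path)
scaler = load_artifact(path)
print(scaler.transform([20]))  # [0.5]
```

Keeping the fitted scaler alongside the model matters: inference must apply exactly the scaling the model was trained with, or reconstruction errors become meaningless.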
PulseWatch uses:
- LSTM Autoencoder (LSTMAE) for unsupervised anomaly detection: captures temporal dependencies in multivariate metrics (CPU, memory, latency).
- Facebook Prophet for forecasting: handles seasonality and trends for predictive anomaly detection.
- Why Prophet? Robust, explainable, and efficient compared to deep forecasting models.
Model Interaction Flow:
Prometheus Metrics → Preprocessing → LSTMAE (Anomalies) → Prophet (Forecasts) → Dashboard
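The detection principle is the same regardless of the reconstruction model: points whose reconstruction error exceeds `reconstruction_error_limit` are flagged. As a minimal sketch, the toy below substitutes a moving-average "reconstruction" for the LSTM autoencoder — purely illustrative, not the project's model.

```python
import numpy as np

def detect_anomalies(series, threshold, window=5):
    """Flag indices whose reconstruction error exceeds `threshold`.
    A centered moving average stands in for the LSTM autoencoder's
    reconstruction -- illustrative only."""
    kernel = np.ones(window) / window
    recon = np.convolve(series, kernel, mode="same")
    errors = np.abs(series - recon)
    return np.where(errors > threshold)[0]

signal = np.array([1.0] * 20)
signal[12] = 9.0  # inject a spike
idx = detect_anomalies(signal, threshold=2.0)
print(idx)  # [12]
```

A real LSTMAE replaces the moving average with a learned reconstruction over `sequence_length` windows, but thresholding the per-point error works identically.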
File: app/prometheus.py
- Functions: `fetch_and_merge_all_metrics()` merges CPU, Memory, and Latency into a DataFrame; `prometheus_to_dataframe()` converts JSON to pandas.
- Queries defined in `user_config.yml` with OS-specific variations (Windows/Linux/Darwin).
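The JSON-to-DataFrame step can be sketched as follows. The sample payload matches the documented shape of a Prometheus `/api/v1/query_range` response (a "matrix" result with `[timestamp, value]` pairs, values as strings); the function name and return shape here are assumptions about what `prometheus_to_dataframe()` does, not its actual signature.

```python
import pandas as pd

SAMPLE = {  # shape of a Prometheus /api/v1/query_range response
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {"instance": "localhost:9100"},
            "values": [[1700000000, "12.5"], [1700000030, "13.1"]],
        }],
    },
}

def prometheus_json_to_df(payload, name):
    """Flatten a query_range matrix into a timestamp-indexed DataFrame.
    (Name and return shape are assumptions about app/prometheus.py.)"""
    values = payload["data"]["result"][0]["values"]
    df = pd.DataFrame(values, columns=["timestamp", name])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
    df[name] = df[name].astype(float)  # Prometheus encodes values as strings
    return df.set_index("timestamp")

df = prometheus_json_to_df(SAMPLE, "cpu")
print(df["cpu"].tolist())  # [12.5, 13.1]
```

Note the `astype(float)` cast: Prometheus serializes sample values as strings, so skipping it silently produces object-dtype columns that break downstream math.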
File: test.py
- Key function: `inject_spike_anomalies()` adds random spikes to simulate anomalies.
- Evaluates LSTM and Prophet on modified data; visualizes with Matplotlib.
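The idea behind spike injection can be sketched in a few lines: pick random indices and amplify them, returning both the modified series and the ground-truth indices so detection can be scored. The function name, parameters, and magnitudes below are assumptions for illustration, not `test.py`'s actual implementation.

```python
import numpy as np

def inject_spikes(series, n_spikes=3, magnitude=5.0, seed=42):
    """Return a copy of `series` with `n_spikes` random points amplified,
    plus the sorted indices that were modified. Illustrative stand-in for
    test.py's inject_spike_anomalies() -- parameters are assumptions."""
    rng = np.random.default_rng(seed)
    out = np.asarray(series, dtype=float).copy()
    idx = rng.choice(len(out), size=n_spikes, replace=False)
    out[idx] *= magnitude
    return out, np.sort(idx)

base = np.ones(100)
spiked, idx = inject_spikes(base)
print(len(idx))  # 3
```

Returning the injected indices is what makes the test mode useful: detected anomalies can be compared against known ground truth to estimate precision and recall.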
- Detected anomalies: CPU and memory spikes
- Forecasts: Predictive trends for latency and utilization
- Visualization: Real-time charts with red markers for anomalies
- Integrate database (PostgreSQL / InfluxDB) for persistent storage
- Containerize services (Docker + Kubernetes)
- Add auto-scaling ML model retraining pipeline
- Enhance dashboard with Grafana integration
For more details, refer to the full documentation in the documentation/docs/ folder, built with MkDocs (config in documentation/mkdocs.yml).
