Enhance README with Grafana dashboard configuration, data drift detection details, and usage instructions
- monitoring/README.md +150 -0
monitoring/README.md
CHANGED
|
@@ -55,3 +55,153 @@ We used Better Stack Uptime to monitor the availability of the production deploy
|
|
| 55 |
- A failure scenario was tested to confirm Better Stack reports the server error details.
|
| 56 |
|
| 57 |
- Screenshots are available in `monitoring/screenshots/`.
|
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## Grafana Dashboard
|
| 62 |
+
|
| 63 |
+
Grafana provides real-time visualization of system metrics and drift detection status.
|
| 64 |
+
|
| 65 |
+
### Configuration
|
| 66 |
+
- **Port**: `3000`
|
| 67 |
+
- **Credentials**: `admin` / `admin` (Grafana's defaults; change them for anything beyond local use)
|
| 68 |
+
- **Dashboard**: Hopcroft Monitoring Dashboard
|
| 69 |
+
- **Datasource**: Prometheus (auto-provisioned)
|
| 70 |
+
- **Provisioning Files**:
|
| 71 |
+
  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
|
| 72 |
+
  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
|
| 73 |
+
  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`
|
| 74 |
+
|
| 75 |
+
### Dashboard Panels
|
| 76 |
+
1. **API Request Rate**: Rate of incoming requests per endpoint
|
| 77 |
+
2. **API Latency**: Average response time per endpoint
|
| 78 |
+
3. **Drift Detection Status**: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
|
| 79 |
+
4. **Drift P-Value**: p-value of the KS test; lower values indicate stronger evidence of drift
|
| 80 |
+
5. **Drift Distance**: Kolmogorov-Smirnov statistic (the largest gap between the baseline and production empirical CDFs)
|
| 81 |
+
|
| 82 |
+
### Access
|
| 83 |
+
Navigate to `http://localhost:3000` and log in with the credentials above. The dashboard refreshes every 10 seconds.
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
+
|
| 87 |
+
## Data Drift Detection
|
| 88 |
+
|
| 89 |
+
Drift detection runs a statistical test on model input data to flag shifts away from the training distribution.
|
| 90 |
+
|
| 91 |
+
### Algorithm
|
| 92 |
+
- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
|
| 93 |
+
- **Baseline Data**: 1000 samples from training set
|
| 94 |
+
- **Detection Threshold**: p-value < 0.05 with Bonferroni correction (the 0.05 level is divided by the number of features tested)
|
| 95 |
+
- **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
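
The decision rule above can be sketched in plain Python. This is an illustration only, not the code in `run_drift_check.py`: `ks_distance` and `drift_decision` are hypothetical names, and the sketch computes only the KS distance. In a scipy-based setup the per-feature p-values would presumably come from `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_distance(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both ECDFs are step functions, so the gap only changes at observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_decision(p_values, alpha=0.05):
    """Apply a Bonferroni-corrected threshold to per-feature p-values."""
    if not p_values:
        raise ValueError("no features were tested")
    threshold = alpha / len(p_values)  # Bonferroni: split alpha across all tests
    drifted = sorted(name for name, p in p_values.items() if p < threshold)
    return {
        "threshold": threshold,
        "drifted_features": drifted,
        "drift_detected": int(bool(drifted)),
    }
```

With two features tested, a feature needs p < 0.025 to count as drifted, so a p-value of 0.03 that would pass an uncorrected 0.05 threshold no longer raises the flag.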
|
| 96 |
+
|
| 97 |
+
### Scripts
|
| 98 |
+
|
| 99 |
+
#### Baseline Preparation
|
| 100 |
+
**Script**: `drift/scripts/prepare_baseline.py`
|
| 101 |
+
|
| 102 |
+
Functionality:
|
| 103 |
+
- Loads data from SQLite database (`data/raw/skillscope_data.db`)
|
| 104 |
+
- Extracts numeric features only
|
| 105 |
+
- Samples 1000 representative records
|
| 106 |
+
- Saves to `drift/baseline/reference_data.pkl`
|
| 107 |
+
|
| 108 |
+
Usage:
|
| 109 |
+
```bash
|
| 110 |
+
cd monitoring/drift/scripts
|
| 111 |
+
python prepare_baseline.py
|
| 112 |
+
```
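
For orientation, the baseline steps listed above might look roughly like the sketch below. It is not the contents of `prepare_baseline.py`: the function names, the table name passed in, and the fixed seed are assumptions, and it uses only the standard library rather than whatever data tooling the script relies on.

```python
import pickle
import random
import sqlite3


def build_baseline(conn, table, sample_size=1000, seed=42):
    """Return {column: [values]} for the numeric columns of a random row sample."""
    # The table name is interpolated directly, so it must come from trusted config.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()

    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)

    baseline = {}
    for index, name in enumerate(columns):
        values = [row[index] for row in sample if row[index] is not None]
        # Keep a column only if every non-null value is numeric.
        if values and all(isinstance(v, (int, float)) for v in values):
            baseline[name] = values
    return baseline


def save_baseline(baseline, out_path):
    """Pickle the baseline so the drift check can reload it later."""
    with open(out_path, "wb") as handle:
        pickle.dump(baseline, handle)
```

A caller would open the database with `sqlite3.connect("data/raw/skillscope_data.db")`, pass the connection and the real table name to `build_baseline`, and write the result to `drift/baseline/reference_data.pkl` with `save_baseline`.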
|
| 113 |
+
|
| 114 |
+
#### Drift Detection
|
| 115 |
+
**Script**: `drift/scripts/run_drift_check.py`
|
| 116 |
+
|
| 117 |
+
Functionality:
|
| 118 |
+
- Loads baseline reference data
|
| 119 |
+
- Compares with new production data
|
| 120 |
+
- Performs KS test on each feature
|
| 121 |
+
- Pushes metrics to Pushgateway
|
| 122 |
+
- Saves results to `drift/reports/`
|
| 123 |
+
|
| 124 |
+
Usage:
|
| 125 |
+
```bash
|
| 126 |
+
cd monitoring/drift/scripts
|
| 127 |
+
python run_drift_check.py
|
| 128 |
+
```
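
The "pushes metrics to Pushgateway" step can be illustrated with the standard library alone. This is a sketch under assumptions, not the script's actual code: the job name `drift_check` and both function names are hypothetical, and the real script may well use the `prometheus_client` package instead. It relies on Pushgateway accepting the Prometheus text format via `PUT /metrics/job/<job>`.

```python
import time
import urllib.request


def render_metrics(drift_detected, p_value, distance, timestamp=None):
    """Render the four drift gauges in the Prometheus text exposition format."""
    timestamp = time.time() if timestamp is None else timestamp
    gauges = {
        "drift_detected": int(drift_detected),
        "drift_p_value": p_value,
        "drift_distance": distance,
        "drift_check_timestamp": timestamp,
    }
    lines = []
    for name, value in gauges.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"  # the format requires a trailing newline


def push_metrics(payload, gateway="http://localhost:9091", job="drift_check"):
    """PUT the payload to Pushgateway, replacing earlier metrics for this job."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=payload.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Using `PUT` rather than `POST` replaces the whole metric group for the job, which suits a check that reports a complete result on every run.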
|
| 129 |
+
|
| 130 |
+
### Verification
|
| 131 |
+
Check Pushgateway metrics:
|
| 132 |
+
```bash
|
| 133 |
+
curl http://localhost:9091/metrics | grep drift
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
Query in Prometheus:
|
| 137 |
+
```promql
|
| 138 |
+
drift_detected
|
| 139 |
+
drift_p_value
|
| 140 |
+
drift_distance
|
| 141 |
+
```
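
The same queries can be run programmatically through Prometheus's HTTP API (`/api/v1/query`). The sketch below assumes Prometheus listens on `localhost:9090`, as elsewhere in this README; the function names are illustrative, and it only handles instant-vector results.

```python
import json
import urllib.parse
import urllib.request


def query_url(expr, base="http://localhost:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_values(response_body):
    """Return the sample values from an instant-vector query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def instant_query(expr, base="http://localhost:9090"):
    """Run the query against a live Prometheus and return the sample values."""
    with urllib.request.urlopen(query_url(expr, base), timeout=10) as response:
        return extract_values(response.read())
```

For example, `instant_query("drift_detected")` returns an empty list when Prometheus has not scraped the metric yet, which helps distinguish "no data" from "no drift".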
|
| 142 |
+
|
| 143 |
+
---
|
| 144 |
+
|
| 145 |
+
## Pushgateway
|
| 146 |
+
|
| 147 |
+
Pushgateway collects metrics from short-lived jobs such as the drift detection script.
|
| 148 |
+
|
| 149 |
+
### Configuration
|
| 150 |
+
- **Port**: `9091`
|
| 151 |
+
- **Persistence**: Enabled with 5-minute intervals
|
| 152 |
+
- **Data Volume**: `pushgateway-data`
|
| 153 |
+
|
| 154 |
+
### Metrics Endpoint
|
| 155 |
+
Metrics are available at `http://localhost:9091/metrics`.
|
| 156 |
+
|
| 157 |
+
### Integration
|
| 158 |
+
The drift detection script pushes metrics to Pushgateway. Prometheus scrapes them from there, and Grafana displays them.
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
## Alerting
|
| 163 |
+
|
| 164 |
+
Alert rules are defined in `prometheus/alert_rules.yml`:
|
| 165 |
+
|
| 166 |
+
- **High Latency**: Triggered when average latency exceeds 2 seconds
|
| 167 |
+
- **High Error Rate**: Triggered when error rate exceeds 5%
|
| 168 |
+
- **Data Drift Detected**: Triggered when `drift_detected` equals 1
|
| 169 |
+
|
| 170 |
+
Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.
|
| 171 |
+
|
| 172 |
+
---
|
| 173 |
+
|
| 174 |
+
## Complete Stack Usage
|
| 175 |
+
|
| 176 |
+
### Starting All Services
|
| 177 |
+
```bash
|
| 178 |
+
# Start all monitoring services
|
| 179 |
+
docker compose up -d
|
| 180 |
+
|
| 181 |
+
# Verify all containers are running
|
| 182 |
+
docker compose ps
|
| 183 |
+
|
| 184 |
+
# Check Prometheus targets
|
| 185 |
+
curl http://localhost:9090/api/v1/targets
|
| 186 |
+
|
| 187 |
+
# Check Grafana health
|
| 188 |
+
curl http://localhost:3000/api/health
|
| 189 |
+
```
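
As an alternative to running the `curl` checks by hand, a small script could poll every service at once. This is a sketch under assumptions: `/api/health` for Grafana comes from the commands above, while the `/-/healthy` paths for Prometheus, Pushgateway, and Alertmanager are those services' standard health endpoints and assume default routing. The names `ENDPOINTS`, `http_status`, and `check_stack` are illustrative.

```python
import urllib.error
import urllib.request

# Ports follow this README; adjust them if your compose file maps them differently.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "pushgateway": "http://localhost:9091/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url, timeout=5):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the service answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def check_stack(endpoints=ENDPOINTS, fetch=http_status):
    """Map each service name to True when its health endpoint returns 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling `check_stack()` after `docker compose up -d` gives a per-service pass/fail map; the `fetch` parameter exists so the logic can be exercised without a running stack.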
|
| 190 |
+
|
| 191 |
+
### Running Drift Detection Workflow
|
| 192 |
+
|
| 193 |
+
1. **Prepare Baseline (One-time setup)**
|
| 194 |
+
```bash
|
| 195 |
+
cd monitoring/drift/scripts
|
| 196 |
+
python prepare_baseline.py
|
| 197 |
+
```
|
| 198 |
+
|
| 199 |
+
2. **Execute Drift Check**
|
| 200 |
+
```bash
|
| 201 |
+
python run_drift_check.py
|
| 202 |
+
```
|
| 203 |
+
|
| 204 |
+
3. **Verify Results**
|
| 205 |
+
- Check Pushgateway: `http://localhost:9091`
|
| 206 |
+
- Check Prometheus: `http://localhost:9090/graph`
|
| 207 |
+
- Check Grafana: `http://localhost:3000`
|