giuto committed on
Commit
94af946
·
1 Parent(s): 1396866

Enhance README with Grafana dashboard configuration, data drift detection details, and usage instructions

  1. monitoring/README.md +150 -0
monitoring/README.md CHANGED
@@ -55,3 +55,153 @@ We used Better Stack Uptime to monitor the availability of the production deploy
55
  - A failure scenario was tested to confirm Better Stack reports the server error details.
56
 
57
  - Screenshots are available in `monitoring/screenshots/`.
58
+
59
+ ---
60
+
61
+ ## Grafana Dashboard
62
+
63
+ Grafana provides real-time visualization of system metrics and drift detection status.
64
+
65
+ ### Configuration
66
+ - **Port**: `3000`
67
+ - **Credentials**: `admin` / `admin`
68
+ - **Dashboard**: Hopcroft Monitoring Dashboard
69
+ - **Datasource**: Prometheus (auto-provisioned)
70
+ - **Provisioning Files**:
71
+ - Datasources: `grafana/provisioning/datasources/prometheus.yml`
72
+ - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
73
+ - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`
74
+
75
+ ### Dashboard Panels
76
+ 1. **API Request Rate**: Rate of incoming requests per endpoint
77
+ 2. **API Latency**: Average response time per endpoint
78
+ 3. **Drift Detection Status**: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
79
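+ 4. **Drift P-Value**: p-value of the Kolmogorov-Smirnov test (smaller values indicate stronger evidence of drift)
79
+ 5. **Drift Distance**: Kolmogorov-Smirnov statistic (the largest gap between the baseline and production empirical distributions)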
81
+
82
+ ### Access
83
+ Navigate to `http://localhost:3000` and log in with the provided credentials. The dashboard refreshes every 10 seconds.
84
+
85
+ ---
86
+
87
+ ## Data Drift Detection
88
+
89
+ Automated detection of distribution shift in the model's input data, based on statistical testing.
90
+
91
+ ### Algorithm
92
+ - **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
93
+ - **Baseline Data**: 1000 samples from training set
94
+ - **Detection Threshold**: p-value < 0.05 with Bonferroni correction (the 0.05 level is divided by the number of tests performed)
95
+ - **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
96
+
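The decision rule in the Algorithm list can be illustrated with a dependency-free sketch. This is not the project's implementation, which the README describes as scipy-based: the function names, the dict-of-lists input layout, and the asymptotic p-value formula are assumptions made for illustration. `scipy.stats.ks_2samp` may return slightly different p-values because it can use an exact method.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the KS distance: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    distance = 0.0
    # The gap only changes at observed values, so evaluating there is sufficient.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        distance = max(distance, abs(cdf_ref - cdf_cur))
    return distance


def ks_p_value(distance, n, m, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * distance
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the true value is ~1
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def detect_drift(baseline, production, alpha=0.05):
    """Run one KS test per shared feature with a Bonferroni-corrected threshold.

    `baseline` and `production` map feature names to lists of numbers
    (a hypothetical layout chosen for this sketch).
    """
    features = sorted(set(baseline) & set(production))
    if not features:
        return {}
    threshold = alpha / len(features)  # Bonferroni: split alpha across the tests
    report = {}
    for name in features:
        d = ks_statistic(baseline[name], production[name])
        p = ks_p_value(d, len(baseline[name]), len(production[name]))
        report[name] = {"distance": d, "p_value": p, "drift": p < threshold}
    return report
```

With ten features, for example, each feature is flagged only when its p-value falls below 0.05 / 10 = 0.005, which keeps the overall false-alarm rate near 5%.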
97
+ ### Scripts
98
+
99
+ #### Baseline Preparation
100
+ **Script**: `drift/scripts/prepare_baseline.py`
101
+
102
+ Functionality:
103
+ - Loads data from SQLite database (`data/raw/skillscope_data.db`)
104
+ - Extracts numeric features only
105
+ - Samples 1000 representative records
106
+ - Saves to `drift/baseline/reference_data.pkl`
107
+
108
+ Usage:
109
+ ```bash
110
+ cd monitoring/drift/scripts
111
+ python prepare_baseline.py
112
+ ```
113
+
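The baseline steps listed above could look roughly like the following standard-library sketch. It is an illustration under stated assumptions, not the contents of `prepare_baseline.py`: the table name `samples`, the helper names, the fixed seed, and the list-of-dicts layout are all hypothetical, and the real script may well use pandas instead.

```python
import pickle
import random
import sqlite3


def load_numeric_rows(conn, table):
    """Read a table and keep only columns whose values are all int or float."""
    # `table` is interpolated as an identifier, so it must come from trusted code.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    numeric = [
        i for i in range(len(columns))
        if rows and all(isinstance(row[i], (int, float)) for row in rows)
    ]  # columns containing text or NULL values are dropped
    return [{columns[i]: row[i] for i in numeric} for row in rows]


def sample_baseline(rows, sample_size=1000, seed=42):
    """Draw a reproducible random sample without replacement."""
    if len(rows) <= sample_size:
        return list(rows)
    return random.Random(seed).sample(rows, sample_size)


def save_baseline(rows, path):
    """Persist the sampled reference data as a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(rows, handle)
```

A caller would open the database with `sqlite3.connect("data/raw/skillscope_data.db")`, pass the connection and the (hypothetical) table name to `load_numeric_rows`, then hand the sampled rows to `save_baseline` with the `drift/baseline/reference_data.pkl` path.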
114
+ #### Drift Detection
115
+ **Script**: `drift/scripts/run_drift_check.py`
116
+
117
+ Functionality:
118
+ - Loads baseline reference data
119
+ - Compares with new production data
120
+ - Performs KS test on each feature
121
+ - Pushes metrics to Pushgateway
122
+ - Saves results to `drift/reports/`
123
+
124
+ Usage:
125
+ ```bash
126
+ cd monitoring/drift/scripts
127
+ python run_drift_check.py
128
+ ```
129
+
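To make the "pushes metrics to Pushgateway" step concrete, here is a minimal sketch that uses Pushgateway's HTTP push endpoint (`/metrics/job/<job>`) with only the standard library. The job name `drift_detection` and the helper names are assumptions; the actual script may rely on the `prometheus_client` library instead.

```python
import urllib.request


def render_metrics(metrics):
    """Render a name -> value mapping as Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"  # the payload must end with a newline


def push_metrics(metrics, gateway="http://localhost:9091", job="drift_detection"):
    """PUT the metrics to Pushgateway, replacing earlier metrics of the same job."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=render_metrics(metrics).encode("utf-8"),
        headers={"Content-Type": "text/plain; version=0.0.4"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `push_metrics({"drift_detected": 1, "drift_p_value": 0.003, "drift_distance": 0.21})` with the stack running should make those series visible at the Pushgateway metrics endpoint.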
130
+ ### Verification
131
+ Check Pushgateway metrics:
132
+ ```bash
133
+ curl http://localhost:9091/metrics | grep drift
134
+ ```
135
+
136
+ Query in Prometheus:
137
+ ```promql
138
+ drift_detected
139
+ drift_p_value
140
+ drift_distance
141
+ ```
142
+
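The same check can be scripted against the Prometheus HTTP API (`/api/v1/query`). The sketch below is illustrative only: the helper names are hypothetical and the base URL assumes the default port used elsewhere in this README.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base_url="http://localhost:9090"):
    """Build an instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_values(payload):
    """Extract float sample values from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries "value": [unix_timestamp, "<value as string>"].
    return [float(item["value"][1]) for item in payload["data"]["result"]]


def query_prometheus(expr, base_url="http://localhost:9090"):
    """Run an instant query and return the sample values as floats."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        return parse_instant_values(json.load(resp))
```

For example, `query_prometheus("drift_detected")` returns an empty list until Prometheus has scraped the Pushgateway at least once after a drift check.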
143
+ ---
144
+
145
+ ## Pushgateway
146
+
147
+ Pushgateway collects metrics from short-lived jobs such as the drift detection script.
148
+
149
+ ### Configuration
150
+ - **Port**: `9091`
151
+ - **Persistence**: Enabled with 5-minute intervals
152
+ - **Data Volume**: `pushgateway-data`
153
+
154
+ ### Metrics Endpoint
155
+ Access metrics at `http://localhost:9091/metrics`
156
+
157
+ ### Integration
158
+ The drift detection script pushes metrics to Pushgateway, which are then scraped by Prometheus and displayed in Grafana.
159
+
160
+ ---
161
+
162
+ ## Alerting
163
+
164
+ Alert rules are defined in `prometheus/alert_rules.yml`:
165
+
166
+ - **High Latency**: Triggered when average latency exceeds 2 seconds
167
+ - **High Error Rate**: Triggered when error rate exceeds 5%
168
+ - **Data Drift Detected**: Triggered when `drift_detected == 1`
169
+
170
+ Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.
171
+
172
+ ---
173
+
174
+ ## Complete Stack Usage
175
+
176
+ ### Starting All Services
177
+ ```bash
178
+ # Start all monitoring services
179
+ docker compose up -d
180
+
181
+ # Verify all containers are running
182
+ docker compose ps
183
+
184
+ # Check Prometheus targets
185
+ curl http://localhost:9090/api/v1/targets
186
+
187
+ # Check Grafana health
188
+ curl http://localhost:3000/api/health
189
+ ```
190
+
191
+ ### Running Drift Detection Workflow
192
+
193
+ 1. **Prepare Baseline (One-time setup)**
194
+ ```bash
195
+ cd monitoring/drift/scripts
196
+ python prepare_baseline.py
197
+ ```
198
+
199
+ 2. **Execute Drift Check**
200
+ ```bash
201
+ python run_drift_check.py
202
+ ```
203
+
204
+ 3. **Verify Results**
205
+ - Check Pushgateway: `http://localhost:9091`
206
+ - Check Prometheus: `http://localhost:9090/graph`
207
+ - Check Grafana: `http://localhost:3000`