Merge pull request #33 from se4ai2526-uniba/Milestone-6
Browse files- .gitattributes +1 -0
- Dockerfile +3 -0
- README.md +42 -0
- docker-compose.yml +73 -0
- hopcroft_skill_classification_tool_competition/main.py +77 -2
- monitoring/README.md +207 -0
- monitoring/alertmanager/config.yml +21 -0
- monitoring/drift/scripts/prepare_baseline.py +119 -0
- monitoring/drift/scripts/requirements.txt +6 -0
- monitoring/drift/scripts/run_drift_check.py +216 -0
- monitoring/grafana/dashboards/hopcroft_dashboard.json +358 -0
- monitoring/grafana/provisioning/dashboards/dashboard.yml +13 -0
- monitoring/grafana/provisioning/dashboards/hopcroft_dashboard.json +358 -0
- monitoring/grafana/provisioning/datasources/prometheus.yml +14 -0
- monitoring/locust/README.md +101 -0
- monitoring/locust/locustfile.py +99 -0
- monitoring/prometheus/alert_rules.yml +32 -0
- monitoring/prometheus/prometheus.yml +32 -0
- monitoring/screenshots/incident acknowlege mail.png +3 -0
- monitoring/screenshots/incident acknowlege.png +3 -0
- monitoring/screenshots/incident mail.png +3 -0
- monitoring/screenshots/incident resolved mail.png +3 -0
- monitoring/screenshots/incident resolved.png +3 -0
- monitoring/screenshots/incident.png +3 -0
- monitoring/screenshots/monitors.png +3 -0
- nginx.conf +92 -0
- reports/alerting_test_report/alerting_report.md +31 -0
- reports/alerting_test_report/alertmanager_firing.png +3 -0
- reports/alerting_test_report/prometheus_firing.png +3 -0
- requirements.txt +3 -0
- scripts/start_space.sh +65 -11
.gitattributes
CHANGED
|
@@ -1 +1,2 @@
|
|
|
|
|
| 1 |
*.png filter=lfs diff=lfs merge=lfs -text
|
|
|
|
| 1 |
+
docs/img/*.png filter=lfs diff=lfs merge=lfs -text
|
| 2 |
*.png filter=lfs diff=lfs merge=lfs -text
|
Dockerfile
CHANGED
|
@@ -11,6 +11,9 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
|
|
| 11 |
RUN apt-get update && apt-get install -y \
|
| 12 |
git \
|
| 13 |
dos2unix \
|
|
|
|
|
|
|
|
|
|
| 14 |
&& rm -rf /var/lib/apt/lists/*
|
| 15 |
|
| 16 |
# Create a non-root user
|
|
|
|
| 11 |
RUN apt-get update && apt-get install -y \
|
| 12 |
git \
|
| 13 |
dos2unix \
|
| 14 |
+
nginx \
|
| 15 |
+
procps \
|
| 16 |
+
curl \
|
| 17 |
&& rm -rf /var/lib/apt/lists/*
|
| 18 |
|
| 19 |
# Create a non-root user
|
README.md
CHANGED
|
@@ -5,6 +5,7 @@ colorFrom: blue
|
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
# Hopcroft_Skill-Classification-Tool-Competition
|
|
@@ -477,6 +478,47 @@ docker-compose down
|
|
| 477 |
```
|
| 478 |
|
| 479 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 480 |
## Demo UI (Streamlit)
|
| 481 |
|
| 482 |
The Streamlit GUI provides an interactive web interface for the skill classification API.
|
|
|
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
| 8 |
+
api_docs_url: /docs
|
| 9 |
---
|
| 10 |
|
| 11 |
# Hopcroft_Skill-Classification-Tool-Competition
|
|
|
|
| 478 |
```
|
| 479 |
|
| 480 |
|
| 481 |
+
--------
|
| 482 |
+
|
| 483 |
+
## Hugging Face Spaces Deployment
|
| 484 |
+
|
| 485 |
+
This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.
|
| 486 |
+
|
| 487 |
+
### 1. Setup Space
|
| 488 |
+
1. Create a new Space on Hugging Face.
|
| 489 |
+
2. Select **Docker** as the SDK.
|
| 490 |
+
3. Choose the **Blank** template or upload your code.
|
| 491 |
+
|
| 492 |
+
### 2. Configure Secrets
|
| 493 |
+
To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:
|
| 494 |
+
|
| 495 |
+
| Name | Type | Description |
|
| 496 |
+
|------|------|-------------|
|
| 497 |
+
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
|
| 498 |
+
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |
|
| 499 |
+
|
| 500 |
+
> [!IMPORTANT]
|
| 501 |
+
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.
|
| 502 |
+
|
| 503 |
+
### 3. Automated Startup
|
| 504 |
+
The deployment follows this automated flow:
|
| 505 |
+
1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
|
| 506 |
+
2. **scripts/start_space.sh**:
|
| 507 |
+
- Configures DVC with your secrets.
|
| 508 |
+
- Pulls models from the DagsHub remote.
|
| 509 |
+
- Starts the **FastAPI** backend (port 8000).
|
| 510 |
+
- Starts the **Streamlit** frontend (port 8501).
|
| 511 |
+
- Starts **Nginx** (port 7860) as a reverse proxy to route traffic.
|
| 512 |
+
|
| 513 |
+
### 4. Direct Access
|
| 514 |
+
Once deployed, your Space will be available at:
|
| 515 |
+
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`
|
| 516 |
+
|
| 517 |
+
The API documentation will be accessible at:
|
| 518 |
+
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
|
| 519 |
+
|
| 520 |
+
--------
|
| 521 |
+
|
| 522 |
## Demo UI (Streamlit)
|
| 523 |
|
| 524 |
The Streamlit GUI provides an interactive web interface for the skill classification API.
|
docker-compose.yml
CHANGED
|
@@ -47,6 +47,75 @@ services:
|
|
| 47 |
condition: service_healthy
|
| 48 |
restart: unless-stopped
|
| 49 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 50 |
networks:
|
| 51 |
hopcroft-net:
|
| 52 |
driver: bridge
|
|
@@ -54,3 +123,7 @@ networks:
|
|
| 54 |
volumes:
|
| 55 |
hopcroft-logs:
|
| 56 |
driver: local
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
condition: service_healthy
|
| 48 |
restart: unless-stopped
|
| 49 |
|
| 50 |
+
prometheus:
|
| 51 |
+
image: prom/prometheus:latest
|
| 52 |
+
container_name: prometheus
|
| 53 |
+
volumes:
|
| 54 |
+
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
|
| 55 |
+
- ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
|
| 56 |
+
ports:
|
| 57 |
+
- "9090:9090"
|
| 58 |
+
networks:
|
| 59 |
+
- hopcroft-net
|
| 60 |
+
depends_on:
|
| 61 |
+
- alertmanager
|
| 62 |
+
restart: unless-stopped
|
| 63 |
+
|
| 64 |
+
alertmanager:
|
| 65 |
+
image: prom/alertmanager:latest
|
| 66 |
+
container_name: alertmanager
|
| 67 |
+
volumes:
|
| 68 |
+
- ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
|
| 69 |
+
ports:
|
| 70 |
+
- "9093:9093"
|
| 71 |
+
networks:
|
| 72 |
+
- hopcroft-net
|
| 73 |
+
restart: unless-stopped
|
| 74 |
+
|
| 75 |
+
grafana:
|
| 76 |
+
image: grafana/grafana:latest
|
| 77 |
+
container_name: grafana
|
| 78 |
+
ports:
|
| 79 |
+
- "3000:3000"
|
| 80 |
+
environment:
|
| 81 |
+
- GF_SECURITY_ADMIN_USER=admin
|
| 82 |
+
- GF_SECURITY_ADMIN_PASSWORD=admin
|
| 83 |
+
- GF_USERS_ALLOW_SIGN_UP=false
|
| 84 |
+
- GF_SERVER_ROOT_URL=http://localhost:3000
|
| 85 |
+
volumes:
|
| 86 |
+
# Provisioning: auto-configure datasources and dashboards
|
| 87 |
+
- ./monitoring/grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
|
| 88 |
+
- ./monitoring/grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
|
| 89 |
+
- ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
|
| 90 |
+
# Persistent storage for Grafana data
|
| 91 |
+
- grafana-data:/var/lib/grafana
|
| 92 |
+
networks:
|
| 93 |
+
- hopcroft-net
|
| 94 |
+
depends_on:
|
| 95 |
+
- prometheus
|
| 96 |
+
restart: unless-stopped
|
| 97 |
+
healthcheck:
|
| 98 |
+
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
|
| 99 |
+
interval: 30s
|
| 100 |
+
timeout: 10s
|
| 101 |
+
retries: 3
|
| 102 |
+
|
| 103 |
+
pushgateway:
|
| 104 |
+
image: prom/pushgateway:latest
|
| 105 |
+
container_name: pushgateway
|
| 106 |
+
ports:
|
| 107 |
+
- "9091:9091"
|
| 108 |
+
networks:
|
| 109 |
+
- hopcroft-net
|
| 110 |
+
restart: unless-stopped
|
| 111 |
+
command:
|
| 112 |
+
- '--web.listen-address=:9091'
|
| 113 |
+
- '--persistence.file=/data/pushgateway.data'
|
| 114 |
+
- '--persistence.interval=5m'
|
| 115 |
+
volumes:
|
| 116 |
+
- pushgateway-data:/data
|
| 117 |
+
|
| 118 |
+
|
| 119 |
networks:
|
| 120 |
hopcroft-net:
|
| 121 |
driver: bridge
|
|
|
|
| 123 |
volumes:
|
| 124 |
hopcroft-logs:
|
| 125 |
driver: local
|
| 126 |
+
grafana-data:
|
| 127 |
+
driver: local
|
| 128 |
+
pushgateway-data:
|
| 129 |
+
driver: local
|
hopcroft_skill_classification_tool_competition/main.py
CHANGED
|
@@ -22,9 +22,17 @@ import os
|
|
| 22 |
import time
|
| 23 |
from typing import List
|
| 24 |
|
| 25 |
-
from fastapi import FastAPI, HTTPException, status
|
| 26 |
from fastapi.responses import JSONResponse, RedirectResponse
|
| 27 |
import mlflow
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 28 |
from pydantic import ValidationError
|
| 29 |
|
| 30 |
from hopcroft_skill_classification_tool_competition.api_models import (
|
|
@@ -40,6 +48,34 @@ from hopcroft_skill_classification_tool_competition.api_models import (
|
|
| 40 |
from hopcroft_skill_classification_tool_competition.config import MLFLOW_CONFIG
|
| 41 |
from hopcroft_skill_classification_tool_competition.modeling.predict import SkillPredictor
|
| 42 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
predictor = None
|
| 44 |
model_version = "1.0.0"
|
| 45 |
|
|
@@ -85,6 +121,43 @@ app = FastAPI(
|
|
| 85 |
)
|
| 86 |
|
| 87 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 88 |
@app.get("/", tags=["Root"])
|
| 89 |
async def root():
|
| 90 |
"""Return basic API information."""
|
|
@@ -143,9 +216,11 @@ async def predict_skills(issue: IssueInput) -> PredictionRecord:
|
|
| 143 |
|
| 144 |
# Combine text fields if needed, or just use issue_text
|
| 145 |
# The predictor expects a single string
|
|
|
|
| 146 |
full_text = f"{issue.issue_text} {issue.issue_description or ''} {issue.repo_name or ''}"
|
| 147 |
|
| 148 |
-
|
|
|
|
| 149 |
|
| 150 |
# Convert to Pydantic models
|
| 151 |
predictions = [
|
|
|
|
| 22 |
import time
|
| 23 |
from typing import List
|
| 24 |
|
| 25 |
+
from fastapi import FastAPI, HTTPException, status, Request, Response
|
| 26 |
from fastapi.responses import JSONResponse, RedirectResponse
|
| 27 |
import mlflow
|
| 28 |
+
from prometheus_client import (
|
| 29 |
+
CONTENT_TYPE_LATEST,
|
| 30 |
+
Counter,
|
| 31 |
+
Gauge,
|
| 32 |
+
Histogram,
|
| 33 |
+
Summary,
|
| 34 |
+
generate_latest,
|
| 35 |
+
)
|
| 36 |
from pydantic import ValidationError
|
| 37 |
|
| 38 |
from hopcroft_skill_classification_tool_competition.api_models import (
|
|
|
|
| 48 |
from hopcroft_skill_classification_tool_competition.config import MLFLOW_CONFIG
|
| 49 |
from hopcroft_skill_classification_tool_competition.modeling.predict import SkillPredictor
|
| 50 |
|
| 51 |
+
# Define Prometheus Metrics
|
| 52 |
+
# Counter: Total number of requests
|
| 53 |
+
REQUESTS_TOTAL = Counter(
|
| 54 |
+
"hopcroft_requests_total",
|
| 55 |
+
"Total number of requests",
|
| 56 |
+
["method", "endpoint", "http_status"],
|
| 57 |
+
)
|
| 58 |
+
|
| 59 |
+
# Histogram: Request duration
|
| 60 |
+
REQUEST_DURATION_SECONDS = Histogram(
|
| 61 |
+
"hopcroft_request_duration_seconds",
|
| 62 |
+
"Request duration in seconds",
|
| 63 |
+
["method", "endpoint"],
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
# Gauge: In-progress requests
|
| 67 |
+
IN_PROGRESS_REQUESTS = Gauge(
|
| 68 |
+
"hopcroft_in_progress_requests",
|
| 69 |
+
"Number of requests currently in progress",
|
| 70 |
+
["method", "endpoint"],
|
| 71 |
+
)
|
| 72 |
+
|
| 73 |
+
# Summary: Model prediction time
|
| 74 |
+
MODEL_PREDICTION_SECONDS = Summary(
|
| 75 |
+
"hopcroft_prediction_processing_seconds",
|
| 76 |
+
"Time spent processing model predictions",
|
| 77 |
+
)
|
| 78 |
+
|
| 79 |
predictor = None
|
| 80 |
model_version = "1.0.0"
|
| 81 |
|
|
|
|
| 121 |
)
|
| 122 |
|
| 123 |
|
| 124 |
+
@app.middleware("http")
|
| 125 |
+
async def monitor_requests(request: Request, call_next):
|
| 126 |
+
"""Middleware to collect Prometheus metrics for each request."""
|
| 127 |
+
method = request.method
|
| 128 |
+
# Use a simplified path or template if possible to avoid high cardinality
|
| 129 |
+
# For now, using request.url.path is acceptable for this scale
|
| 130 |
+
endpoint = request.url.path
|
| 131 |
+
|
| 132 |
+
IN_PROGRESS_REQUESTS.labels(method=method, endpoint=endpoint).inc()
|
| 133 |
+
start_time = time.time()
|
| 134 |
+
|
| 135 |
+
try:
|
| 136 |
+
response = await call_next(request)
|
| 137 |
+
status_code = response.status_code
|
| 138 |
+
REQUESTS_TOTAL.labels(
|
| 139 |
+
method=method, endpoint=endpoint, http_status=status_code
|
| 140 |
+
).inc()
|
| 141 |
+
return response
|
| 142 |
+
except Exception as e:
|
| 143 |
+
REQUESTS_TOTAL.labels(
|
| 144 |
+
method=method, endpoint=endpoint, http_status=500
|
| 145 |
+
).inc()
|
| 146 |
+
raise e
|
| 147 |
+
finally:
|
| 148 |
+
duration = time.time() - start_time
|
| 149 |
+
REQUEST_DURATION_SECONDS.labels(method=method, endpoint=endpoint).observe(
|
| 150 |
+
duration
|
| 151 |
+
)
|
| 152 |
+
IN_PROGRESS_REQUESTS.labels(method=method, endpoint=endpoint).dec()
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
@app.get("/metrics", tags=["Observability"])
|
| 156 |
+
async def metrics():
|
| 157 |
+
"""Expose Prometheus metrics."""
|
| 158 |
+
return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
|
| 159 |
+
|
| 160 |
+
|
| 161 |
@app.get("/", tags=["Root"])
|
| 162 |
async def root():
|
| 163 |
"""Return basic API information."""
|
|
|
|
| 216 |
|
| 217 |
# Combine text fields if needed, or just use issue_text
|
| 218 |
# The predictor expects a single string
|
| 219 |
+
# The predictor expects a single string
|
| 220 |
full_text = f"{issue.issue_text} {issue.issue_description or ''} {issue.repo_name or ''}"
|
| 221 |
|
| 222 |
+
with MODEL_PREDICTION_SECONDS.time():
|
| 223 |
+
predictions_data = predictor.predict(full_text)
|
| 224 |
|
| 225 |
# Convert to Pydantic models
|
| 226 |
predictions = [
|
monitoring/README.md
ADDED
|
@@ -0,0 +1,207 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Metrics Collection & Verification
|
| 2 |
+
|
| 3 |
+
This directory contains the configuration for Prometheus monitoring.
|
| 4 |
+
|
| 5 |
+
## Configuration
|
| 6 |
+
- **Prometheus Config**: `prometheus/prometheus.yml`
|
| 7 |
+
- **Scrape Target**: `hopcroft-api:8080`
|
| 8 |
+
- **Metrics Endpoint**: `http://localhost:8080/metrics`
|
| 9 |
+
|
| 10 |
+
## Verification Queries (PromQL)
|
| 11 |
+
|
| 12 |
+
You can run these queries in the Prometheus Expression Browser (`http://localhost:9090/graph`):
|
| 13 |
+
|
| 14 |
+
### 1. Request Rate (Counter)
|
| 15 |
+
Shows the rate of requests per second over the last minute.
|
| 16 |
+
```promql
|
| 17 |
+
rate(hopcroft_requests_total[1m])
|
| 18 |
+
```
|
| 19 |
+
|
| 20 |
+
### 2. Average Request Duration (Histogram)
|
| 21 |
+
Calculates average latency.
|
| 22 |
+
```promql
|
| 23 |
+
rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
### 3. Current In-Progress Requests (Gauge)
|
| 27 |
+
Shows how many requests are currently being processed.
|
| 28 |
+
```promql
|
| 29 |
+
hopcroft_in_progress_requests
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
### 4. Model Prediction Time (Summary)
|
| 33 |
+
Shows the 90th percentile of model prediction time.
|
| 34 |
+
```promql
|
| 35 |
+
hopcroft_prediction_processing_seconds{quantile="0.9"}
|
| 36 |
+
```
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## Uptime Monitoring (Better Stack)
|
| 41 |
+
|
| 42 |
+
We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.
|
| 43 |
+
|
| 44 |
+
**Base URL**
|
| 45 |
+
- https://dacrow13-hopcroft-skill-classification.hf.space
|
| 46 |
+
|
| 47 |
+
**Monitored endpoints**
|
| 48 |
+
- https://dacrow13-hopcroft-skill-classification.hf.space/health
|
| 49 |
+
- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
|
| 50 |
+
- https://dacrow13-hopcroft-skill-classification.hf.space/docs
|
| 51 |
+
|
| 52 |
+
**Checks and alerts**
|
| 53 |
+
- Monitors are configured to run from multiple locations.
|
| 54 |
+
- Email notifications are enabled for failures.
|
| 55 |
+
- A failure scenario was tested to confirm Better Stack reports the server error details.
|
| 56 |
+
|
| 57 |
+
- Screenshots are available in `monitoring/screenshots/`.
|
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## Grafana Dashboard
|
| 62 |
+
|
| 63 |
+
Grafana provides real-time visualization of system metrics and drift detection status.
|
| 64 |
+
|
| 65 |
+
### Configuration
|
| 66 |
+
- **Port**: `3000`
|
| 67 |
+
- **Credentials**: `admin` / `admin`
|
| 68 |
+
- **Dashboard**: Hopcroft Monitoring Dashboard
|
| 69 |
+
- **Datasource**: Prometheus (auto-provisioned)
|
| 70 |
+
- **Provisioning Files**:
|
| 71 |
+
- Datasources: `grafana/provisioning/datasources/prometheus.yml`
|
| 72 |
+
- Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
|
| 73 |
+
- Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`
|
| 74 |
+
|
| 75 |
+
### Dashboard Panels
|
| 76 |
+
1. **API Request Rate**: Rate of incoming requests per endpoint
|
| 77 |
+
2. **API Latency**: Average response time per endpoint
|
| 78 |
+
3. **Drift Detection Status**: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
|
| 79 |
+
4. **Drift P-Value**: Statistical significance of detected drift
|
| 80 |
+
5. **Drift Distance**: Kolmogorov-Smirnov distance metric
|
| 81 |
+
|
| 82 |
+
### Access
|
| 83 |
+
Navigate to `http://localhost:3000` and login with the provided credentials. The dashboard refreshes every 10 seconds.
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
+
|
| 87 |
+
## Data Drift Detection
|
| 88 |
+
|
| 89 |
+
Automated distribution shift detection using statistical testing to monitor model input data quality.
|
| 90 |
+
|
| 91 |
+
### Algorithm
|
| 92 |
+
- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
|
| 93 |
+
- **Baseline Data**: 1000 samples from training set
|
| 94 |
+
- **Detection Threshold**: p-value < 0.05 (with Bonferroni correction)
|
| 95 |
+
- **Metrics Published**: drift_detected, drift_p_value, drift_distance, drift_check_timestamp
|
| 96 |
+
|
| 97 |
+
### Scripts
|
| 98 |
+
|
| 99 |
+
#### Baseline Preparation
|
| 100 |
+
**Script**: `drift/scripts/prepare_baseline.py`
|
| 101 |
+
|
| 102 |
+
Functionality:
|
| 103 |
+
- Loads data from SQLite database (`data/raw/skillscope_data.db`)
|
| 104 |
+
- Extracts numeric features only
|
| 105 |
+
- Samples 1000 representative records
|
| 106 |
+
- Saves to `drift/baseline/reference_data.pkl`
|
| 107 |
+
|
| 108 |
+
Usage:
|
| 109 |
+
```bash
|
| 110 |
+
cd monitoring/drift/scripts
|
| 111 |
+
python prepare_baseline.py
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
#### Drift Detection
|
| 115 |
+
**Script**: `drift/scripts/run_drift_check.py`
|
| 116 |
+
|
| 117 |
+
Functionality:
|
| 118 |
+
- Loads baseline reference data
|
| 119 |
+
- Compares with new production data
|
| 120 |
+
- Performs KS test on each feature
|
| 121 |
+
- Pushes metrics to Pushgateway
|
| 122 |
+
- Saves results to `drift/reports/`
|
| 123 |
+
|
| 124 |
+
Usage:
|
| 125 |
+
```bash
|
| 126 |
+
cd monitoring/drift/scripts
|
| 127 |
+
python run_drift_check.py
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
### Verification
|
| 131 |
+
Check Pushgateway metrics:
|
| 132 |
+
```bash
|
| 133 |
+
curl http://localhost:9091/metrics | grep drift
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
Query in Prometheus:
|
| 137 |
+
```promql
|
| 138 |
+
drift_detected
|
| 139 |
+
drift_p_value
|
| 140 |
+
drift_distance
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
---
|
| 144 |
+
|
| 145 |
+
## Pushgateway
|
| 146 |
+
|
| 147 |
+
Pushgateway collects metrics from short-lived jobs such as the drift detection script.
|
| 148 |
+
|
| 149 |
+
### Configuration
|
| 150 |
+
- **Port**: `9091`
|
| 151 |
+
- **Persistence**: Enabled with 5-minute intervals
|
| 152 |
+
- **Data Volume**: `pushgateway-data`
|
| 153 |
+
|
| 154 |
+
### Metrics Endpoint
|
| 155 |
+
Access metrics at `http://localhost:9091/metrics`
|
| 156 |
+
|
| 157 |
+
### Integration
|
| 158 |
+
The drift detection script pushes metrics to Pushgateway, which are then scraped by Prometheus and displayed in Grafana.
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
## Alerting
|
| 163 |
+
|
| 164 |
+
Alert rules are defined in `prometheus/alert_rules.yml`:
|
| 165 |
+
|
| 166 |
+
- **High Latency**: Triggered when average latency exceeds 2 seconds
|
| 167 |
+
- **High Error Rate**: Triggered when error rate exceeds 5%
|
| 168 |
+
- **Data Drift Detected**: Triggered when drift_detected = 1
|
| 169 |
+
|
| 170 |
+
Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.
|
| 171 |
+
|
| 172 |
+
---
|
| 173 |
+
|
| 174 |
+
## Complete Stack Usage
|
| 175 |
+
|
| 176 |
+
### Starting All Services
|
| 177 |
+
```bash
|
| 178 |
+
# Start all monitoring services
|
| 179 |
+
docker compose up -d
|
| 180 |
+
|
| 181 |
+
# Verify all containers are running
|
| 182 |
+
docker compose ps
|
| 183 |
+
|
| 184 |
+
# Check Prometheus targets
|
| 185 |
+
curl http://localhost:9090/targets
|
| 186 |
+
|
| 187 |
+
# Check Grafana health
|
| 188 |
+
curl http://localhost:3000/api/health
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
### Running Drift Detection Workflow
|
| 192 |
+
|
| 193 |
+
1. **Prepare Baseline (One-time setup)**
|
| 194 |
+
```bash
|
| 195 |
+
cd monitoring/drift/scripts
|
| 196 |
+
python prepare_baseline.py
|
| 197 |
+
```
|
| 198 |
+
|
| 199 |
+
2. **Execute Drift Check**
|
| 200 |
+
```bash
|
| 201 |
+
python run_drift_check.py
|
| 202 |
+
```
|
| 203 |
+
|
| 204 |
+
3. **Verify Results**
|
| 205 |
+
- Check Pushgateway: `http://localhost:9091`
|
| 206 |
+
- Check Prometheus: `http://localhost:9090/graph`
|
| 207 |
+
- Check Grafana: `http://localhost:3000`
|
monitoring/alertmanager/config.yml
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
global:
|
| 2 |
+
resolve_timeout: 5m
|
| 3 |
+
|
| 4 |
+
route:
|
| 5 |
+
group_by: ['alertname', 'severity']
|
| 6 |
+
group_wait: 10s
|
| 7 |
+
group_interval: 10s
|
| 8 |
+
repeat_interval: 1h
|
| 9 |
+
receiver: 'log-receiver'
|
| 10 |
+
|
| 11 |
+
receivers:
|
| 12 |
+
- name: 'log-receiver'
|
| 13 |
+
webhook_configs:
|
| 14 |
+
- url: 'http://hopcroft-api:8080/health'
|
| 15 |
+
|
| 16 |
+
inhibition_rules:
|
| 17 |
+
- source_match:
|
| 18 |
+
severity: 'critical'
|
| 19 |
+
target_match:
|
| 20 |
+
severity: 'warning'
|
| 21 |
+
equal: ['alertname', 'dev', 'instance']
|
monitoring/drift/scripts/prepare_baseline.py
ADDED
|
@@ -0,0 +1,119 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Prepare baseline/reference data for drift detection.
|
| 3 |
+
This script samples representative data from the training set.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import pickle
|
| 7 |
+
import pandas as pd
|
| 8 |
+
import numpy as np
|
| 9 |
+
import sqlite3
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from sklearn.model_selection import train_test_split
|
| 12 |
+
|
| 13 |
+
# Paths
|
| 14 |
+
PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
|
| 15 |
+
BASELINE_DIR = Path(__file__).parent.parent / "baseline"
|
| 16 |
+
BASELINE_DIR.mkdir(parents=True, exist_ok=True)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def load_training_data():
|
| 20 |
+
"""Load the original training dataset from SQLite database."""
|
| 21 |
+
# Load from SQLite database
|
| 22 |
+
db_path = PROJECT_ROOT / "data" / "raw" / "skillscope_data.db"
|
| 23 |
+
|
| 24 |
+
if not db_path.exists():
|
| 25 |
+
raise FileNotFoundError(f"Database not found at {db_path}")
|
| 26 |
+
|
| 27 |
+
print(f"Loading data from database: {db_path}")
|
| 28 |
+
conn = sqlite3.connect(db_path)
|
| 29 |
+
|
| 30 |
+
# Load from the main table
|
| 31 |
+
query = "SELECT * FROM nlbse_tool_competition_data_by_issue LIMIT 10000"
|
| 32 |
+
df = pd.read_sql_query(query, conn)
|
| 33 |
+
conn.close()
|
| 34 |
+
|
| 35 |
+
print(f"Loaded {len(df)} training samples")
|
| 36 |
+
return df
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def prepare_baseline(df, sample_size=1000, random_state=42):
|
| 40 |
+
"""
|
| 41 |
+
Sample representative baseline data.
|
| 42 |
+
|
| 43 |
+
Args:
|
| 44 |
+
df: Training dataframe
|
| 45 |
+
sample_size: Number of samples for baseline
|
| 46 |
+
random_state: Random seed for reproducibility
|
| 47 |
+
|
| 48 |
+
Returns:
|
| 49 |
+
Baseline dataframe
|
| 50 |
+
"""
|
| 51 |
+
# Stratified sampling if you have labels
|
| 52 |
+
if 'label' in df.columns:
|
| 53 |
+
_, baseline_df = train_test_split(
|
| 54 |
+
df,
|
| 55 |
+
test_size=sample_size,
|
| 56 |
+
random_state=random_state,
|
| 57 |
+
stratify=df['label']
|
| 58 |
+
)
|
| 59 |
+
else:
|
| 60 |
+
baseline_df = df.sample(n=min(sample_size, len(df)), random_state=random_state)
|
| 61 |
+
|
| 62 |
+
print(f"Sampled {len(baseline_df)} baseline samples")
|
| 63 |
+
return baseline_df
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def extract_features(df):
|
| 67 |
+
"""
|
| 68 |
+
Extract features used for drift detection.
|
| 69 |
+
Should match the features used by your model.
|
| 70 |
+
"""
|
| 71 |
+
|
| 72 |
+
# Select only numeric columns, exclude labels and IDs
|
| 73 |
+
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
|
| 74 |
+
exclude_cols = ['label', 'id', 'timestamp', 'issue_id', 'file_id', 'method_id', 'class_id']
|
| 75 |
+
feature_columns = [col for col in numeric_cols if col not in exclude_cols]
|
| 76 |
+
|
| 77 |
+
X = df[feature_columns].values
|
| 78 |
+
|
| 79 |
+
print(f"Extracted {X.shape[1]} numeric features from {X.shape[0]} samples")
|
| 80 |
+
return X
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def save_baseline(baseline_data, filename="reference_data.pkl"):
|
| 84 |
+
"""Save baseline data to disk."""
|
| 85 |
+
baseline_path = BASELINE_DIR / filename
|
| 86 |
+
|
| 87 |
+
with open(baseline_path, 'wb') as f:
|
| 88 |
+
pickle.dump(baseline_data, f)
|
| 89 |
+
|
| 90 |
+
print(f"Baseline saved to {baseline_path}")
|
| 91 |
+
print(f" Shape: {baseline_data.shape}")
|
| 92 |
+
print(f" Size: {baseline_path.stat().st_size / 1024:.2f} KB")
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
def main():
|
| 96 |
+
"""Main execution."""
|
| 97 |
+
print("=" * 60)
|
| 98 |
+
print("Preparing Baseline Data for Drift Detection")
|
| 99 |
+
print("=" * 60)
|
| 100 |
+
|
| 101 |
+
# Load data
|
| 102 |
+
df = load_training_data()
|
| 103 |
+
|
| 104 |
+
# Sample baseline
|
| 105 |
+
baseline_df = prepare_baseline(df, sample_size=1000)
|
| 106 |
+
|
| 107 |
+
# Extract features
|
| 108 |
+
X_baseline = extract_features(baseline_df)
|
| 109 |
+
|
| 110 |
+
# Save
|
| 111 |
+
save_baseline(X_baseline)
|
| 112 |
+
|
| 113 |
+
print("\n" + "=" * 60)
|
| 114 |
+
print("Baseline preparation complete!")
|
| 115 |
+
print("=" * 60)
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
if __name__ == "__main__":
|
| 119 |
+
main()
|
monitoring/drift/scripts/requirements.txt
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
alibi-detect>=0.11.4
|
| 2 |
+
pandas>=2.0.0
|
| 3 |
+
numpy>=1.24.0
|
| 4 |
+
scikit-learn>=1.3.0
|
| 5 |
+
requests>=2.31.0
|
| 6 |
+
mlflow>=2.8.0
|
monitoring/drift/scripts/run_drift_check.py
ADDED
|
@@ -0,0 +1,216 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Data Drift Detection using Scipy KS Test.
|
| 3 |
+
Detects distribution shifts between baseline and new data.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import pickle
|
| 7 |
+
import json
|
| 8 |
+
import requests
|
| 9 |
+
import numpy as np
|
| 10 |
+
import pandas as pd
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from datetime import datetime
|
| 13 |
+
from scipy.stats import ks_2samp
|
| 14 |
+
from typing import Dict, Tuple
|
| 15 |
+
|
| 16 |
+
# --- Configuration ---------------------------------------------------------
# Filesystem layout: this script lives in <repo>/monitoring/drift/scripts/.
_SCRIPT_DIR = Path(__file__).parent
_DRIFT_DIR = _SCRIPT_DIR.parent

PROJECT_ROOT = _SCRIPT_DIR.parent.parent.parent   # repository root
BASELINE_DIR = _DRIFT_DIR / "baseline"            # reference data produced by prepare_baseline.py
REPORTS_DIR = _DRIFT_DIR / "reports"              # JSON drift reports are written here
REPORTS_DIR.mkdir(parents=True, exist_ok=True)    # make sure the output dir exists up front

# Pushgateway endpoint used to expose short-lived job metrics to Prometheus.
PUSHGATEWAY_URL = "http://localhost:9091"
# Significance level for the KS test (Bonferroni-adjusted per feature later).
P_VALUE_THRESHOLD = 0.05
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def load_baseline() -> np.ndarray:
    """Load the pickled reference (baseline) feature matrix.

    Returns:
        The baseline array previously produced by ``prepare_baseline.py``.

    Raises:
        FileNotFoundError: when the baseline pickle has not been generated yet.
    """
    path = BASELINE_DIR / "reference_data.pkl"

    if not path.exists():
        raise FileNotFoundError(
            f"Baseline data not found at {path}\n"
            f"Run `python prepare_baseline.py` first!"
        )

    with path.open('rb') as fh:
        data = pickle.load(fh)

    print(f"Loaded baseline data: {data.shape}")
    return data
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def load_new_data(n_samples: int = 500) -> np.ndarray:
    """
    Load new/production data to check for drift.

    In production, this would fetch from:
    - Database
    - S3 bucket
    - API logs
    - Data lake

    For now, load from ``data/test.csv`` if it exists, otherwise simulate
    new data by perturbing the baseline.

    Args:
        n_samples: Maximum number of samples to return (default 500,
            matching the previous hard-coded value).

    Returns:
        2-D feature matrix with at most ``n_samples`` rows.
    """
    # Option 1: Load from file
    data_path = PROJECT_ROOT / "data" / "test.csv"
    if data_path.exists():
        df = pd.read_csv(data_path)
        # Extract same features as baseline (drop metadata columns).
        feature_columns = [col for col in df.columns if col not in ['label', 'id', 'timestamp']]
        X_new = df[feature_columns].values[:n_samples]
        print(f"Loaded new data from file: {X_new.shape}")
        return X_new

    # Option 2: Simulate (for testing)
    print("Simulating new data (no test file found)")
    X_baseline = load_baseline()
    # Guard against baselines smaller than n_samples: the noise matrix must
    # have the same shape as the slice it is added to, otherwise numpy
    # raises a broadcasting error.
    rows = min(n_samples, X_baseline.shape[0])
    # Add slight Gaussian noise to simulate drift.
    X_new = X_baseline[:rows] + np.random.normal(0, 0.1, (rows, X_baseline.shape[1]))
    return X_new
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def run_drift_detection(X_baseline: np.ndarray, X_new: np.ndarray,
                        alpha: float = 0.05) -> Dict:
    """
    Run Kolmogorov-Smirnov drift detection using scipy.

    One two-sample KS test is run per feature column; the per-test
    significance level is Bonferroni-corrected (``alpha / num_features``)
    to control the family-wise error rate across features.

    Args:
        X_baseline: Reference data, shape (n_baseline, n_features).
        X_new: New data to check, shape (n_new, n_features).
        alpha: Family-wise significance level (default 0.05, matching the
            module-level ``P_VALUE_THRESHOLD``).

    Returns:
        Drift detection results: timestamp, drift flag (0/1), minimum
        p-value, adjusted threshold, maximum KS distance and sample counts.
    """
    banner = "=" * 60
    print("\n" + banner)
    print("Running Drift Detection (Kolmogorov-Smirnov Test)")
    print(banner)

    # Run one KS test per feature column.
    p_values = []
    distances = []
    for i in range(X_baseline.shape[1]):
        statistic, p_value = ks_2samp(X_baseline[:, i], X_new[:, i])
        p_values.append(p_value)
        distances.append(statistic)

    # Aggregate: the most significant feature drives the decision.
    min_p_value = np.min(p_values)
    max_distance = np.max(distances)

    # Bonferroni correction for multiple testing.
    adjusted_threshold = alpha / X_baseline.shape[1]
    drift_detected = min_p_value < adjusted_threshold

    results = {
        "timestamp": datetime.now().isoformat(),
        "drift_detected": int(drift_detected),
        "p_value": float(min_p_value),
        "threshold": adjusted_threshold,
        "distance": float(max_distance),
        "baseline_samples": X_baseline.shape[0],
        "new_samples": X_new.shape[0],
        "num_features": X_baseline.shape[1]
    }

    print(f"\nResults:")
    print(f"  Drift Detected: {'YES' if results['drift_detected'] else 'NO'}")
    print(f"  P-Value: {results['p_value']:.6f} (adjusted threshold: {adjusted_threshold:.6f})")
    print(f"  Distance: {results['distance']:.6f}")
    print(f"  Baseline: {X_baseline.shape[0]} samples")
    print(f"  New Data: {X_new.shape[0]} samples")

    return results
|
| 127 |
+
|
| 128 |
+
|
| 129 |
+
def save_report(results: Dict):
    """Persist the drift results as a timestamped JSON file under REPORTS_DIR."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = REPORTS_DIR / f"drift_report_{stamp}.json"

    with report_path.open('w') as fh:
        json.dump(results, fh, indent=2)

    print(f"\nReport saved to: {report_path}")
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
def push_to_prometheus(results: Dict):
    """
    Push drift metrics to Prometheus via Pushgateway.

    This allows Prometheus to scrape short-lived job metrics.

    Failures are logged to stdout but never raised, so a down gateway does
    not abort the drift check itself.
    """
    # Prometheus text exposition format; the trailing newline is required
    # by the parser.
    metrics = f"""# TYPE drift_detected gauge
# HELP drift_detected Whether data drift was detected (1=yes, 0=no)
drift_detected {results['drift_detected']}

# TYPE drift_p_value gauge
# HELP drift_p_value P-value from drift detection test
drift_p_value {results['p_value']}

# TYPE drift_distance gauge
# HELP drift_distance Statistical distance between distributions
drift_distance {results['distance']}

# TYPE drift_check_timestamp gauge
# HELP drift_check_timestamp Unix timestamp of last drift check
drift_check_timestamp {datetime.now().timestamp()}
"""

    try:
        response = requests.post(
            f"{PUSHGATEWAY_URL}/metrics/job/drift_detection/instance/hopcroft",
            data=metrics,
            headers={'Content-Type': 'text/plain'},
            # Without a timeout requests can block forever if the gateway
            # accepts the connection but never responds.
            timeout=10,
        )
        response.raise_for_status()
        print(f"Metrics pushed to Pushgateway at {PUSHGATEWAY_URL}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to push to Pushgateway: {e}")
        print(f"  Make sure Pushgateway is running: docker compose ps pushgateway")
|
| 174 |
+
|
| 175 |
+
|
| 176 |
+
def main():
    """Main execution: load data, detect drift, report, push metrics.

    Returns:
        Process exit code: 0 when no drift, 1 on drift or any error.
    """
    banner = "=" * 60
    print("\n" + banner)
    print("Hopcroft Data Drift Detection")
    print(banner)

    try:
        # Load reference and current data.
        baseline = load_baseline()
        new_data = load_new_data()

        # Run the statistical comparison.
        results = run_drift_detection(baseline, new_data)

        # Persist the JSON report, then expose metrics to Prometheus.
        save_report(results)
        push_to_prometheus(results)

        print("\n" + banner)
        print("Drift Detection Complete!")
        print(banner)

        if not results['drift_detected']:
            print("\nNo significant drift detected")
            return 0
        print("\nWARNING: Data drift detected!")
        print(f"  P-value: {results['p_value']:.6f} < {P_VALUE_THRESHOLD}")
        return 1

    except Exception as e:
        print(f"\nError: {e}")
        import traceback
        traceback.print_exc()
        return 1
|
| 213 |
+
|
| 214 |
+
|
| 215 |
+
if __name__ == "__main__":
    # Raise SystemExit instead of calling the site-provided exit() builtin,
    # which is intended for interactive use and is absent under `python -S`.
    raise SystemExit(main())
|
monitoring/grafana/dashboards/hopcroft_dashboard.json
ADDED
|
@@ -0,0 +1,358 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"annotations": {
|
| 3 |
+
"list": [
|
| 4 |
+
{
|
| 5 |
+
"builtIn": 1,
|
| 6 |
+
"datasource": "-- Grafana --",
|
| 7 |
+
"enable": true,
|
| 8 |
+
"hide": true,
|
| 9 |
+
"iconColor": "rgba(0, 211, 255, 1)",
|
| 10 |
+
"name": "Annotations & Alerts",
|
| 11 |
+
"type": "dashboard"
|
| 12 |
+
}
|
| 13 |
+
]
|
| 14 |
+
},
|
| 15 |
+
"editable": true,
|
| 16 |
+
"gnetId": null,
|
| 17 |
+
"graphTooltip": 1,
|
| 18 |
+
"id": null,
|
| 19 |
+
"links": [],
|
| 20 |
+
"panels": [
|
| 21 |
+
{
|
| 22 |
+
"datasource": "Prometheus",
|
| 23 |
+
"fieldConfig": {
|
| 24 |
+
"defaults": {
|
| 25 |
+
"color": {
|
| 26 |
+
"mode": "thresholds"
|
| 27 |
+
},
|
| 28 |
+
"mappings": [],
|
| 29 |
+
"thresholds": {
|
| 30 |
+
"mode": "absolute",
|
| 31 |
+
"steps": [
|
| 32 |
+
{
|
| 33 |
+
"color": "green",
|
| 34 |
+
"value": null
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"color": "red",
|
| 38 |
+
"value": 80
|
| 39 |
+
}
|
| 40 |
+
]
|
| 41 |
+
},
|
| 42 |
+
"unit": "reqps"
|
| 43 |
+
}
|
| 44 |
+
},
|
| 45 |
+
"gridPos": {
|
| 46 |
+
"h": 8,
|
| 47 |
+
"w": 6,
|
| 48 |
+
"x": 0,
|
| 49 |
+
"y": 0
|
| 50 |
+
},
|
| 51 |
+
"id": 1,
|
| 52 |
+
"options": {
|
| 53 |
+
"orientation": "auto",
|
| 54 |
+
"reduceOptions": {
|
| 55 |
+
"calcs": ["lastNotNull"],
|
| 56 |
+
"fields": "",
|
| 57 |
+
"values": false
|
| 58 |
+
},
|
| 59 |
+
"showThresholdLabels": false,
|
| 60 |
+
"showThresholdMarkers": true
|
| 61 |
+
},
|
| 62 |
+
"pluginVersion": "9.0.0",
|
| 63 |
+
"targets": [
|
| 64 |
+
{
|
| 65 |
+
"expr": "rate(fastapi_requests_total[1m])",
|
| 66 |
+
"refId": "A"
|
| 67 |
+
}
|
| 68 |
+
],
|
| 69 |
+
"title": "Request Rate",
|
| 70 |
+
"type": "gauge",
|
| 71 |
+
"description": "Number of requests per second handled by the API"
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"datasource": "Prometheus",
|
| 75 |
+
"fieldConfig": {
|
| 76 |
+
"defaults": {
|
| 77 |
+
"color": {
|
| 78 |
+
"mode": "palette-classic"
|
| 79 |
+
},
|
| 80 |
+
"custom": {
|
| 81 |
+
"axisLabel": "",
|
| 82 |
+
"axisPlacement": "auto",
|
| 83 |
+
"barAlignment": 0,
|
| 84 |
+
"drawStyle": "line",
|
| 85 |
+
"fillOpacity": 10,
|
| 86 |
+
"gradientMode": "none",
|
| 87 |
+
"hideFrom": {
|
| 88 |
+
"tooltip": false,
|
| 89 |
+
"viz": false,
|
| 90 |
+
"legend": false
|
| 91 |
+
},
|
| 92 |
+
"lineInterpolation": "linear",
|
| 93 |
+
"lineWidth": 1,
|
| 94 |
+
"pointSize": 5,
|
| 95 |
+
"scaleDistribution": {
|
| 96 |
+
"type": "linear"
|
| 97 |
+
},
|
| 98 |
+
"showPoints": "never",
|
| 99 |
+
"spanNulls": true
|
| 100 |
+
},
|
| 101 |
+
"mappings": [],
|
| 102 |
+
"thresholds": {
|
| 103 |
+
"mode": "absolute",
|
| 104 |
+
"steps": [
|
| 105 |
+
{
|
| 106 |
+
"color": "green",
|
| 107 |
+
"value": null
|
| 108 |
+
}
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
"unit": "ms"
|
| 112 |
+
}
|
| 113 |
+
},
|
| 114 |
+
"gridPos": {
|
| 115 |
+
"h": 8,
|
| 116 |
+
"w": 18,
|
| 117 |
+
"x": 6,
|
| 118 |
+
"y": 0
|
| 119 |
+
},
|
| 120 |
+
"id": 2,
|
| 121 |
+
"options": {
|
| 122 |
+
"legend": {
|
| 123 |
+
"calcs": ["mean", "max"],
|
| 124 |
+
"displayMode": "table",
|
| 125 |
+
"placement": "right"
|
| 126 |
+
},
|
| 127 |
+
"tooltip": {
|
| 128 |
+
"mode": "multi"
|
| 129 |
+
}
|
| 130 |
+
},
|
| 131 |
+
"pluginVersion": "9.0.0",
|
| 132 |
+
"targets": [
|
| 133 |
+
{
|
| 134 |
+
"expr": "histogram_quantile(0.95, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
|
| 135 |
+
"legendFormat": "p95",
|
| 136 |
+
"refId": "A"
|
| 137 |
+
},
|
| 138 |
+
{
|
| 139 |
+
"expr": "histogram_quantile(0.50, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
|
| 140 |
+
"legendFormat": "p50 (median)",
|
| 141 |
+
"refId": "B"
|
| 142 |
+
}
|
| 143 |
+
],
|
| 144 |
+
"title": "Request Latency (p50, p95)",
|
| 145 |
+
"type": "timeseries",
|
| 146 |
+
"description": "API response time percentiles over time"
|
| 147 |
+
},
|
| 148 |
+
{
|
| 149 |
+
"datasource": "Prometheus",
|
| 150 |
+
"fieldConfig": {
|
| 151 |
+
"defaults": {
|
| 152 |
+
"color": {
|
| 153 |
+
"mode": "thresholds"
|
| 154 |
+
},
|
| 155 |
+
"mappings": [
|
| 156 |
+
{
|
| 157 |
+
"options": {
|
| 158 |
+
"0": {
|
| 159 |
+
"color": "red",
|
| 160 |
+
"index": 1,
|
| 161 |
+
"text": "No Drift"
|
| 162 |
+
},
|
| 163 |
+
"1": {
|
| 164 |
+
"color": "green",
|
| 165 |
+
"index": 0,
|
| 166 |
+
"text": "Drift Detected"
|
| 167 |
+
}
|
| 168 |
+
},
|
| 169 |
+
"type": "value"
|
| 170 |
+
}
|
| 171 |
+
],
|
| 172 |
+
"thresholds": {
|
| 173 |
+
"mode": "absolute",
|
| 174 |
+
"steps": [
|
| 175 |
+
{
|
| 176 |
+
"color": "green",
|
| 177 |
+
"value": null
|
| 178 |
+
}
|
| 179 |
+
]
|
| 180 |
+
}
|
| 181 |
+
}
|
| 182 |
+
},
|
| 183 |
+
"gridPos": {
|
| 184 |
+
"h": 6,
|
| 185 |
+
"w": 6,
|
| 186 |
+
"x": 0,
|
| 187 |
+
"y": 8
|
| 188 |
+
},
|
| 189 |
+
"id": 3,
|
| 190 |
+
"options": {
|
| 191 |
+
"orientation": "auto",
|
| 192 |
+
"reduceOptions": {
|
| 193 |
+
"calcs": ["lastNotNull"],
|
| 194 |
+
"fields": "",
|
| 195 |
+
"values": false
|
| 196 |
+
},
|
| 197 |
+
"showThresholdLabels": false,
|
| 198 |
+
"showThresholdMarkers": true,
|
| 199 |
+
"text": {}
|
| 200 |
+
},
|
| 201 |
+
"pluginVersion": "9.0.0",
|
| 202 |
+
"targets": [
|
| 203 |
+
{
|
| 204 |
+
"expr": "drift_detected",
|
| 205 |
+
"refId": "A"
|
| 206 |
+
}
|
| 207 |
+
],
|
| 208 |
+
"title": "Data Drift Status",
|
| 209 |
+
"type": "stat",
|
| 210 |
+
"description": "Current data drift detection status (1 = drift detected, 0 = no drift)"
|
| 211 |
+
},
|
| 212 |
+
{
|
| 213 |
+
"datasource": "Prometheus",
|
| 214 |
+
"fieldConfig": {
|
| 215 |
+
"defaults": {
|
| 216 |
+
"color": {
|
| 217 |
+
"mode": "thresholds"
|
| 218 |
+
},
|
| 219 |
+
"decimals": 4,
|
| 220 |
+
"mappings": [],
|
| 221 |
+
"thresholds": {
|
| 222 |
+
"mode": "absolute",
|
| 223 |
+
"steps": [
|
| 224 |
+
{
              "color": "red",
              "value": null
            },
            {
              "color": "yellow",
              "value": 0.01
            },
            {
              "color": "green",
              "value": 0.05
            }
|
| 236 |
+
]
|
| 237 |
+
},
|
| 238 |
+
"unit": "short"
|
| 239 |
+
}
|
| 240 |
+
},
|
| 241 |
+
"gridPos": {
|
| 242 |
+
"h": 6,
|
| 243 |
+
"w": 6,
|
| 244 |
+
"x": 6,
|
| 245 |
+
"y": 8
|
| 246 |
+
},
|
| 247 |
+
"id": 4,
|
| 248 |
+
"options": {
|
| 249 |
+
"orientation": "auto",
|
| 250 |
+
"reduceOptions": {
|
| 251 |
+
"calcs": ["lastNotNull"],
|
| 252 |
+
"fields": "",
|
| 253 |
+
"values": false
|
| 254 |
+
},
|
| 255 |
+
"showThresholdLabels": false,
|
| 256 |
+
"showThresholdMarkers": true,
|
| 257 |
+
"text": {}
|
| 258 |
+
},
|
| 259 |
+
"pluginVersion": "9.0.0",
|
| 260 |
+
"targets": [
|
| 261 |
+
{
|
| 262 |
+
"expr": "drift_p_value",
|
| 263 |
+
"refId": "A"
|
| 264 |
+
}
|
| 265 |
+
],
|
| 266 |
+
"title": "Drift P-Value",
|
| 267 |
+
"type": "stat",
|
| 268 |
+
"description": "Statistical significance of detected drift (lower = more significant)"
|
| 269 |
+
},
|
| 270 |
+
{
|
| 271 |
+
"datasource": "Prometheus",
|
| 272 |
+
"fieldConfig": {
|
| 273 |
+
"defaults": {
|
| 274 |
+
"color": {
|
| 275 |
+
"mode": "palette-classic"
|
| 276 |
+
},
|
| 277 |
+
"custom": {
|
| 278 |
+
"axisLabel": "",
|
| 279 |
+
"axisPlacement": "auto",
|
| 280 |
+
"barAlignment": 0,
|
| 281 |
+
"drawStyle": "line",
|
| 282 |
+
"fillOpacity": 10,
|
| 283 |
+
"gradientMode": "none",
|
| 284 |
+
"hideFrom": {
|
| 285 |
+
"tooltip": false,
|
| 286 |
+
"viz": false,
|
| 287 |
+
"legend": false
|
| 288 |
+
},
|
| 289 |
+
"lineInterpolation": "linear",
|
| 290 |
+
"lineWidth": 1,
|
| 291 |
+
"pointSize": 5,
|
| 292 |
+
"scaleDistribution": {
|
| 293 |
+
"type": "linear"
|
| 294 |
+
},
|
| 295 |
+
"showPoints": "auto",
|
| 296 |
+
"spanNulls": false
|
| 297 |
+
},
|
| 298 |
+
"mappings": [],
|
| 299 |
+
"thresholds": {
|
| 300 |
+
"mode": "absolute",
|
| 301 |
+
"steps": [
|
| 302 |
+
{
|
| 303 |
+
"color": "green",
|
| 304 |
+
"value": null
|
| 305 |
+
}
|
| 306 |
+
]
|
| 307 |
+
},
|
| 308 |
+
"unit": "short"
|
| 309 |
+
}
|
| 310 |
+
},
|
| 311 |
+
"gridPos": {
|
| 312 |
+
"h": 6,
|
| 313 |
+
"w": 12,
|
| 314 |
+
"x": 12,
|
| 315 |
+
"y": 8
|
| 316 |
+
},
|
| 317 |
+
"id": 5,
|
| 318 |
+
"options": {
|
| 319 |
+
"legend": {
|
| 320 |
+
"calcs": ["mean", "lastNotNull"],
|
| 321 |
+
"displayMode": "table",
|
| 322 |
+
"placement": "right"
|
| 323 |
+
},
|
| 324 |
+
"tooltip": {
|
| 325 |
+
"mode": "multi"
|
| 326 |
+
}
|
| 327 |
+
},
|
| 328 |
+
"pluginVersion": "9.0.0",
|
| 329 |
+
"targets": [
|
| 330 |
+
{
|
| 331 |
+
"expr": "drift_distance",
|
| 332 |
+
"legendFormat": "Distance",
|
| 333 |
+
"refId": "A"
|
| 334 |
+
}
|
| 335 |
+
],
|
| 336 |
+
"title": "Drift Distance Over Time",
|
| 337 |
+
"type": "timeseries",
|
| 338 |
+
"description": "Statistical distance between baseline and current data distribution"
|
| 339 |
+
}
|
| 340 |
+
],
|
| 341 |
+
"refresh": "10s",
|
| 342 |
+
"schemaVersion": 36,
|
| 343 |
+
"style": "dark",
|
| 344 |
+
"tags": ["hopcroft", "ml", "monitoring"],
|
| 345 |
+
"templating": {
|
| 346 |
+
"list": []
|
| 347 |
+
},
|
| 348 |
+
"time": {
|
| 349 |
+
"from": "now-1h",
|
| 350 |
+
"to": "now"
|
| 351 |
+
},
|
| 352 |
+
"timepicker": {},
|
| 353 |
+
"timezone": "",
|
| 354 |
+
"title": "Hopcroft ML Model Monitoring",
|
| 355 |
+
"uid": "hopcroft-ml-dashboard",
|
| 356 |
+
"version": 1,
|
| 357 |
+
"weekStart": ""
|
| 358 |
+
}
|
monitoring/grafana/provisioning/dashboards/dashboard.yml
ADDED
|
@@ -0,0 +1,13 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
apiVersion: 1
|
| 2 |
+
|
| 3 |
+
providers:
|
| 4 |
+
- name: 'Hopcroft Dashboards'
|
| 5 |
+
orgId: 1
|
| 6 |
+
folder: ''
|
| 7 |
+
type: file
|
| 8 |
+
disableDeletion: false
|
| 9 |
+
updateIntervalSeconds: 10
|
| 10 |
+
allowUiUpdates: true
|
| 11 |
+
options:
|
| 12 |
+
path: /var/lib/grafana/dashboards
|
| 13 |
+
foldersFromFilesStructure: true
|
monitoring/grafana/provisioning/dashboards/hopcroft_dashboard.json
ADDED
|
@@ -0,0 +1,358 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"annotations": {
|
| 3 |
+
"list": [
|
| 4 |
+
{
|
| 5 |
+
"builtIn": 1,
|
| 6 |
+
"datasource": "-- Grafana --",
|
| 7 |
+
"enable": true,
|
| 8 |
+
"hide": true,
|
| 9 |
+
"iconColor": "rgba(0, 211, 255, 1)",
|
| 10 |
+
"name": "Annotations & Alerts",
|
| 11 |
+
"type": "dashboard"
|
| 12 |
+
}
|
| 13 |
+
]
|
| 14 |
+
},
|
| 15 |
+
"editable": true,
|
| 16 |
+
"gnetId": null,
|
| 17 |
+
"graphTooltip": 1,
|
| 18 |
+
"id": null,
|
| 19 |
+
"links": [],
|
| 20 |
+
"panels": [
|
| 21 |
+
{
|
| 22 |
+
"datasource": "Prometheus",
|
| 23 |
+
"fieldConfig": {
|
| 24 |
+
"defaults": {
|
| 25 |
+
"color": {
|
| 26 |
+
"mode": "thresholds"
|
| 27 |
+
},
|
| 28 |
+
"mappings": [],
|
| 29 |
+
"thresholds": {
|
| 30 |
+
"mode": "absolute",
|
| 31 |
+
"steps": [
|
| 32 |
+
{
|
| 33 |
+
"color": "green",
|
| 34 |
+
"value": null
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"color": "red",
|
| 38 |
+
"value": 80
|
| 39 |
+
}
|
| 40 |
+
]
|
| 41 |
+
},
|
| 42 |
+
"unit": "reqps"
|
| 43 |
+
}
|
| 44 |
+
},
|
| 45 |
+
"gridPos": {
|
| 46 |
+
"h": 8,
|
| 47 |
+
"w": 6,
|
| 48 |
+
"x": 0,
|
| 49 |
+
"y": 0
|
| 50 |
+
},
|
| 51 |
+
"id": 1,
|
| 52 |
+
"options": {
|
| 53 |
+
"orientation": "auto",
|
| 54 |
+
"reduceOptions": {
|
| 55 |
+
"calcs": ["lastNotNull"],
|
| 56 |
+
"fields": "",
|
| 57 |
+
"values": false
|
| 58 |
+
},
|
| 59 |
+
"showThresholdLabels": false,
|
| 60 |
+
"showThresholdMarkers": true
|
| 61 |
+
},
|
| 62 |
+
"pluginVersion": "9.0.0",
|
| 63 |
+
"targets": [
|
| 64 |
+
{
|
| 65 |
+
"expr": "rate(fastapi_requests_total[1m])",
|
| 66 |
+
"refId": "A"
|
| 67 |
+
}
|
| 68 |
+
],
|
| 69 |
+
"title": "Request Rate",
|
| 70 |
+
"type": "gauge",
|
| 71 |
+
"description": "Number of requests per second handled by the API"
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"datasource": "Prometheus",
|
| 75 |
+
"fieldConfig": {
|
| 76 |
+
"defaults": {
|
| 77 |
+
"color": {
|
| 78 |
+
"mode": "palette-classic"
|
| 79 |
+
},
|
| 80 |
+
"custom": {
|
| 81 |
+
"axisLabel": "",
|
| 82 |
+
"axisPlacement": "auto",
|
| 83 |
+
"barAlignment": 0,
|
| 84 |
+
"drawStyle": "line",
|
| 85 |
+
"fillOpacity": 10,
|
| 86 |
+
"gradientMode": "none",
|
| 87 |
+
"hideFrom": {
|
| 88 |
+
"tooltip": false,
|
| 89 |
+
"viz": false,
|
| 90 |
+
"legend": false
|
| 91 |
+
},
|
| 92 |
+
"lineInterpolation": "linear",
|
| 93 |
+
"lineWidth": 1,
|
| 94 |
+
"pointSize": 5,
|
| 95 |
+
"scaleDistribution": {
|
| 96 |
+
"type": "linear"
|
| 97 |
+
},
|
| 98 |
+
"showPoints": "never",
|
| 99 |
+
"spanNulls": true
|
| 100 |
+
},
|
| 101 |
+
"mappings": [],
|
| 102 |
+
"thresholds": {
|
| 103 |
+
"mode": "absolute",
|
| 104 |
+
"steps": [
|
| 105 |
+
{
|
| 106 |
+
"color": "green",
|
| 107 |
+
"value": null
|
| 108 |
+
}
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
"unit": "ms"
|
| 112 |
+
}
|
| 113 |
+
},
|
| 114 |
+
"gridPos": {
|
| 115 |
+
"h": 8,
|
| 116 |
+
"w": 18,
|
| 117 |
+
"x": 6,
|
| 118 |
+
"y": 0
|
| 119 |
+
},
|
| 120 |
+
"id": 2,
|
| 121 |
+
"options": {
|
| 122 |
+
"legend": {
|
| 123 |
+
"calcs": ["mean", "max"],
|
| 124 |
+
"displayMode": "table",
|
| 125 |
+
"placement": "right"
|
| 126 |
+
},
|
| 127 |
+
"tooltip": {
|
| 128 |
+
"mode": "multi"
|
| 129 |
+
}
|
| 130 |
+
},
|
| 131 |
+
"pluginVersion": "9.0.0",
|
| 132 |
+
"targets": [
|
| 133 |
+
{
|
| 134 |
+
"expr": "histogram_quantile(0.95, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
|
| 135 |
+
"legendFormat": "p95",
|
| 136 |
+
"refId": "A"
|
| 137 |
+
},
|
| 138 |
+
{
|
| 139 |
+
"expr": "histogram_quantile(0.50, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
|
| 140 |
+
"legendFormat": "p50 (median)",
|
| 141 |
+
"refId": "B"
|
| 142 |
+
}
|
| 143 |
+
],
|
| 144 |
+
"title": "Request Latency (p50, p95)",
|
| 145 |
+
"type": "timeseries",
|
| 146 |
+
"description": "API response time percentiles over time"
|
| 147 |
+
},
|
| 148 |
+
{
|
| 149 |
+
"datasource": "Prometheus",
|
| 150 |
+
"fieldConfig": {
|
| 151 |
+
"defaults": {
|
| 152 |
+
"color": {
|
| 153 |
+
"mode": "thresholds"
|
| 154 |
+
},
|
| 155 |
+
"mappings": [
|
| 156 |
+
{
|
| 157 |
+
"options": {
|
| 158 |
+
"0": {
|
| 159 |
+
"color": "red",
|
| 160 |
+
"index": 1,
|
| 161 |
+
"text": "No Drift"
|
| 162 |
+
},
|
| 163 |
+
"1": {
|
| 164 |
+
"color": "green",
|
| 165 |
+
"index": 0,
|
| 166 |
+
"text": "Drift Detected"
|
| 167 |
+
}
|
| 168 |
+
},
|
| 169 |
+
"type": "value"
|
| 170 |
+
}
|
| 171 |
+
],
|
| 172 |
+
"thresholds": {
|
| 173 |
+
"mode": "absolute",
|
| 174 |
+
"steps": [
|
| 175 |
+
{
|
| 176 |
+
"color": "green",
|
| 177 |
+
"value": null
|
| 178 |
+
}
|
| 179 |
+
]
|
| 180 |
+
}
|
| 181 |
+
}
|
| 182 |
+
},
|
| 183 |
+
"gridPos": {
|
| 184 |
+
"h": 6,
|
| 185 |
+
"w": 6,
|
| 186 |
+
"x": 0,
|
| 187 |
+
"y": 8
|
| 188 |
+
},
|
| 189 |
+
"id": 3,
|
| 190 |
+
"options": {
|
| 191 |
+
"orientation": "auto",
|
| 192 |
+
"reduceOptions": {
|
| 193 |
+
"calcs": ["lastNotNull"],
|
| 194 |
+
"fields": "",
|
| 195 |
+
"values": false
|
| 196 |
+
},
|
| 197 |
+
"showThresholdLabels": false,
|
| 198 |
+
"showThresholdMarkers": true,
|
| 199 |
+
"text": {}
|
| 200 |
+
},
|
| 201 |
+
"pluginVersion": "9.0.0",
|
| 202 |
+
"targets": [
|
| 203 |
+
{
|
| 204 |
+
"expr": "drift_detected",
|
| 205 |
+
"refId": "A"
|
| 206 |
+
}
|
| 207 |
+
],
|
| 208 |
+
"title": "Data Drift Status",
|
| 209 |
+
"type": "stat",
|
| 210 |
+
"description": "Current data drift detection status (1 = drift detected, 0 = no drift)"
|
| 211 |
+
},
|
| 212 |
+
{
|
| 213 |
+
"datasource": "Prometheus",
|
| 214 |
+
"fieldConfig": {
|
| 215 |
+
"defaults": {
|
| 216 |
+
"color": {
|
| 217 |
+
"mode": "thresholds"
|
| 218 |
+
},
|
| 219 |
+
"decimals": 4,
|
| 220 |
+
"mappings": [],
|
| 221 |
+
"thresholds": {
|
| 222 |
+
"mode": "absolute",
|
| 223 |
+
"steps": [
|
| 224 |
+
{
              "color": "red",
              "value": null
            },
            {
              "color": "yellow",
              "value": 0.01
            },
            {
              "color": "green",
              "value": 0.05
            }
|
| 236 |
+
]
|
| 237 |
+
},
|
| 238 |
+
"unit": "short"
|
| 239 |
+
}
|
| 240 |
+
},
|
| 241 |
+
"gridPos": {
|
| 242 |
+
"h": 6,
|
| 243 |
+
"w": 6,
|
| 244 |
+
"x": 6,
|
| 245 |
+
"y": 8
|
| 246 |
+
},
|
| 247 |
+
"id": 4,
|
| 248 |
+
"options": {
|
| 249 |
+
"orientation": "auto",
|
| 250 |
+
"reduceOptions": {
|
| 251 |
+
"calcs": ["lastNotNull"],
|
| 252 |
+
"fields": "",
|
| 253 |
+
"values": false
|
| 254 |
+
},
|
| 255 |
+
"showThresholdLabels": false,
|
| 256 |
+
"showThresholdMarkers": true,
|
| 257 |
+
"text": {}
|
| 258 |
+
},
|
| 259 |
+
"pluginVersion": "9.0.0",
|
| 260 |
+
"targets": [
|
| 261 |
+
{
|
| 262 |
+
"expr": "drift_p_value",
|
| 263 |
+
"refId": "A"
|
| 264 |
+
}
|
| 265 |
+
],
|
| 266 |
+
"title": "Drift P-Value",
|
| 267 |
+
"type": "stat",
|
| 268 |
+
"description": "Statistical significance of detected drift (lower = more significant)"
|
| 269 |
+
},
|
| 270 |
+
{
|
| 271 |
+
"datasource": "Prometheus",
|
| 272 |
+
"fieldConfig": {
|
| 273 |
+
"defaults": {
|
| 274 |
+
"color": {
|
| 275 |
+
"mode": "palette-classic"
|
| 276 |
+
},
|
| 277 |
+
"custom": {
|
| 278 |
+
"axisLabel": "",
|
| 279 |
+
"axisPlacement": "auto",
|
| 280 |
+
"barAlignment": 0,
|
| 281 |
+
"drawStyle": "line",
|
| 282 |
+
"fillOpacity": 10,
|
| 283 |
+
"gradientMode": "none",
|
| 284 |
+
"hideFrom": {
|
| 285 |
+
"tooltip": false,
|
| 286 |
+
"viz": false,
|
| 287 |
+
"legend": false
|
| 288 |
+
},
|
| 289 |
+
"lineInterpolation": "linear",
|
| 290 |
+
"lineWidth": 1,
|
| 291 |
+
"pointSize": 5,
|
| 292 |
+
"scaleDistribution": {
|
| 293 |
+
"type": "linear"
|
| 294 |
+
},
|
| 295 |
+
"showPoints": "auto",
|
| 296 |
+
"spanNulls": false
|
| 297 |
+
},
|
| 298 |
+
"mappings": [],
|
| 299 |
+
"thresholds": {
|
| 300 |
+
"mode": "absolute",
|
| 301 |
+
"steps": [
|
| 302 |
+
{
|
| 303 |
+
"color": "green",
|
| 304 |
+
"value": null
|
| 305 |
+
}
|
| 306 |
+
]
|
| 307 |
+
},
|
| 308 |
+
"unit": "short"
|
| 309 |
+
}
|
| 310 |
+
},
|
| 311 |
+
"gridPos": {
|
| 312 |
+
"h": 6,
|
| 313 |
+
"w": 12,
|
| 314 |
+
"x": 12,
|
| 315 |
+
"y": 8
|
| 316 |
+
},
|
| 317 |
+
"id": 5,
|
| 318 |
+
"options": {
|
| 319 |
+
"legend": {
|
| 320 |
+
"calcs": ["mean", "lastNotNull"],
|
| 321 |
+
"displayMode": "table",
|
| 322 |
+
"placement": "right"
|
| 323 |
+
},
|
| 324 |
+
"tooltip": {
|
| 325 |
+
"mode": "multi"
|
| 326 |
+
}
|
| 327 |
+
},
|
| 328 |
+
"pluginVersion": "9.0.0",
|
| 329 |
+
"targets": [
|
| 330 |
+
{
|
| 331 |
+
"expr": "drift_distance",
|
| 332 |
+
"legendFormat": "Distance",
|
| 333 |
+
"refId": "A"
|
| 334 |
+
}
|
| 335 |
+
],
|
| 336 |
+
"title": "Drift Distance Over Time",
|
| 337 |
+
"type": "timeseries",
|
| 338 |
+
"description": "Statistical distance between baseline and current data distribution"
|
| 339 |
+
}
|
| 340 |
+
],
|
| 341 |
+
"refresh": "10s",
|
| 342 |
+
"schemaVersion": 36,
|
| 343 |
+
"style": "dark",
|
| 344 |
+
"tags": ["hopcroft", "ml", "monitoring"],
|
| 345 |
+
"templating": {
|
| 346 |
+
"list": []
|
| 347 |
+
},
|
| 348 |
+
"time": {
|
| 349 |
+
"from": "now-1h",
|
| 350 |
+
"to": "now"
|
| 351 |
+
},
|
| 352 |
+
"timepicker": {},
|
| 353 |
+
"timezone": "",
|
| 354 |
+
"title": "Hopcroft ML Model Monitoring",
|
| 355 |
+
"uid": "hopcroft-ml-dashboard",
|
| 356 |
+
"version": 1,
|
| 357 |
+
"weekStart": ""
|
| 358 |
+
}
|
monitoring/grafana/provisioning/datasources/prometheus.yml
ADDED
|
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
apiVersion: 1
|
| 2 |
+
|
| 3 |
+
datasources:
|
| 4 |
+
- name: Prometheus
|
| 5 |
+
type: prometheus
|
| 6 |
+
access: proxy
|
| 7 |
+
uid: prometheus
|
| 8 |
+
orgId: 1
|
| 9 |
+
url: http://prometheus:9090
|
| 10 |
+
isDefault: true
|
| 11 |
+
editable: true
|
| 12 |
+
jsonData:
|
| 13 |
+
httpMethod: POST
|
| 14 |
+
timeInterval: "15s"
|
monitoring/locust/README.md
ADDED
|
@@ -0,0 +1,101 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Locust Load Testing - Skill Classification API
|
| 2 |
+
|
| 3 |
+
Questa directory contiene gli script per il load testing della Skill Classification API utilizzando [Locust](https://locust.io/).
|
| 4 |
+
|
| 5 |
+
## Prerequisiti
|
| 6 |
+
|
| 7 |
+
Assicurati di avere Python installato e installa Locust:
|
| 8 |
+
|
| 9 |
+
```bash
|
| 10 |
+
pip install locust
|
| 11 |
+
```
|
| 12 |
+
|
| 13 |
+
## Avvio del Test
|
| 14 |
+
|
| 15 |
+
### 1. Avvia Locust
|
| 16 |
+
|
| 17 |
+
Dalla directory `monitoring/locust/`, esegui:
|
| 18 |
+
|
| 19 |
+
```bash
|
| 20 |
+
locust -f locustfile.py
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
### 2. Accedi alla Web UI
|
| 24 |
+
|
| 25 |
+
Apri il browser e vai a: **http://localhost:8089**
|
| 26 |
+
|
| 27 |
+
### 3. Configura il Test
|
| 28 |
+
|
| 29 |
+
Nella Web UI, configura i seguenti parametri:
|
| 30 |
+
|
| 31 |
+
| Parametro | Descrizione | Valore Consigliato |
|
| 32 |
+
|-----------|-------------|-------------------|
|
| 33 |
+
| **Host** | URL dell'API da testare | `http://localhost:8080` (Docker) o `http://localhost:8000` (locale) |
|
| 34 |
+
| **Number of users** | Numero totale di utenti simulati | 10-100 |
|
| 35 |
+
| **Spawn rate** | Utenti da creare al secondo | 1-10 |
|
| 36 |
+
|
| 37 |
+
### 4. Avvia il Test
|
| 38 |
+
|
| 39 |
+
Clicca su **"Start swarming"** per avviare il test di carico.
|
| 40 |
+
|
| 41 |
+
## Task Implementati
|
| 42 |
+
|
| 43 |
+
Lo script simula il comportamento di utenti reali con i seguenti task:
|
| 44 |
+
|
| 45 |
+
| Task | Endpoint | Metodo | Peso | Descrizione |
|
| 46 |
+
|------|----------|--------|------|-------------|
|
| 47 |
+
| **Predizione Singola** | `/predict` | POST | 3 | Classifica un singolo issue text. Task principale, eseguito più frequentemente. |
|
| 48 |
+
| **Predizione Batch** | `/predict/batch` | POST | 1 | Classifica multipli issue text in una singola richiesta. |
|
| 49 |
+
| **Monitoraggio e Storia** | `/predictions`, `/health` | GET | 1 | Visualizza la cronologia delle predizioni e verifica lo stato del sistema. |
|
| 50 |
+
|
| 51 |
+
### Distribuzione dei Pesi
|
| 52 |
+
|
| 53 |
+
Con i pesi configurati (3:1:1), la distribuzione approssimativa delle richieste è:
|
| 54 |
+
- **60%** - Predizione Singola
|
| 55 |
+
- **20%** - Predizione Batch
|
| 56 |
+
- **20%** - Monitoraggio e Storia
|
| 57 |
+
|
| 58 |
+
### Tempo di Attesa
|
| 59 |
+
|
| 60 |
+
Ogni utente attende tra **1 e 5 secondi** tra un task e l'altro per simulare un comportamento realistico.
|
| 61 |
+
|
| 62 |
+
## Metriche Monitorate
|
| 63 |
+
|
| 64 |
+
Durante il test, Locust fornisce le seguenti metriche in tempo reale:
|
| 65 |
+
|
| 66 |
+
- **RPS (Requests Per Second)**: Numero di richieste al secondo
|
| 67 |
+
- **Response Time**: Tempo medio/mediano/percentili di risposta
|
| 68 |
+
- **Failure Rate**: Percentuale di richieste fallite
|
| 69 |
+
- **Active Users**: Numero di utenti attualmente attivi
|
| 70 |
+
|
| 71 |
+
## Opzioni Avanzate
|
| 72 |
+
|
| 73 |
+
### Esecuzione Headless (senza UI)
|
| 74 |
+
|
| 75 |
+
```bash
|
| 76 |
+
locust -f locustfile.py --headless -u 50 -r 5 -t 5m --host http://localhost:8000
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
| Opzione | Descrizione |
|
| 80 |
+
|---------|-------------|
|
| 81 |
+
| `--headless` | Esegui senza Web UI |
|
| 82 |
+
| `-u 50` | 50 utenti simulati |
|
| 83 |
+
| `-r 5` | 5 utenti creati al secondo |
|
| 84 |
+
| `-t 5m` | Durata del test: 5 minuti |
|
| 85 |
+
| `--host` | URL dell'API |
|
| 86 |
+
|
| 87 |
+
### Esportazione Risultati
|
| 88 |
+
|
| 89 |
+
```bash
|
| 90 |
+
locust -f locustfile.py --headless -u 50 -r 5 -t 5m --host http://localhost:8000 --csv=results
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
Questo creerà file CSV con i risultati del test.
|
| 94 |
+
|
| 95 |
+
## Struttura File
|
| 96 |
+
|
| 97 |
+
```
|
| 98 |
+
monitoring/locust/
|
| 99 |
+
├── locustfile.py # Script principale di load testing
|
| 100 |
+
└── README.md # Questa documentazione
|
| 101 |
+
```
|
monitoring/locust/locustfile.py
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Locust Load Testing Script for Skill Classification API
|
| 3 |
+
|
| 4 |
+
This script defines user behavior for load testing the prediction and monitoring
|
| 5 |
+
endpoints of the Skill Classification API.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from locust import HttpUser, task, between
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
class SkillClassificationUser(HttpUser):
|
| 12 |
+
"""
|
| 13 |
+
Simulated user for load testing the Skill Classification API.
|
| 14 |
+
|
| 15 |
+
This user performs the following actions:
|
| 16 |
+
- Single predictions (most frequent)
|
| 17 |
+
- Batch predictions
|
| 18 |
+
- Monitoring and health checks
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
# Default host for the API (can be overridden via --host flag or Web UI)
|
| 22 |
+
# Use http://localhost:8080 for Docker or http://localhost:8000 for local dev
|
| 23 |
+
host = "http://localhost:8080"
|
| 24 |
+
|
| 25 |
+
# Wait between 1 and 5 seconds between tasks to simulate real user behavior
|
| 26 |
+
wait_time = between(1, 5)
|
| 27 |
+
|
| 28 |
+
@task(3)
|
| 29 |
+
def predict_single(self):
|
| 30 |
+
"""
|
| 31 |
+
Task 1: Single Prediction (Weight: 3)
|
| 32 |
+
|
| 33 |
+
Performs a POST request to /predict with a single issue text.
|
| 34 |
+
This is the main task and executes more frequently due to higher weight.
|
| 35 |
+
"""
|
| 36 |
+
payload = {
|
| 37 |
+
"issue_text": "Fix authentication bug in login module"
|
| 38 |
+
}
|
| 39 |
+
|
| 40 |
+
with self.client.post(
|
| 41 |
+
"/predict",
|
| 42 |
+
json=payload,
|
| 43 |
+
catch_response=True
|
| 44 |
+
) as response:
|
| 45 |
+
if response.status_code == 201:
|
| 46 |
+
response.success()
|
| 47 |
+
else:
|
| 48 |
+
response.failure(f"Prediction failed with status {response.status_code}")
|
| 49 |
+
|
| 50 |
+
@task(1)
|
| 51 |
+
def predict_batch(self):
|
| 52 |
+
"""
|
| 53 |
+
Task 2: Batch Prediction (Weight: 1)
|
| 54 |
+
|
| 55 |
+
Performs a POST request to /predict/batch with multiple issue texts.
|
| 56 |
+
"""
|
| 57 |
+
payload = {
|
| 58 |
+
"issues": [
|
| 59 |
+
{"issue_text": "Test 1"},
|
| 60 |
+
{"issue_text": "Test 2"}
|
| 61 |
+
]
|
| 62 |
+
}
|
| 63 |
+
|
| 64 |
+
with self.client.post(
|
| 65 |
+
"/predict/batch",
|
| 66 |
+
json=payload,
|
| 67 |
+
catch_response=True
|
| 68 |
+
) as response:
|
| 69 |
+
if response.status_code == 200:
|
| 70 |
+
response.success()
|
| 71 |
+
else:
|
| 72 |
+
response.failure(f"Batch prediction failed with status {response.status_code}")
|
| 73 |
+
|
| 74 |
+
@task(1)
|
| 75 |
+
def monitoring_and_history(self):
|
| 76 |
+
"""
|
| 77 |
+
Task 3: Monitoring and History (Weight: 1)
|
| 78 |
+
|
| 79 |
+
Performs GET requests to check prediction history and system health.
|
| 80 |
+
"""
|
| 81 |
+
# Check prediction history
|
| 82 |
+
with self.client.get(
|
| 83 |
+
"/predictions",
|
| 84 |
+
catch_response=True
|
| 85 |
+
) as response:
|
| 86 |
+
if 200 <= response.status_code < 300:
|
| 87 |
+
response.success()
|
| 88 |
+
else:
|
| 89 |
+
response.failure(f"Predictions history failed with status {response.status_code}")
|
| 90 |
+
|
| 91 |
+
# Check system health
|
| 92 |
+
with self.client.get(
|
| 93 |
+
"/health",
|
| 94 |
+
catch_response=True
|
| 95 |
+
) as response:
|
| 96 |
+
if 200 <= response.status_code < 300:
|
| 97 |
+
response.success()
|
| 98 |
+
else:
|
| 99 |
+
response.failure(f"Health check failed with status {response.status_code}")
|
monitoring/prometheus/alert_rules.yml
ADDED
|
@@ -0,0 +1,32 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
groups:
|
| 2 |
+
- name: hopcroft_alerts
|
| 3 |
+
rules:
|
| 4 |
+
- alert: ServiceDown
|
| 5 |
+
expr: up == 0
|
| 6 |
+
for: 1m
|
| 7 |
+
labels:
|
| 8 |
+
severity: critical
|
| 9 |
+
annotations:
|
| 10 |
+
summary: "Service {{ $labels.instance }} is down"
|
| 11 |
+
description: "The job {{ $labels.job }} has been down for more than 1 minute."
|
| 12 |
+
|
| 13 |
+
- alert: HighErrorRate
|
| 14 |
+
expr: |
|
| 15 |
+
sum(rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
|
| 16 |
+
/
|
| 17 |
+
sum(rate(hopcroft_requests_total[5m])) > 0.1
|
| 18 |
+
for: 5m
|
| 19 |
+
labels:
|
| 20 |
+
severity: warning
|
| 21 |
+
annotations:
|
| 22 |
+
summary: "High error rate on {{ $labels.instance }}"
|
| 23 |
+
description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
|
| 24 |
+
|
| 25 |
+
- alert: SlowRequests
|
| 26 |
+
expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
|
| 27 |
+
for: 5m
|
| 28 |
+
labels:
|
| 29 |
+
severity: warning
|
| 30 |
+
annotations:
|
| 31 |
+
summary: "Slow requests on {{ $labels.endpoint }}"
|
| 32 |
+
description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
|
monitoring/prometheus/prometheus.yml
ADDED
|
@@ -0,0 +1,32 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
global:
|
| 2 |
+
scrape_interval: 15s
|
| 3 |
+
evaluation_interval: 15s
|
| 4 |
+
external_labels:
|
| 5 |
+
monitor: 'hopcroft-monitor'
|
| 6 |
+
environment: 'development'
|
| 7 |
+
|
| 8 |
+
rule_files:
|
| 9 |
+
- "alert_rules.yml"
|
| 10 |
+
|
| 11 |
+
alerting:
|
| 12 |
+
alertmanagers:
|
| 13 |
+
- static_configs:
|
| 14 |
+
- targets:
|
| 15 |
+
- 'alertmanager:9093'
|
| 16 |
+
|
| 17 |
+
scrape_configs:
|
| 18 |
+
- job_name: 'hopcroft-api'
|
| 19 |
+
metrics_path: '/metrics'
|
| 20 |
+
static_configs:
|
| 21 |
+
- targets: ['hopcroft-api:8080']
|
| 22 |
+
scrape_interval: 10s
|
| 23 |
+
|
| 24 |
+
- job_name: 'prometheus'
|
| 25 |
+
static_configs:
|
| 26 |
+
- targets: ['localhost:9090']
|
| 27 |
+
|
| 28 |
+
- job_name: 'pushgateway'
|
| 29 |
+
honor_labels: true
|
| 30 |
+
static_configs:
|
| 31 |
+
- targets: ['pushgateway:9091']
|
| 32 |
+
scrape_interval: 30s
|
monitoring/screenshots/incident acknowlege mail.png
ADDED
|
Git LFS Details
|
monitoring/screenshots/incident acknowlege.png
ADDED
|
Git LFS Details
|
monitoring/screenshots/incident mail.png
ADDED
|
Git LFS Details
|
monitoring/screenshots/incident resolved mail.png
ADDED
|
Git LFS Details
|
monitoring/screenshots/incident resolved.png
ADDED
|
Git LFS Details
|
monitoring/screenshots/incident.png
ADDED
|
Git LFS Details
|
monitoring/screenshots/monitors.png
ADDED
|
Git LFS Details
|
nginx.conf
ADDED
|
@@ -0,0 +1,92 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
worker_processes 1;
|
| 2 |
+
pid /tmp/nginx.pid;
|
| 3 |
+
error_log stderr info; # Log to stderr to see errors in HF Space Logs
|
| 4 |
+
|
| 5 |
+
events {
|
| 6 |
+
worker_connections 1024;
|
| 7 |
+
}
|
| 8 |
+
|
| 9 |
+
http {
|
| 10 |
+
include /etc/nginx/mime.types;
|
| 11 |
+
default_type application/octet-stream;
|
| 12 |
+
|
| 13 |
+
# HF Space runs as non-root, use /tmp for everything
|
| 14 |
+
access_log /dev/stdout;
|
| 15 |
+
client_body_temp_path /tmp/client_temp;
|
| 16 |
+
proxy_temp_path /tmp/proxy_temp;
|
| 17 |
+
fastcgi_temp_path /tmp/fastcgi_temp;
|
| 18 |
+
uwsgi_temp_path /tmp/uwsgi_temp;
|
| 19 |
+
scgi_temp_path /tmp/scgi_temp;
|
| 20 |
+
|
| 21 |
+
sendfile on;
|
| 22 |
+
keepalive_timeout 65;
|
| 23 |
+
|
| 24 |
+
upstream streamlit {
|
| 25 |
+
server 127.0.0.1:8501;
|
| 26 |
+
}
|
| 27 |
+
|
| 28 |
+
upstream fastapi {
|
| 29 |
+
server 127.0.0.1:8000;
|
| 30 |
+
}
|
| 31 |
+
|
| 32 |
+
server {
|
| 33 |
+
listen 7860;
|
| 34 |
+
server_name localhost;
|
| 35 |
+
|
| 36 |
+
# Health endpoint for HF readiness check
|
| 37 |
+
location /health {
|
| 38 |
+
proxy_pass http://fastapi/health;
|
| 39 |
+
proxy_set_header Host $host;
|
| 40 |
+
}
|
| 41 |
+
|
| 42 |
+
# FastAPI Documentation
|
| 43 |
+
location /docs {
|
| 44 |
+
proxy_pass http://fastapi/docs;
|
| 45 |
+
proxy_set_header Host $host;
|
| 46 |
+
proxy_set_header X-Real-IP $remote_addr;
|
| 47 |
+
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
| 48 |
+
proxy_set_header X-Forwarded-Proto $scheme;
|
| 49 |
+
}
|
| 50 |
+
|
| 51 |
+
location /redoc {
|
| 52 |
+
proxy_pass http://fastapi/redoc;
|
| 53 |
+
proxy_set_header Host $host;
|
| 54 |
+
}
|
| 55 |
+
|
| 56 |
+
location /openapi.json {
|
| 57 |
+
proxy_pass http://fastapi/openapi.json;
|
| 58 |
+
proxy_set_header Host $host;
|
| 59 |
+
}
|
| 60 |
+
|
| 61 |
+
# FastAPI API Endpoints
|
| 62 |
+
location /predict {
|
| 63 |
+
proxy_pass http://fastapi/predict;
|
| 64 |
+
proxy_set_header Host $host;
|
| 65 |
+
}
|
| 66 |
+
|
| 67 |
+
location /predictions {
|
| 68 |
+
proxy_pass http://fastapi/predictions;
|
| 69 |
+
proxy_set_header Host $host;
|
| 70 |
+
}
|
| 71 |
+
|
| 72 |
+
# Streamlit (Catch-all)
|
| 73 |
+
location / {
|
| 74 |
+
proxy_pass http://streamlit;
|
| 75 |
+
proxy_set_header Host $host;
|
| 76 |
+
proxy_set_header X-Real-IP $remote_addr;
|
| 77 |
+
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
| 78 |
+
proxy_set_header X-Forwarded-Proto $scheme;
|
| 79 |
+
proxy_set_header X-Forwarded-Host $host;
|
| 80 |
+
|
| 81 |
+
# WebSocket support for Streamlit
|
| 82 |
+
proxy_http_version 1.1;
|
| 83 |
+
proxy_set_header Upgrade $http_upgrade;
|
| 84 |
+
proxy_set_header Connection "upgrade";
|
| 85 |
+
proxy_read_timeout 86400;
|
| 86 |
+
|
| 87 |
+
# Prevent 502 if Streamlit is slow
|
| 88 |
+
proxy_connect_timeout 60s;
|
| 89 |
+
proxy_send_timeout 60s;
|
| 90 |
+
}
|
| 91 |
+
}
|
| 92 |
+
}
|
reports/alerting_test_report/alerting_report.md
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Prometheus + Alertmanager Alerting Report
|
| 2 |
+
|
| 3 |
+
This report documents the configuration and verification of the alerting system for the Hopcroft Project.
|
| 4 |
+
|
| 5 |
+
## 1. Alerting Rules
|
| 6 |
+
The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
|
| 7 |
+
- **ServiceDown**: Triggers if a service is unreachable for 1 minute.
|
| 8 |
+
- **HighErrorRate**: Triggers if the error rate exceeds 10%.
|
| 9 |
+
- **SlowRequests**: Triggers if the 95th percentile of request latency exceeds 2 seconds.
|
| 10 |
+
|
| 11 |
+
## 2. Alertmanager Configuration
|
| 12 |
+
The `monitoring/alertmanager/config.yml` file includes:
|
| 13 |
+
- **Grouping**: Alerts are grouped by `alertname` and `severity`.
|
| 14 |
+
- **Inhibition**: Critical alerts suppress warning-level alerts.
|
| 15 |
+
- **Receiver**: A webhook receiver is configured to forward notifications.
|
| 16 |
+
|
| 17 |
+
## 3. Verification of "Firing" Alert
|
| 18 |
+
The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute threshold to be reached.
|
| 19 |
+
|
| 20 |
+
### Verification Proofs:
|
| 21 |
+
|
| 22 |
+
1. **Prometheus - Alert Firing**:
|
| 23 |
+
The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
|
| 24 |
+

|
| 25 |
+
|
| 26 |
+
2. **Alertmanager - Notification Received**:
|
| 27 |
+
The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
|
| 28 |
+

|
| 29 |
+
|
| 30 |
+
### Restoration
|
| 31 |
+
Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.
|
reports/alerting_test_report/alertmanager_firing.png
ADDED
|
Git LFS Details
|
reports/alerting_test_report/prometheus_firing.png
ADDED
|
Git LFS Details
|
requirements.txt
CHANGED
|
@@ -23,6 +23,7 @@ sentence-transformers
|
|
| 23 |
|
| 24 |
# API Framework
|
| 25 |
fastapi[standard]>=0.115.0
|
|
|
|
| 26 |
pydantic>=2.0.0
|
| 27 |
uvicorn>=0.30.0
|
| 28 |
httpx>=0.27.0
|
|
@@ -47,6 +48,8 @@ pytest-json-report>=1.5.0
|
|
| 47 |
pytest-cov>=4.0.0
|
| 48 |
pytest-xdist>=3.0.0
|
| 49 |
|
|
|
|
|
|
|
| 50 |
# Data validation and quality
|
| 51 |
great_expectations>=0.18.0
|
| 52 |
deepchecks>=0.18.0
|
|
|
|
| 23 |
|
| 24 |
# API Framework
|
| 25 |
fastapi[standard]>=0.115.0
|
| 26 |
+
prometheus-client>=0.17.0
|
| 27 |
pydantic>=2.0.0
|
| 28 |
uvicorn>=0.30.0
|
| 29 |
httpx>=0.27.0
|
|
|
|
| 48 |
pytest-cov>=4.0.0
|
| 49 |
pytest-xdist>=3.0.0
|
| 50 |
|
| 51 |
+
# Load testing
|
| 52 |
+
locust>=2.20.0
|
| 53 |
# Data validation and quality
|
| 54 |
great_expectations>=0.18.0
|
| 55 |
deepchecks>=0.18.0
|
scripts/start_space.sh
CHANGED
|
@@ -16,28 +16,82 @@ USER=${DAGSHUB_USERNAME:-$MLFLOW_TRACKING_USERNAME}
|
|
| 16 |
PASS=${DAGSHUB_TOKEN:-$MLFLOW_TRACKING_PASSWORD}
|
| 17 |
|
| 18 |
if [ -n "$USER" ] && [ -n "$PASS" ]; then
|
| 19 |
-
echo "Configuring DVC authentication for DagsHub..."
|
| 20 |
# Configure local config (not committed)
|
| 21 |
dvc remote modify origin --local auth basic
|
| 22 |
dvc remote modify origin --local user "$USER"
|
| 23 |
dvc remote modify origin --local password "$PASS"
|
| 24 |
else
|
| 25 |
-
echo "WARNING: No DagsHub credentials found. DVC pull might fail if the remote is private."
|
| 26 |
fi
|
| 27 |
|
| 28 |
-
echo "Pulling models from DVC..."
|
| 29 |
# Pull only the necessary files for inference
|
| 30 |
dvc pull models/random_forest_tfidf_gridsearch.pkl.dvc \
|
| 31 |
models/tfidf_vectorizer.pkl.dvc \
|
| 32 |
-
models/label_names.pkl.dvc
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
# Wait for API to start
|
| 38 |
-
echo "Waiting for API to start..."
|
| 39 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 40 |
|
| 41 |
-
echo "
|
| 42 |
-
|
| 43 |
-
streamlit run hopcroft_skill_classification_tool_competition/streamlit_app.py --server.port 7860 --server.address 0.0.0.0
|
|
|
|
PASS=${DAGSHUB_TOKEN:-$MLFLOW_TRACKING_PASSWORD}

if [ -n "$USER" ] && [ -n "$PASS" ]; then
    echo "$(date) - Configuring DVC authentication for DagsHub..."
    # Configure local config (not committed)
    dvc remote modify origin --local auth basic
    dvc remote modify origin --local user "$USER"
    dvc remote modify origin --local password "$PASS"
else
    echo "$(date) - WARNING: No DagsHub credentials found. DVC pull might fail if the remote is private."
fi

echo "$(date) - Pulling models from DVC..."
# Pull only the necessary files for inference; deliberately best-effort so the
# Space still boots and surfaces the problem instead of dying silently.
dvc pull models/random_forest_tfidf_gridsearch.pkl.dvc \
         models/tfidf_vectorizer.pkl.dvc \
         models/label_names.pkl.dvc || echo "DVC pull failed, but continuing..."

# Create Nginx temp directories (HF Space runs as non-root, so nginx.conf
# points all temp paths at /tmp)
mkdir -p /tmp/client_temp /tmp/proxy_temp /tmp/fastcgi_temp /tmp/uwsgi_temp /tmp/scgi_temp

echo "$(date) - Checking models existence..."
ls -la models/

echo "$(date) - Starting FastAPI application in background..."
# Using 0.0.0.0 to be safe
uvicorn hopcroft_skill_classification_tool_competition.main:app --host 0.0.0.0 --port 8000 >> /tmp/fastapi.log 2>&1 &

# Wait for API to start: poll /health every 2s, up to 30 attempts (~60s).
echo "$(date) - Waiting for API to start (30s)..."
api_up=0
for i in {1..30}; do
    if curl -s http://127.0.0.1:8000/health > /dev/null; then
        echo "$(date) - API is UP!"
        api_up=1
        break
    fi
    echo "$(date) - Waiting... ($i/30)"
    sleep 2
done
# Surface the failure explicitly instead of proceeding silently.
if [ "$api_up" -ne 1 ]; then
    echo "$(date) - WARNING: API did not become healthy in time. Recent FastAPI log:"
    tail -n 50 /tmp/fastapi.log || true
fi

echo "$(date) - Starting Nginx reverse proxy..."
if ! command -v nginx > /dev/null 2>&1; then
    echo "$(date) - ERROR: nginx not found in PATH"
    exit 1
fi
nginx -c /app/nginx.conf -g "daemon off;" >> /tmp/nginx_startup.log 2>&1 &

echo "$(date) - Waiting for Nginx to initialize..."
sleep 5

# Check if Nginx is running via its pid file (declared in nginx.conf).
# This replaces the fragile `ps aux | grep` pattern, which can false-positive
# on unrelated processes whose command line mentions "nginx".
if [ -f /tmp/nginx.pid ] && kill -0 "$(cat /tmp/nginx.pid)" 2> /dev/null; then
    echo "$(date) - Nginx is running."
else
    echo "$(date) - ERROR: Nginx failed to start. Logs:"
    cat /tmp/nginx_startup.log
fi

echo "$(date) - Final backend check before starting Streamlit..."
curl -v http://127.0.0.1:8000/health || echo "FastAPI health check failed!"

echo "$(date) - Starting Streamlit application on 127.0.0.1:8501..."
export API_BASE_URL="http://127.0.0.1:8000"
streamlit run hopcroft_skill_classification_tool_competition/streamlit_app.py \
    --server.port 8501 \
    --server.address 127.0.0.1 \
    --server.enableCORS=false \
    --server.enableXsrfProtection=false \
    --server.headless true &

# Wait for Streamlit to start: same 2s x 30 polling loop against /healthz.
echo "$(date) - Waiting for Streamlit to start (30s)..."
for i in {1..30}; do
    if curl -s http://127.0.0.1:8501/healthz > /dev/null; then
        echo "$(date) - Streamlit is UP!"
        break
    fi
    echo "$(date) - Waiting for Streamlit... ($i/30)"
    sleep 2
done

# Keep the container's foreground process alive and stream logs for debugging.
echo "$(date) - Process started. Tailing Nginx logs for debug..."
tail -f /tmp/nginx_startup.log /tmp/fastapi.log
|
|
|