
# Design Choices

Technical justification of the architectural and engineering decisions made during development of the Hopcroft project, following professional MLOps and software engineering standards.


## Table of Contents

  1. Inception (Requirements Engineering)
  2. Reproducibility (Versioning & Pipelines)
  3. Quality Assurance
  4. API (Inference Service)
  5. Deployment (Containerization & CI/CD)
  6. Monitoring

## 1. Inception (Requirements Engineering)

### Machine Learning Canvas

The project adopted the Machine Learning Canvas framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.

| Canvas Section | Application |
|----------------|-------------|
| Prediction Task | Multi-label classification of 217 technical skills from GitHub issue text |
| Decisions | Automated developer assignment based on predicted skill requirements |
| Value Proposition | Reduced issue resolution time, optimized resource allocation |
| Data Sources | SkillScope DB (7,245 PRs from 11 Java repositories) |
| Making Predictions | Real-time classification upon issue creation |
| Building Models | Iterative improvement over the RF+TF-IDF baseline |
| Monitoring | Continuous evaluation with drift detection |

The complete ML Canvas is documented in `ML Canvas.md`.

### Functional vs Non-Functional Requirements

#### Functional Requirements

| Requirement | Target | Metric |
|-------------|--------|--------|
| Precision | ≥ Baseline | True positives / Predicted positives |
| Recall | ≥ Baseline | True positives / Actual positives |
| Micro-F1 | > Baseline | Harmonic mean of pooled precision and recall across all labels |
| Multi-label Support | 217 skills | Simultaneous prediction of multiple labels |
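
Micro-F1 pools true positives, false positives, and false negatives across all labels before taking the harmonic mean of the pooled precision and recall. A small self-contained sketch on toy label matrices (not project code):

```python
def micro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Micro-averaged F1 over binary multi-label matrices (rows = samples)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t and p          # label predicted and actually present
            fp += (not t) and p    # label predicted but absent
            fn += t and (not p)    # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Pooling the counts first is what lets frequent skills dominate the score, which is why the baseline comparison uses micro rather than macro averaging.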

#### Non-Functional Requirements

| Category | Requirement | Implementation |
|----------|-------------|----------------|
| Reproducibility | Auditable experiments | MLflow tracking, DVC versioning |
| Explainability | Interpretable predictions | Confidence scores per skill |
| Performance | Low-latency inference | FastAPI async, model caching |
| Scalability | Batch processing | `/predict/batch` endpoint (max 100) |
| Maintainability | Clean code | Ruff linting, type hints, docstrings |

### System-First vs Model-First Development

The project adopted a System-First approach, prioritizing infrastructure and pipeline development before model optimization:

**Timeline:**

```
┌─────────────────────────┬───────────────────────────────────┐
│ Phase 1: Infrastructure │ Phase 2: Model Development        │
│ - DVC/MLflow setup      │ - Feature engineering             │
│ - CI/CD pipeline        │ - Hyperparameter tuning           │
│ - Docker containers     │ - SMOTE/ADASYN experiments        │
│ - API skeleton          │ - Performance optimization        │
└─────────────────────────┴───────────────────────────────────┘
```

**Rationale:**

- Enables rapid iteration once infrastructure is stable
- Ensures reproducibility from day one
- Reduces technical debt during model development
- Facilitates team collaboration with shared tooling

## 2. Reproducibility (Versioning & Pipelines)

### Code Versioning (Git)

Standard Git workflow with branch protection:

| Branch | Purpose |
|--------|---------|
| `main` | Production-ready code |
| `feature/*` | New development |
| `milestone/*` | Grouping all features before merging into `main` |

### Data & Model Versioning (DVC)

**Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management.

```
.dvc/config
├── remote: origin
├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
└── auth: basic (credentials via environment)
```

**Tracked Artifacts:**

| File | Purpose |
|------|---------|
| `data/raw/skillscope_data.db` | Original SQLite database |
| `data/processed/*.npy` | TF-IDF and embedding features |
| `models/*.pkl` | Trained models and vectorizers |

**Versioning Workflow:**

```bash
# Track new data
dvc add data/raw/new_dataset.db
git add data/raw/.gitignore data/raw/new_dataset.db.dvc

# Push to remote
dvc push
git commit -m "Add new dataset version"
git push
```

### Experiment Tracking (MLflow)

**Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking.

| Configuration | Value |
|---------------|-------|
| Tracking URI | https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow |
| Experiments | `skill_classification`, `skill_prediction_api` |

**Logged Metrics:**

- Training: precision, recall, F1-score, training time
- Inference: prediction latency, confidence scores, timestamps

**Artifact Storage:**

- Model binaries (`.pkl`)
- Vectorizers and scalers
- Hyperparameter configurations

### Auditable ML Pipeline

The pipeline is designed for complete reproducibility:

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  dataset.py  │───▶│ features.py  │───▶│   train.py   │
│  (DVC pull)  │    │  (TF-IDF)    │    │  (MLflow)    │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   .dvc files          .dvc files          MLflow Run
```
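
A pipeline like this is typically declared in a `dvc.yaml` so that `dvc repro` re-runs only the stages whose dependencies changed. An illustrative sketch (script paths and intermediate file names are assumptions, not taken from the repo):

```yaml
# Hypothetical dvc.yaml mirroring the three-stage pipeline above
stages:
  dataset:
    cmd: python dataset.py
    deps: [data/raw/skillscope_data.db]
    outs: [data/interim/issues.parquet]
  features:
    cmd: python features.py
    deps: [data/interim/issues.parquet]
    outs: [data/processed/features.npy]
  train:
    cmd: python train.py
    deps: [data/processed/features.npy]
    outs: [models/classifier.pkl]
```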

## 3. Quality Assurance

### Testing Strategy

#### Static Analysis (Ruff)

**Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage.

| Configuration | Value |
|---------------|-------|
| Line Length | 88 (Black compatible) |
| Target | Python 3.10+ |
| Rule Sets | PEP 8, isort, pyflakes |

**CI Integration:**

```yaml
- name: Lint with Ruff
  run: make lint
```

#### Dynamic Testing (Pytest)

**Test Organization:**

```
tests/
├── unit/               # Isolated function tests
├── integration/        # Component interaction tests
├── system/             # End-to-end tests
├── behavioral/         # ML-specific tests
├── deepchecks/         # Data validation
└── great expectations/ # Schema validation
```

**Markers for Selective Execution:**

```python
@pytest.mark.unit
@pytest.mark.integration
@pytest.mark.system
@pytest.mark.slow
```
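
A marked test might look like the following sketch (the helper under test is hypothetical); `pytest -m unit` then selects only unit-marked tests:

```python
import pytest

def normalize_issue_text(text: str) -> str:
    """Hypothetical helper: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

@pytest.mark.unit
def test_normalize_issue_text():
    # Runs only when selected, e.g. `pytest -m unit` or excluded via `-m "not slow"`
    assert normalize_issue_text("Fix  NullPointerException") == "fix nullpointerexception"
```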

#### Model Validation vs Model Verification

| Concept | Definition | Implementation |
|---------|------------|----------------|
| Validation | Does the model fit user needs? | Micro-F1 vs baseline comparison |
| Verification | Is the model correctly built? | Unit tests, behavioral tests |

#### Behavioral Testing

**Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.

| Test Type | Count | Purpose |
|-----------|-------|---------|
| Invariance | 9 | Stability under perturbations (typos, case changes) |
| Directional | 10 | Expected behavior with keyword additions |
| Minimum Functionality | 17 | Basic sanity checks on clear examples |

**Example Invariance Test:**

```python
def test_case_insensitivity():
    """Model should predict same skills regardless of case."""
    assert predict("Fix BUG") == predict("fix bug")
```

### Data Quality Checks

#### Great Expectations (10 Tests)

**Design Decision:** Validate data at pipeline boundaries to catch quality issues early.

| Validation Point | Tests |
|------------------|-------|
| Raw Database | Schema, row count, required columns |
| Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility |
| Label Matrix | Binary format, distribution, consistency |
| Train/Test Split | No leakage, stratification |
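
To illustrate what the Feature and Label Matrix checks validate, here is a plain-NumPy sketch; the function and thresholds are ours for illustration, not the actual Great Expectations suite:

```python
import numpy as np

def validate_matrices(X: np.ndarray, Y: np.ndarray) -> list[str]:
    """Plain-NumPy analogues of the Feature/Label Matrix checks (illustrative)."""
    failures = []
    if not np.isfinite(X).all():                   # No NaN/Inf in features
        failures.append("features contain NaN/Inf")
    sparsity = 1.0 - np.count_nonzero(X) / X.size  # TF-IDF matrices are mostly zeros
    if sparsity < 0.5:
        failures.append(f"unexpectedly dense features (sparsity={sparsity:.2f})")
    if not np.isin(Y, [0, 1]).all():               # Labels must be binary
        failures.append("labels are not binary")
    if X.shape[0] != Y.shape[0]:                   # Feature/label row counts must match
        failures.append("feature/label row mismatch")
    return failures
```

An empty return value means the matrices pass; anything else names the failed check, which is the shape of report a validation suite produces.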

#### Deepchecks (24 Checks)

**Suites:**

- Data Integrity Suite (12 checks): Duplicates, nulls, correlations
- Train-Test Validation Suite (12 checks): Leakage, drift, distribution

**Status:** Production-ready (96% overall score)


## 4. API (Inference Service)

### FastAPI Implementation

**Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.

**Key Features:**

- Async lifespan management for model loading
- Middleware for Prometheus metrics collection
- Structured exception handling
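
Lifespan management loads the model once at startup and releases it at shutdown via an async context manager. A minimal standard-library sketch of the pattern (`load_model` is a stub and `MODELS` a hypothetical registry; the FastAPI wiring appears only as a comment):

```python
import asyncio
from contextlib import asynccontextmanager

MODELS: dict[str, object] = {}  # hypothetical registry populated once at startup

def load_model() -> object:
    """Stub for the real pickle/DVC model load."""
    return "classifier-v1"

@asynccontextmanager
async def lifespan(app=None):
    MODELS["classifier"] = load_model()  # startup: load before serving requests
    try:
        yield
    finally:
        MODELS.clear()                   # shutdown: release resources

# With FastAPI this would be wired as: app = FastAPI(lifespan=lifespan)

async def demo() -> bool:
    async with lifespan():
        return "classifier" in MODELS    # model available while serving
```

Loading once in the lifespan avoids re-reading the pickle on every request, which is the "model caching" referenced under the non-functional requirements.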

### RESTful Principles

**Design Decision:** Follow REST best practices for intuitive API design.

| Principle | Implementation |
|-----------|----------------|
| Nouns, not verbs | `/predictions` instead of `/getPrediction` |
| Plural resources | `/predictions`, `/issues` |
| HTTP methods | GET (retrieve), POST (create) |
| Status codes | 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) |

**Endpoint Design:**

| Method | Endpoint | Action |
|--------|----------|--------|
| POST | `/predict` | Create new prediction |
| POST | `/predict/batch` | Create batch predictions |
| GET | `/predictions` | List predictions |
| GET | `/predictions/{run_id}` | Get specific prediction |

### OpenAPI/Swagger Documentation

Auto-generated documentation at runtime:

- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`

**Pydantic Models for Schema Enforcement:**

```python
from typing import List, Optional

from pydantic import BaseModel

class IssueInput(BaseModel):
    issue_text: str
    repo_name: Optional[str] = None
    pr_number: Optional[int] = None

class PredictionResponse(BaseModel):
    run_id: str
    predictions: List[SkillPrediction]  # SkillPrediction is defined alongside these models
    model_version: str
```

## 5. Deployment (Containerization & CI/CD)

### Docker Containerization

**Design Decision:** Multi-stage Docker builds with security best practices.

**Dockerfile Features:**

- Python 3.10 slim base image (minimal footprint)
- Non-root user for security
- DVC integration for model pulling
- Health check endpoint configuration

**Multi-Service Architecture:**

```
docker-compose.yml
├── hopcroft-api (FastAPI)
│   ├── Port: 8080
│   ├── Volumes: source code, logs
│   └── Health check: /health
│
├── hopcroft-gui (Streamlit)
│   ├── Port: 8501
│   ├── Depends on: hopcroft-api
│   └── Environment: API_BASE_URL
│
└── hopcroft-net (Bridge network)
```
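
A sketch of how this layout could be expressed in `docker-compose.yml` (build contexts and the health-check command are assumptions; service names, ports, and the `API_BASE_URL` variable come from the architecture above):

```yaml
# Illustrative compose sketch, not the project's actual file
services:
  hopcroft-api:
    build: .
    ports: ["8080:8080"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
    networks: [hopcroft-net]

  hopcroft-gui:
    build: ./gui
    ports: ["8501:8501"]
    environment:
      API_BASE_URL: http://hopcroft-api:8080
    depends_on:
      hopcroft-api:
        condition: service_healthy   # health-based dependency management
    networks: [hopcroft-net]

networks:
  hopcroft-net:
    driver: bridge
```

The `service_healthy` condition is what delays the GUI until the API's `/health` endpoint responds.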

**Design Rationale:**

- Separation of concerns (API vs GUI)
- Independent scaling
- Health-based dependency management
- Shared network for internal communication

### CI/CD Pipeline (GitHub Actions)

**Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.

**Pipeline Stages:**

```yaml
jobs:
  unit-tests:
    - Checkout code
    - Setup Python 3.10
    - Install dependencies
    - Ruff linting
    - Pytest unit tests
    - Upload test report (on failure)

  build-image:
    - Needs: unit-tests
    - Configure DVC credentials
    - Pull models
    - Build Docker image
```

**Triggers:**

- Push to `main`, `feature/*`
- Pull requests to `main`

**Secrets Management:**

- `DAGSHUB_USERNAME`: DagsHub authentication
- `DAGSHUB_TOKEN`: DagsHub access token

### Hugging Face Spaces Hosting

**Design Decision:** Deploy on HF Spaces for free Docker-SDK hosting.

**Configuration:**

```yaml
title: Hopcroft Skill Classification
sdk: docker
app_port: 7860
```


**Startup Flow:**
1. `start_space.sh` configures DVC credentials
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx (port 7860) for routing

**Nginx Reverse Proxy:**
- `/` → Streamlit GUI
- `/docs`, `/predict`, `/predictions` → FastAPI
- `/prometheus` → Prometheus metrics
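
That routing might look like the following Nginx sketch (internal ports taken from the startup flow; directives are illustrative, not the Space's actual config):

```nginx
# Hypothetical routing for the Space entry port
server {
    listen 7860;

    location / {                      # Streamlit GUI (default route)
        proxy_pass http://127.0.0.1:8501;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # Streamlit relies on WebSockets
        proxy_set_header Connection "upgrade";
    }

    location ~ ^/(docs|openapi.json|predict|predictions) {  # FastAPI routes
        proxy_pass http://127.0.0.1:8000;
    }

    location /prometheus/ {           # Prometheus UI/metrics
        proxy_pass http://127.0.0.1:9090/;
    }
}
```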

---

## 6. Monitoring

### Resource-Level Monitoring

**Design Decision:** Implement Prometheus metrics for real-time observability.

| Metric | Type | Purpose |
|--------|------|---------|
| `hopcroft_requests_total` | Counter | Request volume by endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) |
| `hopcroft_in_progress_requests` | Gauge | Concurrent request load |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |

**Middleware Implementation:**
```python
@app.middleware("http")
async def monitor_requests(request, call_next):
    method, endpoint = request.method, request.url.path
    IN_PROGRESS.inc()                                          # Gauge: concurrent requests
    try:
        with REQUEST_LATENCY.labels(method, endpoint).time():  # Histogram timer
            response = await call_next(request)
    finally:
        IN_PROGRESS.dec()                                      # decrement even on error
    REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()  # Counter
    return response
```

### Performance-Level Monitoring

**Model Staleness Indicators:**

- Prediction confidence trends over time
- Drift detection alerts
- Error rate monitoring

### Drift Detection Strategy

**Design Decision:** Implement statistical drift detection using the Kolmogorov-Smirnov test with Bonferroni correction.

| Component | Details |
|-----------|---------|
| Algorithm | KS two-sample test |
| Baseline | 1000 samples from training data |
| Threshold | p-value < 0.05 (Bonferroni corrected) |
| Execution | Scheduled via cron or manual trigger |

**Drift Types Monitored:**

| Type | Definition | Detection Method |
|------|------------|------------------|
| Data Drift | Feature distribution shift | KS test on input features |
| Target Drift | Label distribution shift | Chi-square test on predictions |
| Concept Drift | Relationship change | Performance degradation monitoring |

**Metrics Published to Pushgateway:**

- `drift_detected`: Binary indicator (0/1)
- `drift_p_value`: Statistical significance
- `drift_distance`: KS distance metric
- `drift_check_timestamp`: Last check time

### Alerting Configuration

**Prometheus Alert Rules:**

| Alert | Condition | Severity |
|-------|-----------|----------|
| `ServiceDown` | Target down for 5m | Critical |
| `HighErrorRate` | 5xx rate > 10% | Warning |
| `SlowRequests` | P95 latency > 2s | Warning |
| `DriftDetected` | `drift_detected = 1` | Warning |
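
Two of these rules, sketched as a Prometheus rule file (the `up` expression and label names are assumptions; `drift_detected` is the Pushgateway metric named above):

```yaml
# Illustrative rule file, not the project's actual alerts.yml
groups:
  - name: hopcroft-alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m                  # target must be down for 5 minutes
        labels:
          severity: critical
      - alert: DriftDetected
        expr: drift_detected == 1
        labels:
          severity: warning
```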

**Alertmanager Integration:**

- Severity-based routing
- Email notifications
- Inhibition rules to prevent alert storms

### Grafana Visualization

**Dashboard Panels:**

  1. Request Rate (gauge)
  2. Request Latency p50/p95 (time series)
  3. In-Progress Requests (stat panel)
  4. Error Rate 5xx (stat panel)
  5. Model Prediction Time (time series)
  6. Requests by Endpoint (bar chart)

**Data Sources:**

- Prometheus: Real-time metrics
- Pushgateway: Batch job metrics (drift detection)

### HF Spaces Deployment

Both Prometheus and Grafana are deployed on Hugging Face Spaces via Nginx reverse proxy:

| Service | Production URL |
|---------|----------------|
| Prometheus | https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/ |
| Grafana | https://dacrow13-hopcroft-skill-classification.hf.space/grafana/ |

This enables real-time monitoring of the production deployment without additional infrastructure.