# Design Choices
This document provides the technical justification for the architectural and engineering decisions made during development of the Hopcroft project, following professional MLOps and software engineering standards.
---
## Table of Contents
1. [Inception (Requirements Engineering)](#1-inception-requirements-engineering)
2. [Reproducibility (Versioning & Pipelines)](#2-reproducibility-versioning--pipelines)
3. [Quality Assurance](#3-quality-assurance)
4. [API (Inference Service)](#4-api-inference-service)
5. [Deployment (Containerization & CI/CD)](#5-deployment-containerization--cicd)
6. [Monitoring](#6-monitoring)
---
## 1. Inception (Requirements Engineering)
### Machine Learning Canvas
The project adopted the **Machine Learning Canvas** framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.
| Canvas Section | Application |
|----------------|-------------|
| **Prediction Task** | Multi-label classification of 217 technical skills from GitHub issue text |
| **Decisions** | Automated developer assignment based on predicted skill requirements |
| **Value Proposition** | Reduced issue resolution time, optimized resource allocation |
| **Data Sources** | SkillScope DB (7,245 PRs from 11 Java repositories) |
| **Making Predictions** | Real-time classification upon issue creation |
| **Building Models** | Iterative improvement over RF+TF-IDF baseline |
| **Monitoring** | Continuous evaluation with drift detection |
The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md).
### Functional vs Non-Functional Requirements
#### Functional Requirements
| Requirement | Target | Metric |
|-------------|--------|--------|
| **Precision** | ≥ Baseline | True positives / Predicted positives |
| **Recall** | ≥ Baseline | True positives / Actual positives |
| **Micro-F1** | > Baseline | Harmonic mean of micro-averaged precision and recall |
| **Multi-label Support** | 217 skills | Simultaneous prediction of multiple labels |
#### Non-Functional Requirements
| Category | Requirement | Implementation |
|----------|-------------|----------------|
| **Reproducibility** | Auditable experiments | MLflow tracking, DVC versioning |
| **Explainability** | Interpretable predictions | Confidence scores per skill |
| **Performance** | Low latency inference | FastAPI async, model caching |
| **Scalability** | Batch processing | `/predict/batch` endpoint (max 100) |
| **Maintainability** | Clean code | Ruff linting, type hints, docstrings |
### System-First vs Model-First Development
The project adopted a **System-First** approach, prioritizing infrastructure and pipeline development before model optimization:
```
Timeline:
┌──────────────────────────┬─────────────────────────────┐
│ Phase 1: Infrastructure  │ Phase 2: Model Development  │
│ - DVC/MLflow setup       │ - Feature engineering       │
│ - CI/CD pipeline         │ - Hyperparameter tuning     │
│ - Docker containers      │ - SMOTE/ADASYN experiments  │
│ - API skeleton           │ - Performance optimization  │
└──────────────────────────┴─────────────────────────────┘
```
**Rationale:**
- Enables rapid iteration once infrastructure is stable
- Ensures reproducibility from day one
- Reduces technical debt during model development
- Facilitates team collaboration with shared tooling
---
## 2. Reproducibility (Versioning & Pipelines)
### Code Versioning (Git)
Standard Git workflow with branch protection:
| Branch | Purpose |
|--------|---------|
| `main` | Production-ready code |
| `feature/*` | New development |
| `milestone/*` | Aggregates feature branches before merging into `main` |
### Data & Model Versioning (DVC)
**Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management.
```
.dvc/config
├── remote: origin
├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
└── auth: basic (credentials via environment)
```
**Tracked Artifacts:**
| File | Purpose |
|------|---------|
| `data/raw/skillscope_data.db` | Original SQLite database |
| `data/processed/*.npy` | TF-IDF and embedding features |
| `models/*.pkl` | Trained models and vectorizers |
**Versioning Workflow:**
```bash
# Track new data
dvc add data/raw/new_dataset.db
git add data/raw/.gitignore data/raw/new_dataset.db.dvc
# Push to remote
dvc push
git commit -m "Add new dataset version"
git push
```
### Experiment Tracking (MLflow)
**Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking.
| Configuration | Value |
|---------------|-------|
| Tracking URI | `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow` |
| Experiments | `skill_classification`, `skill_prediction_api` |
**Logged Metrics:**
- Training: precision, recall, F1-score, training time
- Inference: prediction latency, confidence scores, timestamps
**Artifact Storage:**
- Model binaries (`.pkl`)
- Vectorizers and scalers
- Hyperparameter configurations
### Auditable ML Pipeline
The pipeline is designed for complete reproducibility:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  dataset.py  │────▶│ features.py  │────▶│   train.py   │
│  (DVC pull)  │     │   (TF-IDF)   │     │   (MLflow)   │
└──────────────┘     └──────────────┘     └──────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
  .dvc files           .dvc files           MLflow Run
```
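The three-stage chain above can be expressed as a DVC pipeline so that `dvc repro` rebuilds only what changed. A `dvc.yaml` sketch (stage commands, dependency paths, and output paths are assumptions based on the script names shown):

```yaml
stages:
  dataset:
    cmd: python hopcroft/dataset.py
    deps: [hopcroft/dataset.py, data/raw/skillscope_data.db]
    outs: [data/interim/issues.parquet]
  features:
    cmd: python hopcroft/features.py
    deps: [hopcroft/features.py, data/interim/issues.parquet]
    outs: [data/processed/X_tfidf.npy, data/processed/y_labels.npy]
  train:
    cmd: python hopcroft/train.py
    deps: [hopcroft/train.py, data/processed/X_tfidf.npy, data/processed/y_labels.npy]
    outs: [models/model.pkl]
```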
---
## 3. Quality Assurance
### Testing Strategy
#### Static Analysis (Ruff)
**Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage.
| Configuration | Value |
|---------------|-------|
| Line Length | 88 (Black compatible) |
| Target Python | 3.10+ |
| Rule Sets | PEP 8, isort, pyflakes |
**CI Integration:**
```yaml
- name: Lint with Ruff
run: make lint
```
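The configuration in the table translates to a `pyproject.toml` section roughly like the following (a sketch; the rule selectors `E`, `F`, and `I` are Ruff's codes for pycodestyle/PEP 8, pyflakes, and isort):

```toml
[tool.ruff]
line-length = 88          # Black-compatible
target-version = "py310"

[tool.ruff.lint]
select = ["E", "F", "I"]  # PEP 8 (pycodestyle), pyflakes, isort
```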
#### Dynamic Testing (Pytest)
**Test Organization:**
```
tests/
├── unit/                  # Isolated function tests
├── integration/           # Component interaction tests
├── system/                # End-to-end tests
├── behavioral/            # ML-specific tests
├── deepchecks/            # Data validation
└── great expectations/    # Schema validation
```
**Markers for Selective Execution:**
```python
@pytest.mark.unit
@pytest.mark.integration
@pytest.mark.system
@pytest.mark.slow
```
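For these markers to run without warnings, pytest requires them to be registered; a registration sketch, assuming pytest is configured through `pyproject.toml`:

```toml
[tool.pytest.ini_options]
markers = [
    "unit: isolated function tests",
    "integration: component interaction tests",
    "system: end-to-end tests",
    "slow: long-running tests, excluded from quick runs",
]
```

With this in place, selective execution works via `-m` expressions, e.g. `pytest -m "unit and not slow"` for a fast CI pass.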
### Model Validation vs Model Verification
| Concept | Definition | Implementation |
|---------|------------|----------------|
| **Validation** | Does the model fit user needs? | Micro-F1 vs baseline comparison |
| **Verification** | Is the model correctly built? | Unit tests, behavioral tests |
### Behavioral Testing
**Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.
| Test Type | Count | Purpose |
|-----------|-------|---------|
| **Invariance** | 9 | Stability under perturbations (typos, case changes) |
| **Directional** | 10 | Expected behavior with keyword additions |
| **Minimum Functionality** | 17 | Basic sanity checks on clear examples |
**Example Invariance Test:**
```python
def test_case_insensitivity():
    """Model should predict the same skills regardless of case."""
    # `predict` is the project's inference helper, returning predicted skill labels
    assert predict("Fix BUG") == predict("fix bug")
```
### Data Quality Checks
#### Great Expectations (10 Tests)
**Design Decision:** Validate data at pipeline boundaries to catch quality issues early.
| Validation Point | Tests |
|------------------|-------|
| Raw Database | Schema, row count, required columns |
| Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility |
| Label Matrix | Binary format, distribution, consistency |
| Train/Test Split | No leakage, stratification |
#### Deepchecks (24 Checks)
**Suites:**
- **Data Integrity Suite** (12 checks): Duplicates, nulls, correlations
- **Train-Test Validation Suite** (12 checks): Leakage, drift, distribution
**Status:** Production-ready (96% overall score)
---
## 4. API (Inference Service)
### FastAPI Implementation
**Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.
**Key Features:**
- Async lifespan management for model loading
- Middleware for Prometheus metrics collection
- Structured exception handling
### RESTful Principles
**Design Decision:** Follow REST best practices for intuitive API design.
| Principle | Implementation |
|-----------|----------------|
| **Nouns, not verbs** | `/predictions` instead of `/getPrediction` |
| **Plural resources** | `/predictions`, `/issues` |
| **HTTP methods** | GET (retrieve), POST (create) |
| **Status codes** | 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) |
**Endpoint Design:**
| Method | Endpoint | Action |
|--------|----------|--------|
| `POST` | `/predict` | Create new prediction |
| `POST` | `/predict/batch` | Create batch predictions |
| `GET` | `/predictions` | List predictions |
| `GET` | `/predictions/{run_id}` | Get specific prediction |
### OpenAPI/Swagger Documentation
**Auto-generated documentation at runtime:**
- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`
**Pydantic Models for Schema Enforcement:**
```python
from typing import List, Optional

from pydantic import BaseModel

class IssueInput(BaseModel):
    issue_text: str
    repo_name: Optional[str] = None
    pr_number: Optional[int] = None

class PredictionResponse(BaseModel):
    run_id: str
    predictions: List[SkillPrediction]  # SkillPrediction: per-skill label + confidence
    model_version: str
```
---
## 5. Deployment (Containerization & CI/CD)
### Docker Containerization
**Design Decision:** Multi-stage Docker builds with security best practices.
**Dockerfile Features:**
- Python 3.10 slim base image (minimal footprint)
- Non-root user for security
- DVC integration for model pulling
- Health check endpoint configuration
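Put together, these features suggest a Dockerfile along the following lines (a sketch: stage names, file paths, the uvicorn module path, and the DVC pull step are assumptions, not the project's actual build file):

```dockerfile
# Build stage: install dependencies and pull model artifacts via DVC
FROM python:3.10-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN dvc pull models   # credentials injected at build time

# Runtime stage: minimal footprint, non-root user
FROM python:3.10-slim
WORKDIR /app
RUN useradd --create-home appuser
COPY --from=build /usr/local/lib/python3.10/site-packages /usr/local/lib/python3.10/site-packages
COPY --from=build /usr/local/bin /usr/local/bin
COPY --from=build /app /app
USER appuser
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"
CMD ["uvicorn", "hopcroft.api.main:app", "--host", "0.0.0.0", "--port", "8080"]
```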
**Multi-Service Architecture:**
```
docker-compose.yml
├── hopcroft-api (FastAPI)
│   ├── Port: 8080
│   ├── Volumes: source code, logs
│   └── Health check: /health
│
├── hopcroft-gui (Streamlit)
│   ├── Port: 8501
│   ├── Depends on: hopcroft-api
│   └── Environment: API_BASE_URL
│
└── hopcroft-net (Bridge network)
```
**Design Rationale:**
- Separation of concerns (API vs GUI)
- Independent scaling
- Health-based dependency management
- Shared network for internal communication
### CI/CD Pipeline (GitHub Actions)
**Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.
**Pipeline Stages:**
```yaml
Jobs:
unit-tests:
- Checkout code
- Setup Python 3.10
- Install dependencies
- Ruff linting
- Pytest unit tests
- Upload test report (on failure)
build-image:
- Needs: unit-tests
- Configure DVC credentials
- Pull models
- Build Docker image
```
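The stage outline above corresponds to a GitHub Actions workflow roughly like this (a sketch; the `make lint` target comes from the QA section, while the test-report path and DVC install step are assumptions):

```yaml
name: CI
on:
  push:
    branches: [main, "feature/*"]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - name: Lint with Ruff
        run: make lint
      - name: Run unit tests
        run: pytest -m unit
      - name: Upload test report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: test-report
          path: report.html

  build-image:
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull models via DVC
        env:
          DAGSHUB_USERNAME: ${{ secrets.DAGSHUB_USERNAME }}
          DAGSHUB_TOKEN: ${{ secrets.DAGSHUB_TOKEN }}
        run: |
          pip install dvc
          dvc pull models
      - run: docker build -t hopcroft-api .
```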
**Triggers:**
- Push to `main`, `feature/*`
- Pull requests to `main`
**Secrets Management:**
- `DAGSHUB_USERNAME`: DagsHub authentication
- `DAGSHUB_TOKEN`: DagsHub access token
### Hugging Face Spaces Hosting
**Design Decision:** Deploy on HF Spaces for free GPU-enabled hosting with Docker SDK support.
**Configuration:**
```yaml
---
title: Hopcroft Skill Classification
sdk: docker
app_port: 7860
---
```
**Startup Flow:**
1. `start_space.sh` configures DVC credentials
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx (port 7860) for routing
**Nginx Reverse Proxy:**
- `/` → Streamlit GUI
- `/docs`, `/predict`, `/predictions` → FastAPI
- `/prometheus` → Prometheus metrics
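These routing rules map to Nginx `location` blocks along the following lines (a sketch of the server-block interior; the upstream ports match the startup flow above, and the exact regex is an assumption):

```nginx
# Route by path prefix to the internal services
location / {
    proxy_pass http://127.0.0.1:8501;   # Streamlit GUI
}
location ~ ^/(docs|predict|predictions) {
    proxy_pass http://127.0.0.1:8000;   # FastAPI
}
location /prometheus/ {
    proxy_pass http://127.0.0.1:9090/;  # Prometheus
}
```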
---
## 6. Monitoring
### Resource-Level Monitoring
**Design Decision:** Implement Prometheus metrics for real-time observability.
| Metric | Type | Purpose |
|--------|------|---------|
| `hopcroft_requests_total` | Counter | Request volume by endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) |
| `hopcroft_in_progress_requests` | Gauge | Concurrent request load |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |
**Middleware Implementation:**
```python
# Metric objects are the prometheus_client instruments listed in the table above
@app.middleware("http")
async def monitor_requests(request, call_next):
    method, endpoint = request.method, request.url.path
    IN_PROGRESS.inc()
    try:
        with REQUEST_LATENCY.labels(method, endpoint).time():
            response = await call_next(request)
    finally:
        IN_PROGRESS.dec()
    REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()
    return response
```
### Performance-Level Monitoring
**Model Staleness Indicators:**
- Prediction confidence trends over time
- Drift detection alerts
- Error rate monitoring
### Drift Detection Strategy
**Design Decision:** Implement statistical drift detection using Kolmogorov-Smirnov test with Bonferroni correction.
| Component | Details |
|-----------|---------|
| **Algorithm** | KS Two-Sample Test |
| **Baseline** | 1000 samples from training data |
| **Threshold** | p-value < 0.05 (Bonferroni corrected) |
| **Execution** | Scheduled via cron or manual trigger |
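The per-feature check can be sketched as follows: compute the two-sample KS statistic for each feature and flag drift when it exceeds the asymptotic critical distance at the Bonferroni-corrected significance level. This is a self-contained illustration of the technique, not the project's actual implementation:

```python
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max distance between ECDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def detect_drift(baseline: np.ndarray, current: np.ndarray, alpha: float = 0.05):
    """Return a per-feature drift flag at the Bonferroni-corrected level."""
    n_features = baseline.shape[1]
    corrected = alpha / n_features                # Bonferroni correction
    n, m = len(baseline), len(current)
    # Asymptotic critical distance for the two-sample KS test
    d_crit = np.sqrt(-np.log(corrected / 2) / 2) * np.sqrt((n + m) / (n * m))
    return [ks_statistic(baseline[:, j], current[:, j]) > d_crit
            for j in range(n_features)]
```

With a 1000-sample baseline, a feature whose distribution shifts noticeably exceeds `d_crit` and would set the `drift_detected` indicator pushed to the Pushgateway.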
**Drift Types Monitored:**
| Type | Definition | Detection Method |
|------|------------|------------------|
| **Data Drift** | Feature distribution shift | KS test on input features |
| **Target Drift** | Label distribution shift | Chi-square test on predictions |
| **Concept Drift** | Relationship change | Performance degradation monitoring |
**Metrics Published to Pushgateway:**
- `drift_detected`: Binary indicator (0/1)
- `drift_p_value`: Statistical significance
- `drift_distance`: KS distance metric
- `drift_check_timestamp`: Last check time
### Alerting Configuration
**Prometheus Alert Rules:**
| Alert | Condition | Severity |
|-------|-----------|----------|
| `ServiceDown` | Target down for 5m | Critical |
| `HighErrorRate` | 5xx rate > 10% | Warning |
| `SlowRequests` | P95 latency > 2s | Warning |
| `DriftDetected` | drift_detected = 1 | Warning |
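In Prometheus rule-file form, the table above corresponds to entries such as the following (a sketch; the metric names follow this section, but the exact PromQL expressions are assumptions):

```yaml
groups:
  - name: hopcroft-alerts
    rules:
      - alert: ServiceDown
        expr: up{job="hopcroft-api"} == 0
        for: 5m
        labels: {severity: critical}
      - alert: HighErrorRate
        expr: >
          sum(rate(hopcroft_requests_total{status=~"5.."}[5m]))
          / sum(rate(hopcroft_requests_total[5m])) > 0.10
        labels: {severity: warning}
      - alert: SlowRequests
        expr: >
          histogram_quantile(0.95,
            sum(rate(hopcroft_request_duration_seconds_bucket[5m])) by (le)) > 2
        labels: {severity: warning}
      - alert: DriftDetected
        expr: drift_detected == 1
        labels: {severity: warning}
```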
**Alertmanager Integration:**
- Severity-based routing
- Email notifications
- Inhibition rules to prevent alert storms
### Grafana Visualization
**Dashboard Panels:**
1. Request Rate (gauge)
2. Request Latency p50/p95 (time series)
3. In-Progress Requests (stat panel)
4. Error Rate 5xx (stat panel)
5. Model Prediction Time (time series)
6. Requests by Endpoint (bar chart)
**Data Sources:**
- Prometheus: Real-time metrics
- Pushgateway: Batch job metrics (drift detection)
### HF Spaces Deployment
Both Prometheus and Grafana are exposed on Hugging Face Spaces through the Nginx reverse proxy:
| Service | Production URL |
|---------|----------------|
| Prometheus | `https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/` |
| Grafana | `https://dacrow13-hopcroft-skill-classification.hf.space/grafana/` |
This enables real-time monitoring of the production deployment without additional infrastructure.