
# Design Choices

Technical justification of the architectural and engineering decisions made during development of the Hopcroft project, following professional MLOps and software engineering standards.


## Table of Contents

  1. Inception (Requirements Engineering)
  2. Reproducibility (Versioning & Pipelines)
  3. Quality Assurance
  4. API (Inference Service)
  5. Deployment (Containerization & CI/CD)
  6. Monitoring

## 1. Inception (Requirements Engineering)

### Machine Learning Canvas

The project adopted the Machine Learning Canvas framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.

| Canvas Section | Application |
|----------------|-------------|
| Prediction Task | Multi-label classification of 217 technical skills from GitHub issue text |
| Decisions | Automated developer assignment based on predicted skill requirements |
| Value Proposition | Reduced issue resolution time, optimized resource allocation |
| Data Sources | SkillScope DB (7,245 PRs from 11 Java repositories) |
| Making Predictions | Real-time classification upon issue creation |
| Building Models | Iterative improvement over the RF+TF-IDF baseline |
| Monitoring | Continuous evaluation with drift detection |

The complete ML Canvas is documented in `ML Canvas.md`.

### Functional vs Non-Functional Requirements

#### Functional Requirements

| Requirement | Target | Metric |
|-------------|--------|--------|
| Precision | ≥ Baseline | True positives / Predicted positives |
| Recall | ≥ Baseline | True positives / Actual positives |
| Micro-F1 | > Baseline | Harmonic mean of pooled precision and recall across all labels |
| Multi-label Support | 217 skills | Simultaneous prediction of multiple labels |
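
Micro-F1 pools true positives, false positives, and false negatives across all labels before taking the harmonic mean of the pooled precision and recall. A small self-contained sketch on toy label matrices (not project code):

```python
def micro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Micro-averaged F1 over binary multi-label matrices (rows = samples)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t and p          # label predicted and actually present
            fp += (not t) and p    # label predicted but absent
            fn += t and (not p)    # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Pooling the counts first is what lets frequent skills dominate the score, which is why the baseline comparison uses micro rather than macro averaging.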

#### Non-Functional Requirements

| Category | Requirement | Implementation |
|----------|-------------|----------------|
| Reproducibility | Auditable experiments | MLflow tracking, DVC versioning |
| Explainability | Interpretable predictions | Confidence scores per skill |
| Performance | Low-latency inference | FastAPI async, model caching |
| Scalability | Batch processing | `/predict/batch` endpoint (max 100) |
| Maintainability | Clean code | Ruff linting, type hints, docstrings |

### System-First vs Model-First Development

The project adopted a System-First approach, prioritizing infrastructure and pipeline development before model optimization:

**Timeline:**

```
┌─────────────────────────┬───────────────────────────────────┐
│ Phase 1: Infrastructure │ Phase 2: Model Development        │
│ - DVC/MLflow setup      │ - Feature engineering             │
│ - CI/CD pipeline        │ - Hyperparameter tuning           │
│ - Docker containers     │ - SMOTE/ADASYN experiments        │
│ - API skeleton          │ - Performance optimization        │
└─────────────────────────┴───────────────────────────────────┘
```

**Rationale:**

- Enables rapid iteration once infrastructure is stable
- Ensures reproducibility from day one
- Reduces technical debt during model development
- Facilitates team collaboration with shared tooling

## 2. Reproducibility (Versioning & Pipelines)

### Code Versioning (Git)

Standard Git workflow with branch protection:

| Branch | Purpose |
|--------|---------|
| `main` | Production-ready code |
| `feature/*` | New development |
| `milestone/*` | Grouping all features before merging into `main` |

### Data & Model Versioning (DVC)

**Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management.

```
.dvc/config
├── remote: origin
├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
└── auth: basic (credentials via environment)
```

**Tracked Artifacts:**

| File | Purpose |
|------|---------|
| `data/raw/skillscope_data.db` | Original SQLite database |
| `data/processed/*.npy` | TF-IDF and embedding features |
| `models/*.pkl` | Trained models and vectorizers |

**Versioning Workflow:**

```bash
# Track new data
dvc add data/raw/new_dataset.db
git add data/raw/.gitignore data/raw/new_dataset.db.dvc

# Push to remote
dvc push
git commit -m "Add new dataset version"
git push
```

### Experiment Tracking (MLflow)

**Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking.

| Configuration | Value |
|---------------|-------|
| Tracking URI | https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow |
| Experiments | `skill_classification`, `skill_prediction_api` |

**Logged Metrics:**

- Training: precision, recall, F1-score, training time
- Inference: prediction latency, confidence scores, timestamps

**Artifact Storage:**

- Model binaries (`.pkl`)
- Vectorizers and scalers
- Hyperparameter configurations

### Auditable ML Pipeline

The pipeline is designed for complete reproducibility:

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  dataset.py  │───▶│ features.py  │───▶│   train.py   │
│  (DVC pull)  │    │  (TF-IDF)    │    │  (MLflow)    │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   .dvc files          .dvc files          MLflow Run
```
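
A pipeline like this is typically declared in a `dvc.yaml` so that `dvc repro` re-runs only the stages whose dependencies changed. An illustrative sketch (script paths and intermediate file names are assumptions, not taken from the repo):

```yaml
# Hypothetical dvc.yaml mirroring the three-stage pipeline above
stages:
  dataset:
    cmd: python dataset.py
    deps: [data/raw/skillscope_data.db]
    outs: [data/interim/issues.parquet]
  features:
    cmd: python features.py
    deps: [data/interim/issues.parquet]
    outs: [data/processed/features.npy]
  train:
    cmd: python train.py
    deps: [data/processed/features.npy]
    outs: [models/classifier.pkl]
```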

## 3. Quality Assurance

### Testing Strategy

#### Static Analysis (Ruff)

**Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage.

| Configuration | Value |
|---------------|-------|
| Line Length | 88 (Black compatible) |
| Target | Python 3.10+ |
| Rule Sets | PEP 8, isort, pyflakes |

**CI Integration:**

```yaml
- name: Lint with Ruff
  run: make lint
```

#### Dynamic Testing (Pytest)

**Test Organization:**

```
tests/
├── unit/               # Isolated function tests
├── integration/        # Component interaction tests
├── system/             # End-to-end tests
├── behavioral/         # ML-specific tests
├── deepchecks/         # Data validation
└── great expectations/ # Schema validation
```

**Markers for Selective Execution:**

```python
@pytest.mark.unit
@pytest.mark.integration
@pytest.mark.system
@pytest.mark.slow
```
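
A marked test might look like the following sketch (the helper under test is hypothetical); `pytest -m unit` then selects only unit-marked tests:

```python
import pytest

def normalize_issue_text(text: str) -> str:
    """Hypothetical helper: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

@pytest.mark.unit
def test_normalize_issue_text():
    # Runs only when selected, e.g. `pytest -m unit` or excluded via `-m "not slow"`
    assert normalize_issue_text("Fix  NullPointerException") == "fix nullpointerexception"
```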

#### Model Validation vs Model Verification

| Concept | Definition | Implementation |
|---------|------------|----------------|
| Validation | Does the model fit user needs? | Micro-F1 vs baseline comparison |
| Verification | Is the model correctly built? | Unit tests, behavioral tests |

#### Behavioral Testing

**Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.

| Test Type | Count | Purpose |
|-----------|-------|---------|
| Invariance | 9 | Stability under perturbations (typos, case changes) |
| Directional | 10 | Expected behavior with keyword additions |
| Minimum Functionality | 17 | Basic sanity checks on clear examples |

**Example Invariance Test:**

```python
def test_case_insensitivity():
    """Model should predict same skills regardless of case."""
    assert predict("Fix BUG") == predict("fix bug")
```

### Data Quality Checks

#### Great Expectations (10 Tests)

**Design Decision:** Validate data at pipeline boundaries to catch quality issues early.

| Validation Point | Tests |
|------------------|-------|
| Raw Database | Schema, row count, required columns |
| Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility |
| Label Matrix | Binary format, distribution, consistency |
| Train/Test Split | No leakage, stratification |
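
To illustrate what the Feature and Label Matrix checks validate, here is a plain-NumPy sketch; the function and thresholds are ours for illustration, not the actual Great Expectations suite:

```python
import numpy as np

def validate_matrices(X: np.ndarray, Y: np.ndarray) -> list[str]:
    """Plain-NumPy analogues of the Feature/Label Matrix checks (illustrative)."""
    failures = []
    if not np.isfinite(X).all():                   # No NaN/Inf in features
        failures.append("features contain NaN/Inf")
    sparsity = 1.0 - np.count_nonzero(X) / X.size  # TF-IDF matrices are mostly zeros
    if sparsity < 0.5:
        failures.append(f"unexpectedly dense features (sparsity={sparsity:.2f})")
    if not np.isin(Y, [0, 1]).all():               # Labels must be binary
        failures.append("labels are not binary")
    if X.shape[0] != Y.shape[0]:                   # Feature/label row counts must match
        failures.append("feature/label row mismatch")
    return failures
```

An empty return value means the matrices pass; anything else names the failed check, which is the shape of report a validation suite produces.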

#### Deepchecks (24 Checks)

**Suites:**

- Data Integrity Suite (12 checks): Duplicates, nulls, correlations
- Train-Test Validation Suite (12 checks): Leakage, drift, distribution

**Status:** Production-ready (96% overall score)


## 4. API (Inference Service)

### FastAPI Implementation

**Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.

**Key Features:**

- Async lifespan management for model loading
- Middleware for Prometheus metrics collection
- Structured exception handling
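
Lifespan management loads the model once at startup and releases it at shutdown via an async context manager. A minimal standard-library sketch of the pattern (`load_model` is a stub and `MODELS` a hypothetical registry; the FastAPI wiring appears only as a comment):

```python
import asyncio
from contextlib import asynccontextmanager

MODELS: dict[str, object] = {}  # hypothetical registry populated once at startup

def load_model() -> object:
    """Stub for the real pickle/DVC model load."""
    return "classifier-v1"

@asynccontextmanager
async def lifespan(app=None):
    MODELS["classifier"] = load_model()  # startup: load before serving requests
    try:
        yield
    finally:
        MODELS.clear()                   # shutdown: release resources

# With FastAPI this would be wired as: app = FastAPI(lifespan=lifespan)

async def demo() -> bool:
    async with lifespan():
        return "classifier" in MODELS    # model available while serving
```

Loading once in the lifespan avoids re-reading the pickle on every request, which is the "model caching" referenced under the non-functional requirements.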

### RESTful Principles

**Design Decision:** Follow REST best practices for intuitive API design.

| Principle | Implementation |
|-----------|----------------|
| Nouns, not verbs | `/predictions` instead of `/getPrediction` |
| Plural resources | `/predictions`, `/issues` |
| HTTP methods | GET (retrieve), POST (create) |
| Status codes | 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) |

**Endpoint Design:**

| Method | Endpoint | Action |
|--------|----------|--------|
| POST | `/predict` | Create new prediction |
| POST | `/predict/batch` | Create batch predictions |
| GET | `/predictions` | List predictions |
| GET | `/predictions/{run_id}` | Get specific prediction |

### OpenAPI/Swagger Documentation

Auto-generated documentation at runtime:

- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`

**Pydantic Models for Schema Enforcement:**

```python
from typing import List, Optional

from pydantic import BaseModel

class IssueInput(BaseModel):
    issue_text: str
    repo_name: Optional[str] = None
    pr_number: Optional[int] = None

class PredictionResponse(BaseModel):
    run_id: str
    predictions: List[SkillPrediction]  # SkillPrediction is defined alongside these models
    model_version: str
```

## 5. Deployment (Containerization & CI/CD)

### Docker Containerization

**Design Decision:** Multi-stage Docker builds with security best practices.

**Dockerfile Features:**

- Python 3.10 slim base image (minimal footprint)
- Non-root user for security
- DVC integration for model pulling
- Health check endpoint configuration

**Multi-Service Architecture:**

```
docker-compose.yml
├── hopcroft-api (FastAPI)
│   ├── Port: 8080
│   ├── Volumes: source code, logs
│   └── Health check: /health
│
├── hopcroft-gui (Streamlit)
│   ├── Port: 8501
│   ├── Depends on: hopcroft-api
│   └── Environment: API_BASE_URL
│
└── hopcroft-net (Bridge network)
```
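
A sketch of how this layout could be expressed in `docker-compose.yml` (build contexts and the health-check command are assumptions; service names, ports, and the `API_BASE_URL` variable come from the architecture above):

```yaml
# Illustrative compose sketch, not the project's actual file
services:
  hopcroft-api:
    build: .
    ports: ["8080:8080"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
    networks: [hopcroft-net]

  hopcroft-gui:
    build: ./gui
    ports: ["8501:8501"]
    environment:
      API_BASE_URL: http://hopcroft-api:8080
    depends_on:
      hopcroft-api:
        condition: service_healthy   # health-based dependency management
    networks: [hopcroft-net]

networks:
  hopcroft-net:
    driver: bridge
```

The `service_healthy` condition is what delays the GUI until the API's `/health` endpoint responds.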

**Design Rationale:**

- Separation of concerns (API vs GUI)
- Independent scaling
- Health-based dependency management
- Shared network for internal communication

### CI/CD Pipeline (GitHub Actions)

**Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.

**Pipeline Stages:**

```yaml
jobs:
  unit-tests:
    - Checkout code
    - Setup Python 3.10
    - Install dependencies
    - Ruff linting
    - Pytest unit tests
    - Upload test report (on failure)

  build-image:
    - Needs: unit-tests
    - Configure DVC credentials
    - Pull models
    - Build Docker image
```

**Triggers:**

- Push to `main`, `feature/*`
- Pull requests to `main`

**Secrets Management:**

- `DAGSHUB_USERNAME`: DagsHub authentication
- `DAGSHUB_TOKEN`: DagsHub access token

### Hugging Face Spaces Hosting

**Design Decision:** Deploy on HF Spaces for free Docker-SDK hosting.

**Configuration:**

```yaml
title: Hopcroft Skill Classification
sdk: docker
app_port: 7860
```


**Startup Flow:**
1. `start_space.sh` configures DVC credentials
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx (port 7860) for routing

**Nginx Reverse Proxy:**
- `/` → Streamlit GUI
- `/docs`, `/predict`, `/predictions` → FastAPI
- `/prometheus` → Prometheus metrics
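
That routing might look like the following Nginx sketch (internal ports taken from the startup flow; directives are illustrative, not the Space's actual config):

```nginx
# Hypothetical routing for the Space entry port
server {
    listen 7860;

    location / {                      # Streamlit GUI (default route)
        proxy_pass http://127.0.0.1:8501;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # Streamlit relies on WebSockets
        proxy_set_header Connection "upgrade";
    }

    location ~ ^/(docs|openapi.json|predict|predictions) {  # FastAPI routes
        proxy_pass http://127.0.0.1:8000;
    }

    location /prometheus/ {           # Prometheus UI/metrics
        proxy_pass http://127.0.0.1:9090/;
    }
}
```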

---

## 6. Monitoring

### Resource-Level Monitoring

**Design Decision:** Implement Prometheus metrics for real-time observability.

| Metric | Type | Purpose |
|--------|------|---------|
| `hopcroft_requests_total` | Counter | Request volume by endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) |
| `hopcroft_in_progress_requests` | Gauge | Concurrent request load |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |

**Middleware Implementation:**
```python
@app.middleware("http")
async def monitor_requests(request, call_next):
    method, endpoint = request.method, request.url.path
    IN_PROGRESS.inc()                                          # Gauge: concurrent requests
    try:
        with REQUEST_LATENCY.labels(method, endpoint).time():  # Histogram timer
            response = await call_next(request)
    finally:
        IN_PROGRESS.dec()                                      # decrement even on error
    REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()  # Counter
    return response
```

### Performance-Level Monitoring

**Model Staleness Indicators:**

- Prediction confidence trends over time
- Drift detection alerts
- Error rate monitoring

### Drift Detection Strategy

**Design Decision:** Implement statistical drift detection using the Kolmogorov-Smirnov test with Bonferroni correction.

| Component | Details |
|-----------|---------|
| Algorithm | KS two-sample test |
| Baseline | 1000 samples from training data |
| Threshold | p-value < 0.05 (Bonferroni corrected) |
| Execution | Scheduled via cron or manual trigger |

**Drift Types Monitored:**

| Type | Definition | Detection Method |
|------|------------|------------------|
| Data Drift | Feature distribution shift | KS test on input features |
| Target Drift | Label distribution shift | Chi-square test on predictions |
| Concept Drift | Relationship change | Performance degradation monitoring |

**Metrics Published to Pushgateway:**

- `drift_detected`: Binary indicator (0/1)
- `drift_p_value`: Statistical significance
- `drift_distance`: KS distance metric
- `drift_check_timestamp`: Last check time

### Alerting Configuration

**Prometheus Alert Rules:**

| Alert | Condition | Severity |
|-------|-----------|----------|
| `ServiceDown` | Target down for 5m | Critical |
| `HighErrorRate` | 5xx rate > 10% | Warning |
| `SlowRequests` | P95 latency > 2s | Warning |
| `DriftDetected` | `drift_detected = 1` | Warning |
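
Two of these rules, sketched as a Prometheus rule file (the `up` expression and label names are assumptions; `drift_detected` is the Pushgateway metric named above):

```yaml
# Illustrative rule file, not the project's actual alerts.yml
groups:
  - name: hopcroft-alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m                  # target must be down for 5 minutes
        labels:
          severity: critical
      - alert: DriftDetected
        expr: drift_detected == 1
        labels:
          severity: warning
```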

**Alertmanager Integration:**

- Severity-based routing
- Email notifications
- Inhibition rules to prevent alert storms

### Grafana Visualization

**Dashboard Panels:**

  1. Request Rate (gauge)
  2. Request Latency p50/p95 (time series)
  3. In-Progress Requests (stat panel)
  4. Error Rate 5xx (stat panel)
  5. Model Prediction Time (time series)
  6. Requests by Endpoint (bar chart)

**Data Sources:**

- Prometheus: Real-time metrics
- Pushgateway: Batch job metrics (drift detection)

### HF Spaces Deployment

Both Prometheus and Grafana are deployed on Hugging Face Spaces via Nginx reverse proxy:

| Service | Production URL |
|---------|----------------|
| Prometheus | https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/ |
| Grafana | https://dacrow13-hopcroft-skill-classification.hf.space/grafana/ |

This enables real-time monitoring of the production deployment without additional infrastructure.