# Design Choices

Technical justification of the architectural and engineering decisions made during the Hopcroft project development, following professional MLOps and Software Engineering standards.

---

## Table of Contents

1. [Inception (Requirements Engineering)](#1-inception-requirements-engineering)
2. [Reproducibility (Versioning & Pipelines)](#2-reproducibility-versioning--pipelines)
3. [Quality Assurance](#3-quality-assurance)
4. [API (Inference Service)](#4-api-inference-service)
5. [Deployment (Containerization & CI/CD)](#5-deployment-containerization--cicd)
6. [Monitoring](#6-monitoring)

---

## 1. Inception (Requirements Engineering)

### Machine Learning Canvas

The project adopted the **Machine Learning Canvas** framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.

| Canvas Section | Application |
|----------------|-------------|
| **Prediction Task** | Multi-label classification of 217 technical skills from GitHub issue text |
| **Decisions** | Automated developer assignment based on predicted skill requirements |
| **Value Proposition** | Reduced issue resolution time, optimized resource allocation |
| **Data Sources** | SkillScope DB (7,245 PRs from 11 Java repositories) |
| **Making Predictions** | Real-time classification upon issue creation |
| **Building Models** | Iterative improvement over RF+TF-IDF baseline |
| **Monitoring** | Continuous evaluation with drift detection |

The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md).

### Functional vs Non-Functional Requirements

#### Functional Requirements

| Requirement | Target | Metric |
|-------------|--------|--------|
| **Precision** | ≥ Baseline | True positives / Predicted positives |
| **Recall** | ≥ Baseline | True positives / Actual positives |
| **Micro-F1** | > Baseline | Harmonic mean of precision and recall, computed from TP/FP/FN counts pooled across all labels |
| **Multi-label Support** | 217 skills | Simultaneous prediction of multiple labels |
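
To make the micro-averaged metrics in the table concrete, the following is a minimal, dependency-free sketch. The function name `micro_prf` and the toy label matrices are illustrative assumptions, not taken from the project code; in practice a library routine would normally compute these values.

```python
from typing import List, Tuple


def micro_prf(y_true: List[List[int]], y_pred: List[List[int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 for binary indicator matrices."""
    tp = fp = fn = 0
    # Pool the counts over every (issue, skill) cell before computing ratios.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 issues and 4 skills (the real label space has 217 skills).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
precision, recall, f1 = micro_prf(y_true, y_pred)
```

Because the counts are pooled before the ratios are taken, frequent skills weigh more heavily than rare ones, which is worth keeping in mind when comparing against the baseline.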

#### Non-Functional Requirements

| Category | Requirement | Implementation |
|----------|-------------|----------------|
| **Reproducibility** | Auditable experiments | MLflow tracking, DVC versioning |
| **Explainability** | Interpretable predictions | Confidence scores per skill |
| **Performance** | Low latency inference | FastAPI async, model caching |
| **Scalability** | Batch processing | `/predict/batch` endpoint (max 100) |
| **Maintainability** | Clean code | Ruff linting, type hints, docstrings |

### System-First vs Model-First Development

The project adopted a **System-First** approach, prioritizing infrastructure and pipeline development before model optimization:

```
Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Infrastructure │ Phase 2: Model Development        │
│ - DVC/MLflow setup      │ - Feature engineering              │
│ - CI/CD pipeline        │ - Hyperparameter tuning            │
│ - Docker containers     │ - SMOTE/ADASYN experiments         │
│ - API skeleton          │ - Performance optimization         │
└─────────────────────────────────────────────────────────────┘
```

**Rationale:**
- Enables rapid iteration once infrastructure is stable
- Ensures reproducibility from day one
- Reduces technical debt during model development
- Facilitates team collaboration with shared tooling

---

## 2. Reproducibility (Versioning & Pipelines)

### Code Versioning (Git)

Standard Git workflow with branch protection:

| Branch | Purpose |
|--------|---------|
| `main` | Production-ready code |
| `feature/*` | New development |
| `milestone/*` | Integration branch that groups completed features before merging into `main` |

### Data & Model Versioning (DVC)

**Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management.

```
.dvc/config
├── remote: origin
├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
└── auth: basic (credentials via environment)
```

**Tracked Artifacts:**

| File | Purpose |
|------|---------|
| `data/raw/skillscope_data.db` | Original SQLite database |
| `data/processed/*.npy` | TF-IDF and embedding features |
| `models/*.pkl` | Trained models and vectorizers |

**Versioning Workflow:**
```bash
# Track new data
dvc add data/raw/new_dataset.db
git add data/raw/.gitignore data/raw/new_dataset.db.dvc

# Push to remote
dvc push
git commit -m "Add new dataset version"
git push
```

### Experiment Tracking (MLflow)

**Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking.

| Configuration | Value |
|---------------|-------|
| Tracking URI | `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow` |
| Experiments | `skill_classification`, `skill_prediction_api` |

**Logged Metrics:**
- Training: precision, recall, F1-score, training time
- Inference: prediction latency, confidence scores, timestamps

**Artifact Storage:**
- Model binaries (`.pkl`)
- Vectorizers and scalers
- Hyperparameter configurations

### Auditable ML Pipeline

The pipeline is designed for complete reproducibility:

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   dataset.py │───▶│  features.py │───▶│   train.py   │
│   (DVC pull) │    │  (TF-IDF)    │    │  (MLflow)    │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
    .dvc files         .dvc files          MLflow Run
```

---

## 3. Quality Assurance

### Testing Strategy

#### Static Analysis (Ruff)

**Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage.

| Configuration | Value |
|---------------|-------|
| Line Length | 88 (Black compatible) |
| Target Python | 3.10+ |
| Rule Sets | pycodestyle (PEP 8), isort, Pyflakes |

**CI Integration:**
```yaml
- name: Lint with Ruff
  run: make lint
```

#### Dynamic Testing (Pytest)

**Test Organization:**

```
tests/
├── unit/              # Isolated function tests
├── integration/       # Component interaction tests
├── system/            # End-to-end tests
├── behavioral/        # ML-specific tests
├── deepchecks/        # Data validation
└── great expectations/ # Schema validation
```

**Markers for Selective Execution:**
```python
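import pytest

@pytest.mark.unit  # other markers: integration, system
def test_placeholder_unit(): ...

@pytest.mark.system
@pytest.mark.slow  # skip with: pytest -m "not slow"
def test_placeholder_system(): ...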
```

### Model Validation vs Model Verification

| Concept | Definition | Implementation |
|---------|------------|----------------|
| **Validation** | Does the model fit user needs? | Micro-F1 vs baseline comparison |
| **Verification** | Is the model correctly built? | Unit tests, behavioral tests |

### Behavioral Testing

**Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.

| Test Type | Count | Purpose |
|-----------|-------|---------|
| **Invariance** | 9 | Stability under perturbations (typos, case changes) |
| **Directional** | 10 | Expected behavior with keyword additions |
| **Minimum Functionality** | 17 | Basic sanity checks on clear examples |

**Example Invariance Test:**
```python
def test_case_insensitivity():
    """Model should predict same skills regardless of case."""
    assert predict("Fix BUG") == predict("fix bug")
```
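
The invariance and directional test types can be sketched end to end as follows. This is only an illustration under stated assumptions: the keyword-based `predict` function and the skill names in `KEYWORD_SKILLS` are hypothetical stand-ins for the trained classifier, chosen so the example runs without any model artifacts.

```python
from typing import FrozenSet, List

# Hypothetical keyword-to-skill map standing in for the trained classifier.
KEYWORD_SKILLS = {"bug": "Debugging", "sql": "Databases", "thread": "Concurrency"}


def predict(text: str) -> FrozenSet[str]:
    """Toy predictor: a skill is predicted when its keyword appears as a token."""
    tokens = text.lower().split()
    return frozenset(skill for kw, skill in KEYWORD_SKILLS.items() if kw in tokens)


def case_variants(text: str) -> List[str]:
    """Perturbations that should not change the prediction."""
    return [text.upper(), text.lower(), text.title()]


def test_invariance_to_case() -> None:
    base = "Fix bug in SQL thread pool"
    expected = predict(base)
    for variant in case_variants(base):
        assert predict(variant) == expected


def test_directional_keyword_addition() -> None:
    # Adding a database keyword should add a skill and never remove one.
    before = predict("Fix bug")
    after = predict("Fix bug in sql query")
    assert before <= after
    assert "Databases" in after
```

With the real model, `predict` would wrap the inference call and the comparison might need a tolerance on confidence scores rather than strict set equality.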

### Data Quality Checks

#### Great Expectations (10 Tests)

**Design Decision:** Validate data at pipeline boundaries to catch quality issues early.

| Validation Point | Tests |
|------------------|-------|
| Raw Database | Schema, row count, required columns |
| Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility |
| Label Matrix | Binary format, distribution, consistency |
| Train/Test Split | No leakage, stratification |
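
The kinds of checks listed in the table can be illustrated with plain Python. The sketch below is not the Great Expectations API; the helper names and the toy data are assumptions made for illustration, and the real suite expresses equivalent conditions as expectations.

```python
import math
from typing import Hashable, Iterable, Sequence


def features_are_finite(X: Sequence[Sequence[float]]) -> bool:
    """Feature matrix check: no NaN or Inf values anywhere."""
    return all(math.isfinite(v) for row in X for v in row)


def labels_are_binary(Y: Sequence[Sequence[int]]) -> bool:
    """Label matrix check: every entry is 0 or 1."""
    return all(v in (0, 1) for row in Y for v in row)


def has_no_leakage(train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]) -> bool:
    """Split check: no sample identifier appears in both train and test."""
    return set(train_ids).isdisjoint(test_ids)


# Toy data (hypothetical): two samples, two features, two labels.
X = [[0.0, 1.5], [2.0, 0.0]]
Y = [[0, 1], [1, 1]]
checks = {
    "features_finite": features_are_finite(X),
    "labels_binary": labels_are_binary(Y),
    "no_leakage": has_no_leakage([101, 102], [103]),
}
```

Running such checks at each pipeline boundary means a failure points at the stage that produced the bad artifact rather than surfacing later during training.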

#### Deepchecks (24 Checks)

**Suites:**
- **Data Integrity Suite** (12 checks): Duplicates, nulls, correlations
- **Train-Test Validation Suite** (12 checks): Leakage, drift, distribution

**Status:** Production-ready (96% overall score)

---

## 4. API (Inference Service)

### FastAPI Implementation

**Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.

**Key Features:**
- Async lifespan management for model loading
- Middleware for Prometheus metrics collection
- Structured exception handling

### RESTful Principles

**Design Decision:** Follow REST best practices for intuitive API design.

| Principle | Implementation |
|-----------|----------------|
| **Nouns, not verbs** | `/predictions` instead of `/getPrediction` |
| **Plural resources** | `/predictions`, `/issues` |
| **HTTP methods** | GET (retrieve), POST (create) |
| **Status codes** | 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) |

**Endpoint Design:**

| Method | Endpoint | Action |
|--------|----------|--------|
| `POST` | `/predict` | Create new prediction |
| `POST` | `/predict/batch` | Create batch predictions |
| `GET` | `/predictions` | List predictions |
| `GET` | `/predictions/{run_id}` | Get specific prediction |

### OpenAPI/Swagger Documentation

**Auto-generated documentation at runtime:**
- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`

**Pydantic Models for Schema Enforcement:**
```python
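from typing import List, Optional

from pydantic import BaseModel


# SkillPrediction is another Pydantic model, defined elsewhere in the project.
class IssueInput(BaseModel):
    issue_text: str
    repo_name: Optional[str] = None
    pr_number: Optional[int] = None


class PredictionResponse(BaseModel):
    run_id: str
    predictions: List[SkillPrediction]
    model_version: str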

---

## 5. Deployment (Containerization & CI/CD)

### Docker Containerization

**Design Decision:** Multi-stage Docker builds with security best practices.

**Dockerfile Features:**
- Python 3.10 slim base image (minimal footprint)
- Non-root user for security
- DVC integration for model pulling
- Health check endpoint configuration

**Multi-Service Architecture:**

```
docker-compose.yml
├── hopcroft-api (FastAPI)
│   ├── Port: 8080
│   ├── Volumes: source code, logs
│   └── Health check: /health
│
├── hopcroft-gui (Streamlit)
│   ├── Port: 8501
│   ├── Depends on: hopcroft-api
│   └── Environment: API_BASE_URL
│
└── hopcroft-net (Bridge network)
```

**Design Rationale:**
- Separation of concerns (API vs GUI)
- Independent scaling
- Health-based dependency management
- Shared network for internal communication

### CI/CD Pipeline (GitHub Actions)

**Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.

**Pipeline Stages:**

```yaml
Jobs:
  unit-tests:
    - Checkout code
    - Setup Python 3.10
    - Install dependencies
    - Ruff linting
    - Pytest unit tests
    - Upload test report (on failure)

  build-image:
    - Needs: unit-tests
    - Configure DVC credentials
    - Pull models
    - Build Docker image
```

**Triggers:**
- Push to `main`, `feature/*`
- Pull requests to `main`

**Secrets Management:**
- `DAGSHUB_USERNAME`: DagsHub authentication
- `DAGSHUB_TOKEN`: DagsHub access token

### Hugging Face Spaces Hosting

**Design Decision:** Deploy on HF Spaces for free hosting with Docker SDK support. The free tier runs on CPU hardware; GPU hardware is an optional paid upgrade.

**Configuration:**
```yaml
---
title: Hopcroft Skill Classification
sdk: docker
app_port: 7860
---
```

**Startup Flow:**
1. `start_space.sh` configures DVC credentials
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx (port 7860) for routing

**Nginx Reverse Proxy:**
- `/` → Streamlit GUI
- `/docs`, `/predict`, `/predictions` → FastAPI
- `/prometheus` → Prometheus metrics
- `/grafana` → Grafana dashboards

---

## 6. Monitoring

### Resource-Level Monitoring

**Design Decision:** Implement Prometheus metrics for real-time observability.

| Metric | Type | Purpose |
|--------|------|---------|
| `hopcroft_requests_total` | Counter | Request volume by endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) |
| `hopcroft_in_progress_requests` | Gauge | Concurrent request load |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |

**Middleware Implementation:**
```python
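@app.middleware("http")
async def monitor_requests(request, call_next):
    # Label values are taken from the incoming request.
    method, endpoint = request.method, request.url.path
    IN_PROGRESS.inc()
    try:
        with REQUEST_LATENCY.labels(method, endpoint).time():
            response = await call_next(request)
        REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()
        return response
    finally:
        # Always decrement the gauge, even if the handler raises.
        IN_PROGRESS.dec()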
```

### Performance-Level Monitoring

**Model Staleness Indicators:**
- Prediction confidence trends over time
- Drift detection alerts
- Error rate monitoring

### Drift Detection Strategy

**Design Decision:** Implement statistical drift detection using Kolmogorov-Smirnov test with Bonferroni correction.

| Component | Details |
|-----------|---------|
| **Algorithm** | KS Two-Sample Test |
| **Baseline** | 1000 samples from training data |
| **Threshold** | p-value < 0.05 / m, where m is the number of features tested (Bonferroni-corrected α = 0.05) |
| **Execution** | Scheduled via cron or manual trigger |
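
The strategy in the table can be sketched without external dependencies. The code below is an illustration under stated assumptions, not the project's drift module: the function names, the feature names, and the toy samples are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction. A library routine such as `scipy.stats.ks_2samp` would normally be preferred in practice.

```python
import math
from bisect import bisect_right
from typing import Dict, Sequence, Tuple


def ks_two_sample(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (KS distance D, approximate p-value) for two samples."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def detect_drift(
    baseline: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    alpha: float = 0.05,
) -> Dict[str, Dict[str, float]]:
    """Run one KS test per feature against a Bonferroni-corrected threshold."""
    threshold = alpha / len(baseline)  # split alpha across all tested features
    report = {}
    for name in baseline:
        d, p = ks_two_sample(baseline[name], current[name])
        report[name] = {"distance": d, "p_value": p, "drifted": p < threshold}
    return report


# Toy data: "stable" is unchanged, "shifted" has moved by half its range.
baseline = {"stable": list(range(100)), "shifted": list(range(100))}
current = {"stable": list(range(100)), "shifted": [x + 50 for x in range(100)]}
report = detect_drift(baseline, current)
```

Dividing the significance level by the number of features keeps the overall false-alarm rate near 5% even when many features are tested in the same run.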

**Drift Types Monitored:**

| Type | Definition | Detection Method |
|------|------------|------------------|
| **Data Drift** | Feature distribution shift | KS test on input features |
| **Target Drift** | Label distribution shift | Chi-square test on predictions |
| **Concept Drift** | Relationship change | Performance degradation monitoring |

**Metrics Published to Pushgateway:**
- `drift_detected`: Binary indicator (0/1)
- `drift_p_value`: Statistical significance
- `drift_distance`: KS distance metric
- `drift_check_timestamp`: Last check time

### Alerting Configuration

**Prometheus Alert Rules:**

| Alert | Condition | Severity |
|-------|-----------|----------|
| `ServiceDown` | Target down for 5m | Critical |
| `HighErrorRate` | 5xx rate > 10% | Warning |
| `SlowRequests` | P95 latency > 2s | Warning |
| `DriftDetected` | drift_detected = 1 | Warning |
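
The alert conditions above can be illustrated offline on raw samples. This sketch is an assumption-laden simplification: Prometheus evaluates such rules over time windows and estimates percentiles from histogram buckets, whereas the hypothetical `evaluate_alerts` helper below works directly on lists of status codes and latencies. `ServiceDown` is omitted because it depends on scrape health rather than request data.

```python
import math
from typing import List, Sequence


def p95(latencies: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def evaluate_alerts(
    status_codes: Sequence[int],
    latencies: Sequence[float],
    drift_detected: int,
) -> List[str]:
    """Return the names of the alerts whose conditions are met."""
    firing = []
    error_rate = sum(500 <= code < 600 for code in status_codes) / len(status_codes)
    if error_rate > 0.10:
        firing.append("HighErrorRate")
    if p95(latencies) > 2.0:
        firing.append("SlowRequests")
    if drift_detected == 1:
        firing.append("DriftDetected")
    return firing


# Toy window: 15% of requests fail and the slowest 10% take 3 seconds.
firing = evaluate_alerts(
    status_codes=[200] * 85 + [500] * 15,
    latencies=[0.1] * 90 + [3.0] * 10,
    drift_detected=0,
)
```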

**Alertmanager Integration:**
- Severity-based routing
- Email notifications
- Inhibition rules to prevent alert storms

### Grafana Visualization

**Dashboard Panels:**
1. Request Rate (gauge)
2. Request Latency p50/p95 (time series)
3. In-Progress Requests (stat panel)
4. Error Rate 5xx (stat panel)
5. Model Prediction Time (time series)
6. Requests by Endpoint (bar chart)

**Data Sources:**
- Prometheus: Real-time metrics
- Pushgateway: Batch job metrics (drift detection)

### HF Spaces Deployment

Both Prometheus and Grafana run inside the Hugging Face Space and are exposed through the Nginx reverse proxy:

| Service | Production URL |
|---------|----------------|
| Prometheus | `https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/` |
| Grafana | `https://dacrow13-hopcroft-skill-classification.hf.space/grafana/` |

This enables real-time monitoring of the production deployment without additional infrastructure.