Design Choices
Technical justification of the architectural and engineering decisions made during the Hopcroft project development, following professional MLOps and Software Engineering standards.
Table of Contents
- Inception (Requirements Engineering)
- Reproducibility (Versioning & Pipelines)
- Quality Assurance
- API (Inference Service)
- Deployment (Containerization & CI/CD)
- Monitoring
1. Inception (Requirements Engineering)
Machine Learning Canvas
The project adopted the Machine Learning Canvas framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.
| Canvas Section | Application |
|---|---|
| Prediction Task | Multi-label classification of 217 technical skills from GitHub issue text |
| Decisions | Automated developer assignment based on predicted skill requirements |
| Value Proposition | Reduced issue resolution time, optimized resource allocation |
| Data Sources | SkillScope DB (7,245 PRs from 11 Java repositories) |
| Making Predictions | Real-time classification upon issue creation |
| Building Models | Iterative improvement over RF+TF-IDF baseline |
| Monitoring | Continuous evaluation with drift detection |
The complete ML Canvas is documented in ML Canvas.md.
Functional vs Non-Functional Requirements
Functional Requirements
| Requirement | Target | Metric |
|---|---|---|
| Precision | ≥ Baseline | True positives / Predicted positives |
| Recall | ≥ Baseline | True positives / Actual positives |
| Micro-F1 | > Baseline | Harmonic mean across all labels |
| Multi-label Support | 217 skills | Simultaneous prediction of multiple labels |
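The metrics in the table above can be illustrated with a small sketch. It assumes binary label matrices (one row per issue, one column per skill) and pools true positives, false positives, and false negatives across all labels before computing the scores, which is what micro-averaging means. The function name `micro_metrics` and the toy data are illustrative and not taken from the project code.

```python
from typing import Sequence, Tuple


def micro_metrics(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) for binary label matrices."""
    tp = fp = fn = 0
    # Pool the counts over every (sample, label) cell.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 issues, 4 skills (the real label matrix has 217 columns).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
precision, recall, f1 = micro_metrics(y_true, y_pred)
```

In this toy case there are 4 true positives, 1 false positive, and 1 false negative, so all three scores come out at 0.8.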
Non-Functional Requirements
| Category | Requirement | Implementation |
|---|---|---|
| Reproducibility | Auditable experiments | MLflow tracking, DVC versioning |
| Explainability | Interpretable predictions | Confidence scores per skill |
| Performance | Low latency inference | FastAPI async, model caching |
| Scalability | Batch processing | /predict/batch endpoint (max 100) |
| Maintainability | Clean code | Ruff linting, type hints, docstrings |
System-First vs Model-First Development
The project adopted a System-First approach, prioritizing infrastructure and pipeline development before model optimization:
Timeline:
| Phase 1: Infrastructure | Phase 2: Model Development |
|---|---|
| DVC/MLflow setup | Feature engineering |
| CI/CD pipeline | Hyperparameter tuning |
| Docker containers | SMOTE/ADASYN experiments |
| API skeleton | Performance optimization |
Rationale:
- Enables rapid iteration once infrastructure is stable
- Ensures reproducibility from day one
- Reduces technical debt during model development
- Facilitates team collaboration with shared tooling
2. Reproducibility (Versioning & Pipelines)
Code Versioning (Git)
Standard Git workflow with branch protection:
| Branch | Purpose |
|---|---|
| main | Production-ready code |
| feature/* | New development |
| milestone/* | Grouping all features before merging into main |
Data & Model Versioning (DVC)
Design Decision: Use DVC (Data Version Control) with DagsHub remote storage for large file management.
.dvc/config
├── remote: origin
├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
└── auth: basic (credentials via environment)
Tracked Artifacts:
| File | Purpose |
|---|---|
| data/raw/skillscope_data.db | Original SQLite database |
| data/processed/*.npy | TF-IDF and embedding features |
| models/*.pkl | Trained models and vectorizers |
Versioning Workflow:
# Track new data
dvc add data/raw/new_dataset.db
git add data/raw/.gitignore data/raw/new_dataset.db.dvc
# Push to remote
dvc push
git commit -m "Add new dataset version"
git push
Experiment Tracking (MLflow)
Design Decision: Remote MLflow instance on DagsHub for collaborative experiment tracking.
| Configuration | Value |
|---|---|
| Tracking URI | https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow |
| Experiments | skill_classification, skill_prediction_api |
Logged Metrics:
- Training: precision, recall, F1-score, training time
- Inference: prediction latency, confidence scores, timestamps
Artifact Storage:
- Model binaries (.pkl)
- Vectorizers and scalers
- Hyperparameter configurations
Auditable ML Pipeline
The pipeline is designed for complete reproducibility:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  dataset.py  │───▶│ features.py  │───▶│   train.py   │
│  (DVC pull)  │    │   (TF-IDF)   │    │   (MLflow)   │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   .dvc files          .dvc files          MLflow Run
3. Quality Assurance
Testing Strategy
Static Analysis (Ruff)
Design Decision: Use Ruff as the primary linter for speed and comprehensive rule coverage.
| Configuration | Value |
|---|---|
| Line Length | 88 (Black compatible) |
| Target Python | 3.10+ |
| Rule Sets | PEP 8, isort, pyflakes |
CI Integration:
- name: Lint with Ruff
  run: make lint
Dynamic Testing (Pytest)
Test Organization:
tests/
├── unit/                 # Isolated function tests
├── integration/          # Component interaction tests
├── system/               # End-to-end tests
├── behavioral/           # ML-specific tests
├── deepchecks/           # Data validation
└── great expectations/   # Schema validation
Markers for Selective Execution:
@pytest.mark.unit
@pytest.mark.integration
@pytest.mark.system
@pytest.mark.slow
Model Validation vs Model Verification
| Concept | Definition | Implementation |
|---|---|---|
| Validation | Does the model fit user needs? | Micro-F1 vs baseline comparison |
| Verification | Is the model correctly built? | Unit tests, behavioral tests |
Behavioral Testing
Design Decision: Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.
| Test Type | Count | Purpose |
|---|---|---|
| Invariance | 9 | Stability under perturbations (typos, case changes) |
| Directional | 10 | Expected behavior with keyword additions |
| Minimum Functionality | 17 | Basic sanity checks on clear examples |
Example Invariance Test:
def test_case_insensitivity():
    """Model should predict same skills regardless of case."""
    assert predict("Fix BUG") == predict("fix bug")
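The invariance pattern can be generalized to several perturbations at once. The sketch below is only an illustration: `predict` is replaced by a hypothetical keyword-lookup stub (the project's real prediction function is not shown in this document), and `check_invariance` is an assumed helper name.

```python
from typing import Callable, FrozenSet, Iterable

# Hypothetical stand-in for the real model: maps keywords to skill labels.
_KEYWORD_SKILLS = {"bug": "debugging", "test": "testing", "sql": "databases"}


def predict(text: str) -> FrozenSet[str]:
    """Toy predictor (assumption): lowercases the text and looks up keywords."""
    tokens = text.lower().split()
    return frozenset(_KEYWORD_SKILLS[t] for t in tokens if t in _KEYWORD_SKILLS)


def check_invariance(
    predict_fn: Callable[[str], FrozenSet[str]],
    text: str,
    perturbations: Iterable[Callable[[str], str]],
) -> bool:
    """Return True if every perturbed input yields the same prediction."""
    expected = predict_fn(text)
    return all(predict_fn(perturb(text)) == expected for perturb in perturbations)


# Case changes and extra whitespace should not alter the predicted skills.
CASE_AND_SPACING = [str.upper, str.lower, str.title, lambda s: f"  {s}  "]


def test_case_and_whitespace_invariance():
    assert check_invariance(predict, "Fix bug in SQL test", CASE_AND_SPACING)
```

Swapping the stub for the real model keeps the test body unchanged, which is the main appeal of expressing perturbations as plain functions.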
Data Quality Checks
Great Expectations (10 Tests)
Design Decision: Validate data at pipeline boundaries to catch quality issues early.
| Validation Point | Tests |
|---|---|
| Raw Database | Schema, row count, required columns |
| Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility |
| Label Matrix | Binary format, distribution, consistency |
| Train/Test Split | No leakage, stratification |
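A few of the checks in the table can be expressed in plain Python to show what each one asserts. This sketch deliberately does not use the Great Expectations API; the function names and sample values are illustrative assumptions, not the project's actual expectations.

```python
import math
from typing import Hashable, Sequence


def has_only_finite_values(matrix: Sequence[Sequence[float]]) -> bool:
    """Feature matrix check: no NaN or Inf entries."""
    return all(math.isfinite(value) for row in matrix for value in row)


def is_binary_matrix(matrix: Sequence[Sequence[int]]) -> bool:
    """Label matrix check: every entry is exactly 0 or 1."""
    return all(value in (0, 1) for row in matrix for value in row)


def has_no_leakage(train_ids: Sequence[Hashable], test_ids: Sequence[Hashable]) -> bool:
    """Split check: no sample identifier appears in both train and test."""
    return set(train_ids).isdisjoint(test_ids)


# Tiny illustrative inputs (hypothetical values and identifiers).
features = [[0.0, 0.3], [0.7, 0.0]]
labels = [[1, 0], [0, 1]]
assert has_only_finite_values(features)
assert is_binary_matrix(labels)
assert has_no_leakage(["pr-1", "pr-2"], ["pr-3"])
```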
Deepchecks (24 Checks)
Suites:
- Data Integrity Suite (12 checks): Duplicates, nulls, correlations
- Train-Test Validation Suite (12 checks): Leakage, drift, distribution
Status: Production-ready (96% overall score)
4. API (Inference Service)
FastAPI Implementation
Design Decision: Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.
Key Features:
- Async lifespan management for model loading
- Middleware for Prometheus metrics collection
- Structured exception handling
RESTful Principles
Design Decision: Follow REST best practices for intuitive API design.
| Principle | Implementation |
|---|---|
| Nouns, not verbs | /predictions instead of /getPrediction |
| Plural resources | /predictions, /issues |
| HTTP methods | GET (retrieve), POST (create) |
| Status codes | 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) |
Endpoint Design:
| Method | Endpoint | Action |
|---|---|---|
| POST | /predict | Create new prediction |
| POST | /predict/batch | Create batch predictions |
| GET | /predictions | List predictions |
| GET | /predictions/{run_id} | Get specific prediction |
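Because the requirements cap /predict/batch at 100 items, a client with a larger backlog has to split its input. The sketch below shows one way to do that; it is a client-side assumption rather than part of the documented API, and `chunk_issues` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 100  # limit stated for /predict/batch in the requirements table


def chunk_issues(issues: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` issues."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(issues), size):
        yield list(issues[start:start + size])


# 250 placeholder issue texts split into batches of 100, 100, and 50.
issues = [f"issue text {i}" for i in range(250)]
batches = list(chunk_issues(issues))
batch_sizes = [len(batch) for batch in batches]
```

Each batch would then be sent as one POST request to the batch endpoint.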
OpenAPI/Swagger Documentation
Auto-generated documentation at runtime:
- Swagger UI: /docs
- ReDoc: /redoc
- OpenAPI JSON: /openapi.json
Pydantic Models for Schema Enforcement:
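from typing import List, Optional

from pydantic import BaseModel


class IssueInput(BaseModel):
    issue_text: str
    repo_name: Optional[str] = None
    pr_number: Optional[int] = None


class PredictionResponse(BaseModel):
    run_id: str
    predictions: List[SkillPrediction]  # SkillPrediction is defined elsewhere in the project
    model_version: str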
5. Deployment (Containerization & CI/CD)
Docker Containerization
Design Decision: Multi-stage Docker builds with security best practices.
Dockerfile Features:
- Python 3.10 slim base image (minimal footprint)
- Non-root user for security
- DVC integration for model pulling
- Health check endpoint configuration
Multi-Service Architecture:
docker-compose.yml
├── hopcroft-api (FastAPI)
│   ├── Port: 8080
│   ├── Volumes: source code, logs
│   └── Health check: /health
│
├── hopcroft-gui (Streamlit)
│   ├── Port: 8501
│   ├── Depends on: hopcroft-api
│   └── Environment: API_BASE_URL
│
└── hopcroft-net (Bridge network)
Design Rationale:
- Separation of concerns (API vs GUI)
- Independent scaling
- Health-based dependency management
- Shared network for internal communication
CI/CD Pipeline (GitHub Actions)
Design Decision: Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.
Pipeline Stages:
Jobs:
  unit-tests:
    - Checkout code
    - Setup Python 3.10
    - Install dependencies
    - Ruff linting
    - Pytest unit tests
    - Upload test report (on failure)
  build-image:
    - Needs: unit-tests
    - Configure DVC credentials
    - Pull models
    - Build Docker image
Triggers:
- Push to main, feature/*
- Pull requests to main
Secrets Management:
- DAGSHUB_USERNAME: DagsHub authentication
- DAGSHUB_TOKEN: DagsHub access token
Hugging Face Spaces Hosting
Design Decision: Deploy on HF Spaces for free hosting with Docker SDK support.
Configuration:
```yaml
title: Hopcroft Skill Classification
sdk: docker
app_port: 7860
```
**Startup Flow:**
1. `start_space.sh` configures DVC credentials
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx (port 7860) for routing
**Nginx Reverse Proxy:**
- `/` → Streamlit GUI
- `/docs`, `/predict`, `/predictions` → FastAPI
- `/prometheus` → Prometheus metrics
---
## 6. Monitoring
### Resource-Level Monitoring
**Design Decision:** Implement Prometheus metrics for real-time observability.
| Metric | Type | Purpose |
|--------|------|---------|
| `hopcroft_requests_total` | Counter | Request volume by endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) |
| `hopcroft_in_progress_requests` | Gauge | Concurrent request load |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |
**Middleware Implementation:**
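```python
@app.middleware("http")
async def monitor_requests(request, call_next):
    # Label values are taken from the incoming request.
    method, endpoint = request.method, request.url.path
    IN_PROGRESS.inc()
    try:
        with REQUEST_LATENCY.labels(method, endpoint).time():
            response = await call_next(request)
        REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()
        return response
    finally:
        # Decrement even if the handler raises, so the gauge cannot leak.
        IN_PROGRESS.dec()
```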
Performance-Level Monitoring
Model Staleness Indicators:
- Prediction confidence trends over time
- Drift detection alerts
- Error rate monitoring
Drift Detection Strategy
Design Decision: Implement statistical drift detection using Kolmogorov-Smirnov test with Bonferroni correction.
| Component | Details |
|---|---|
| Algorithm | KS Two-Sample Test |
| Baseline | 1000 samples from training data |
| Threshold | p-value < 0.05 (Bonferroni corrected) |
| Execution | Scheduled via cron or manual trigger |
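The strategy summarized above can be sketched in pure Python. This is a minimal illustration under stated assumptions: the Bonferroni correction is taken to mean dividing the 0.05 threshold by the number of features tested, the p-value uses the asymptotic Kolmogorov series (approximate for small samples), and the names `ks_statistic`, `ks_p_value`, and `detect_drift` are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would typically be used instead.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def ks_statistic(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_p_value(d: float, n: int, m: int, terms: int = 100) -> float:
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:  # the p-value is effectively 1 here and the series converges slowly
        return 1.0
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2) for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_drift(
    baseline_cols: Sequence[Sequence[float]],
    current_cols: Sequence[Sequence[float]],
    alpha: float = 0.05,
) -> List[int]:
    """Return indices of features whose p-value is below alpha / n_features."""
    corrected_alpha = alpha / len(baseline_cols)  # Bonferroni correction
    drifted = []
    for idx, (base, cur) in enumerate(zip(baseline_cols, current_cols)):
        d = ks_statistic(base, cur)
        if ks_p_value(d, len(base), len(cur)) < corrected_alpha:
            drifted.append(idx)
    return drifted


# Two synthetic features: the first is unchanged, the second is shifted by 0.5.
baseline = [[i / 100 for i in range(100)], [i / 100 for i in range(100)]]
current = [[i / 100 for i in range(100)], [0.5 + i / 100 for i in range(100)]]
drifted = detect_drift(baseline, current)
```

Only the shifted feature is flagged here; the unchanged one has a KS distance of zero and therefore a p-value of 1.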
Drift Types Monitored:
| Type | Definition | Detection Method |
|---|---|---|
| Data Drift | Feature distribution shift | KS test on input features |
| Target Drift | Label distribution shift | Chi-square test on predictions |
| Concept Drift | Relationship change | Performance degradation monitoring |
Metrics Published to Pushgateway:
- drift_detected: Binary indicator (0/1)
- drift_p_value: Statistical significance
- drift_distance: KS distance metric
- drift_check_timestamp: Last check time
Alerting Configuration
Prometheus Alert Rules:
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | Target down for 5m | Critical |
| HighErrorRate | 5xx rate > 10% | Warning |
| SlowRequests | P95 latency > 2s | Warning |
| DriftDetected | drift_detected = 1 | Warning |
Alertmanager Integration:
- Severity-based routing
- Email notifications
- Inhibition rules to prevent alert storms
Grafana Visualization
Dashboard Panels:
- Request Rate (gauge)
- Request Latency p50/p95 (time series)
- In-Progress Requests (stat panel)
- Error Rate 5xx (stat panel)
- Model Prediction Time (time series)
- Requests by Endpoint (bar chart)
Data Sources:
- Prometheus: Real-time metrics
- Pushgateway: Batch job metrics (drift detection)
HF Spaces Deployment
Both Prometheus and Grafana are deployed on Hugging Face Spaces via Nginx reverse proxy:
| Service | Production URL |
|---|---|
| Prometheus | https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/ |
| Grafana | https://dacrow13-hopcroft-skill-classification.hf.space/grafana/ |
This enables real-time monitoring of the production deployment without additional infrastructure.