Hopcroft-Skill-Classification / docs /design_choices.md

maurocarlu

nginx endpoints addition - grafana documentation update

70cbf15 22 days ago


	# Design Choices

	This document justifies the architectural and engineering decisions made during development of the Hopcroft project, with reference to established MLOps and software engineering practices.

	---

	## Table of Contents

	1. [Inception (Requirements Engineering)](#1-inception-requirements-engineering)
	2. [Reproducibility (Versioning & Pipelines)](#2-reproducibility-versioning--pipelines)
	3. [Quality Assurance](#3-quality-assurance)
	4. [API (Inference Service)](#4-api-inference-service)
	5. [Deployment (Containerization & CI/CD)](#5-deployment-containerization--cicd)
	6. [Monitoring](#6-monitoring)

	---

	## 1. Inception (Requirements Engineering)

	### Machine Learning Canvas

	The project adopted the Machine Learning Canvas framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions.

	\| Canvas Section \| Application \|
	\|----------------\|-------------\|
	\| Prediction Task \| Multi-label classification of 217 technical skills from GitHub issue text \|
	\| Decisions \| Automated developer assignment based on predicted skill requirements \|
	\| Value Proposition \| Reduced issue resolution time, optimized resource allocation \|
	\| Data Sources \| SkillScope DB (7,245 PRs from 11 Java repositories) \|
	\| Making Predictions \| Real-time classification upon issue creation \|
	\| Building Models \| Iterative improvement over RF+TF-IDF baseline \|
	\| Monitoring \| Continuous evaluation with drift detection \|

	The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md).

	### Functional vs Non-Functional Requirements

	#### Functional Requirements

	\| Requirement \| Target \| Metric \|
	\|-------------\|--------\|--------\|
	\| Precision \| ≥ Baseline \| True positives / Predicted positives \|
	\| Recall \| ≥ Baseline \| True positives / Actual positives \|
	\| Micro-F1 \| > Baseline \| Harmonic mean of precision and recall, computed from counts pooled across all labels \|
	\| Multi-label Support \| 217 skills \| Simultaneous prediction of multiple labels \|
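
The micro-averaged metrics in the table above can be illustrated with a small, dependency-free sketch. The function name `micro_scores` and the toy label matrices are assumptions made for illustration and are not taken from the project code; the point is only that true positives, false positives, and false negatives are pooled over every (issue, skill) cell before any ratio is computed.

```python
from typing import List, Tuple


def micro_scores(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) for binary label matrices."""
    tp = fp = fn = 0
    # Pool counts over every (sample, label) cell before computing any ratio.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 issues and 4 skills (the real label space has 217 skills).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
precision, recall, f1 = micro_scores(y_true, y_pred)
```

In this toy case the pooled counts are 4 true positives, 1 false positive, and 1 false negative, so precision, recall, and micro-F1 all equal 0.8. Because counts are pooled, frequent skills weigh more heavily than rare ones.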

	#### Non-Functional Requirements

	\| Category \| Requirement \| Implementation \|
	\|----------\|-------------\|----------------\|
	\| Reproducibility \| Auditable experiments \| MLflow tracking, DVC versioning \|
	\| Explainability \| Interpretable predictions \| Confidence scores per skill \|
	\| Performance \| Low latency inference \| FastAPI async, model caching \|
	\| Scalability \| Batch processing \| `/predict/batch` endpoint (max 100) \|
	\| Maintainability \| Clean code \| Ruff linting, type hints, docstrings \|

	### System-First vs Model-First Development

	The project adopted a System-First approach, prioritizing infrastructure and pipeline development before model optimization:

	```
	Timeline:
	┌──────────────────────────┬────────────────────────────┐
	│ Phase 1: Infrastructure  │ Phase 2: Model Development │
	│ - DVC/MLflow setup       │ - Feature engineering      │
	│ - CI/CD pipeline         │ - Hyperparameter tuning    │
	│ - Docker containers      │ - SMOTE/ADASYN experiments │
	│ - API skeleton           │ - Performance optimization │
	└──────────────────────────┴────────────────────────────┘
	```

	Rationale:
	- Enables rapid iteration once infrastructure is stable
	- Ensures reproducibility from day one
	- Reduces technical debt during model development
	- Facilitates team collaboration with shared tooling

	---

	## 2. Reproducibility (Versioning & Pipelines)

	### Code Versioning (Git)

	Standard Git workflow with branch protection:

	\| Branch \| Purpose \|
	\|--------\|---------\|
	\| `main` \| Production-ready code \|
	\| `feature/*` \| New development \|
	\| `milestone/*` \| Groups completed features before they are merged into `main` \|

	### Data & Model Versioning (DVC)

	Design Decision: Use DVC (Data Version Control) with DagsHub remote storage for large file management.

	```
	.dvc/config
	├── remote: origin
	├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
	└── auth: basic (credentials via environment)
	```

	Tracked Artifacts:

	\| File \| Purpose \|
	\|------\|---------\|
	\| `data/raw/skillscope_data.db` \| Original SQLite database \|
	\| `data/processed/*.npy` \| TF-IDF and embedding features \|
	\| `models/*.pkl` \| Trained models and vectorizers \|

	Versioning Workflow:
	```bash
	# Track new data
	dvc add data/raw/new_dataset.db
	git add data/raw/.gitignore data/raw/new_dataset.db.dvc

	# Push to remote
	dvc push
	git commit -m "Add new dataset version"
	git push
	```

	### Experiment Tracking (MLflow)

	Design Decision: Remote MLflow instance on DagsHub for collaborative experiment tracking.

	\| Configuration \| Value \|
	\|---------------\|-------\|
	\| Tracking URI \| `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow` \|
	\| Experiments \| `skill_classification`, `skill_prediction_api` \|

	Logged Metrics:
	- Training: precision, recall, F1-score, training time
	- Inference: prediction latency, confidence scores, timestamps

	Artifact Storage:
	- Model binaries (`.pkl`)
	- Vectorizers and scalers
	- Hyperparameter configurations

	### Auditable ML Pipeline

	The pipeline is designed for complete reproducibility:

	```
	┌──────────────┐    ┌──────────────┐    ┌──────────────┐
	│  dataset.py  │───▶│ features.py  │───▶│   train.py   │
	│  (DVC pull)  │    │   (TF-IDF)   │    │   (MLflow)   │
	└──────────────┘    └──────────────┘    └──────────────┘
	       │                   │                   │
	       ▼                   ▼                   ▼
	   .dvc files          .dvc files          MLflow Run
	```

	---

	## 3. Quality Assurance

	### Testing Strategy

	#### Static Analysis (Ruff)

	Design Decision: Use Ruff as the primary linter for speed and comprehensive rule coverage.

	\| Configuration \| Value \|
	\|---------------\|-------\|
	\| Line Length \| 88 (Black compatible) \|
	\| Target Python \| 3.10+ \|
	\| Rule Sets \| pycodestyle (PEP 8), isort, Pyflakes \|

	CI Integration:
	```yaml
	- name: Lint with Ruff
	  run: make lint
	```

	#### Dynamic Testing (Pytest)

	Test Organization:

	```
	tests/
	├── unit/ # Isolated function tests
	├── integration/ # Component interaction tests
	├── system/ # End-to-end tests
	├── behavioral/ # ML-specific tests
	├── deepchecks/ # Data validation
	└── great expectations/ # Schema validation
	```

	Markers for Selective Execution:
	```python
	@pytest.mark.unit
	@pytest.mark.integration
	@pytest.mark.system
	@pytest.mark.slow
	```

	### Model Validation vs Model Verification

	\| Concept \| Definition \| Implementation \|
	\|---------\|------------\|----------------\|
	\| Validation \| Does the model fit user needs? \| Micro-F1 vs baseline comparison \|
	\| Verification \| Is the model correctly built? \| Unit tests, behavioral tests \|

	### Behavioral Testing

	Design Decision: Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics.

	\| Test Type \| Count \| Purpose \|
	\|-----------\|-------\|---------\|
	\| Invariance \| 9 \| Stability under perturbations (typos, case changes) \|
	\| Directional \| 10 \| Expected behavior with keyword additions \|
	\| Minimum Functionality \| 17 \| Basic sanity checks on clear examples \|

	Example Invariance Test:
	```python
	def test_case_insensitivity():
	    """Model should predict same skills regardless of case."""
	    assert predict("Fix BUG") == predict("fix bug")
	```
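
The three test types in the table above can be sketched as follows. This is a minimal illustration and not the project's test suite: the keyword-based `predict` function, the `KEYWORD_SKILLS` map, and the skill names are hypothetical stand-ins for the trained model so that the example is self-contained.

```python
from typing import Dict, Set

# Hypothetical keyword-to-skill map standing in for the trained model.
KEYWORD_SKILLS: Dict[str, str] = {
    "sql": "Databases",
    "thread": "Concurrency",
    "http": "Networking",
}


def predict(text: str) -> Set[str]:
    """Toy predictor: return every skill whose keyword occurs in the text."""
    lowered = text.lower()
    return {skill for keyword, skill in KEYWORD_SKILLS.items() if keyword in lowered}


def test_invariance_case() -> None:
    # Invariance: changing letter case must not change the prediction.
    assert predict("Fix SQL query timeout") == predict("fix sql query timeout")


def test_directional_keyword_addition() -> None:
    # Directional: adding a skill keyword may add skills but must not drop any.
    base = predict("Fix SQL query timeout")
    extended = predict("Fix SQL query timeout in HTTP handler")
    assert base <= extended
    assert "Networking" in extended


def test_minimum_functionality() -> None:
    # Minimum functionality: an unambiguous example yields the obvious skill.
    assert predict("Deadlock between worker threads") == {"Concurrency"}
```

Against a real model, the same assertions would typically compare thresholded predictions, and invariance tests may need a tolerance on confidence scores rather than exact equality.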

	### Data Quality Checks

	#### Great Expectations (10 Tests)

	Design Decision: Validate data at pipeline boundaries to catch quality issues early.

	\| Validation Point \| Tests \|
	\|------------------\|-------\|
	\| Raw Database \| Schema, row count, required columns \|
	\| Feature Matrix \| No NaN/Inf, sparsity, SMOTE compatibility \|
	\| Label Matrix \| Binary format, distribution, consistency \|
	\| Train/Test Split \| No leakage, stratification \|
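
The kinds of boundary checks listed above can be sketched in plain Python. This is an illustration of the logic only and does not use the Great Expectations API; the helper names and sample data are assumptions introduced for the example.

```python
import math
from typing import List, Sequence


def has_only_finite(matrix: Sequence[Sequence[float]]) -> bool:
    """Feature matrix check: no NaN or Inf values anywhere."""
    return all(math.isfinite(value) for row in matrix for value in row)


def is_binary(matrix: Sequence[Sequence[int]]) -> bool:
    """Label matrix check: every entry is 0 or 1."""
    return all(value in (0, 1) for row in matrix for value in row)


def leaked_ids(train_ids: Sequence[int], test_ids: Sequence[int]) -> List[int]:
    """Split check: identifiers that appear in both train and test sets."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical sample data: one NaN feature and one leaked identifier.
features = [[0.0, 0.3], [0.7, float("nan")]]
labels = [[1, 0, 0], [0, 1, 1]]

features_ok = has_only_finite(features)   # False: the NaN is caught
labels_ok = is_binary(labels)             # True
overlap = leaked_ids([1, 2, 3], [3, 4])   # [3]
```

Running such checks at each pipeline boundary means a failure points at the stage that produced the bad data rather than surfacing later as a training error.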

	#### Deepchecks (24 Checks)

	Suites:
	- Data Integrity Suite (12 checks): Duplicates, nulls, correlations
	- Train-Test Validation Suite (12 checks): Leakage, drift, distribution

	Status: Production-ready (96% overall score)

	---

	## 4. API (Inference Service)

	### FastAPI Implementation

	Design Decision: Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation.

	Key Features:
	- Async lifespan management for model loading
	- Middleware for Prometheus metrics collection
	- Structured exception handling

	### RESTful Principles

	Design Decision: Follow REST best practices for intuitive API design.

	\| Principle \| Implementation \|
	\|-----------\|----------------\|
	\| Nouns, not verbs \| `/predictions` instead of `/getPrediction` \|
	\| Plural resources \| `/predictions`, `/issues` \|
	\| HTTP methods \| GET (retrieve), POST (create) \|
	\| Status codes \| 200 (OK), 201 (Created), 404 (Not Found), 500 (Error) \|

	Endpoint Design:

	\| Method \| Endpoint \| Action \|
	\|--------\|----------\|--------\|
	\| `POST` \| `/predict` \| Create new prediction \|
	\| `POST` \| `/predict/batch` \| Create batch predictions \|
	\| `GET` \| `/predictions` \| List predictions \|
	\| `GET` \| `/predictions/{run_id}` \| Get specific prediction \|

	### OpenAPI/Swagger Documentation

	Auto-generated documentation at runtime:
	- Swagger UI: `/docs`
	- ReDoc: `/redoc`
	- OpenAPI JSON: `/openapi.json`

	Pydantic Models for Schema Enforcement:
	```python
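	from typing import List, Optional

	from pydantic import BaseModel

	class SkillPrediction(BaseModel):
	    # Assumed shape for illustration; the actual model is defined in the project.
	    skill_name: str
	    confidence: float

	class IssueInput(BaseModel):
	    issue_text: str
	    repo_name: Optional[str] = None
	    pr_number: Optional[int] = None

	class PredictionResponse(BaseModel):
	    run_id: str
	    predictions: List[SkillPrediction]
	    model_version: str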
	```

	---

	## 5. Deployment (Containerization & CI/CD)

	### Docker Containerization

	Design Decision: Multi-stage Docker builds with security best practices.

	Dockerfile Features:
	- Python 3.10 slim base image (minimal footprint)
	- Non-root user for security
	- DVC integration for model pulling
	- Health check endpoint configuration

	Multi-Service Architecture:

	```
	docker-compose.yml
	├── hopcroft-api (FastAPI)
	│   ├── Port: 8080
	│   ├── Volumes: source code, logs
	│   └── Health check: /health
	│
	├── hopcroft-gui (Streamlit)
	│   ├── Port: 8501
	│   ├── Depends on: hopcroft-api
	│   └── Environment: API_BASE_URL
	│
	└── hopcroft-net (Bridge network)
	```

	Design Rationale:
	- Separation of concerns (API vs GUI)
	- Independent scaling
	- Health-based dependency management
	- Shared network for internal communication

	### CI/CD Pipeline (GitHub Actions)

	Design Decision: Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds.

	Pipeline Stages:

	```yaml
	Jobs:
	  unit-tests:
	    - Checkout code
	    - Setup Python 3.10
	    - Install dependencies
	    - Ruff linting
	    - Pytest unit tests
	    - Upload test report (on failure)

	  build-image:
	    - Needs: unit-tests
	    - Configure DVC credentials
	    - Pull models
	    - Build Docker image
	```

	Triggers:
	- Push to `main`, `feature/*`
	- Pull requests to `main`

	Secrets Management:
	- `DAGSHUB_USERNAME`: DagsHub authentication
	- `DAGSHUB_TOKEN`: DagsHub access token

	### Hugging Face Spaces Hosting

	Design Decision: Deploy on HF Spaces for free hosting with Docker SDK support. GPU hardware is available on Spaces as an optional upgrade.

	Configuration:
	```yaml
	---
	title: Hopcroft Skill Classification
	sdk: docker
	app_port: 7860
	---
	```

	Startup Flow:
	1. `start_space.sh` configures DVC credentials
	2. Pull models from DagsHub
	3. Start FastAPI (port 8000)
	4. Start Streamlit (port 8501)
	5. Start Nginx (port 7860) for routing

	Nginx Reverse Proxy:
	- `/` → Streamlit GUI
	- `/docs`, `/predict`, `/predictions` → FastAPI
	- `/prometheus` → Prometheus metrics
	- `/grafana` → Grafana dashboards

	---

	## 6. Monitoring

	### Resource-Level Monitoring

	Design Decision: Implement Prometheus metrics for real-time observability.

	\| Metric \| Type \| Purpose \|
	\|--------\|------\|---------\|
	\| `hopcroft_requests_total` \| Counter \| Request volume by endpoint \|
	\| `hopcroft_request_duration_seconds` \| Histogram \| Latency distribution (P50, P90, P99) \|
	\| `hopcroft_in_progress_requests` \| Gauge \| Concurrent request load \|
	\| `hopcroft_prediction_processing_seconds` \| Summary \| Model inference time \|
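
A Prometheus histogram stores cumulative bucket counts rather than individual latencies, so percentiles such as P50 or P95 are estimated at query time (in PromQL, with `histogram_quantile()`). The sketch below mirrors that linear interpolation in plain Python to show how the estimate arises. The bucket boundaries and counts are hypothetical, since the project's bucket configuration is not shown here.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The last bucket must be the +Inf bucket holding the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if upper_bound == float("inf"):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread uniformly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for request durations in seconds.
buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (2.0, 99), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)
p90 = histogram_quantile(0.90, buckets)
p95 = histogram_quantile(0.95, buckets)
```

The estimate is only as fine-grained as the bucket layout, so bucket boundaries should bracket any latency threshold used for alerting.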

	Middleware Implementation:
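	```python
	# `app` and the metric objects (IN_PROGRESS, REQUEST_LATENCY, REQUESTS_TOTAL)
	# are defined elsewhere in the application.
	@app.middleware("http")
	async def monitor_requests(request, call_next):
	    # Label values are derived from the incoming request.
	    method, endpoint = request.method, request.url.path
	    IN_PROGRESS.inc()
	    try:
	        # Time the downstream handler and record the latency observation.
	        with REQUEST_LATENCY.labels(method, endpoint).time():
	            response = await call_next(request)
	        REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc()
	        return response
	    finally:
	        # Always decrement the gauge, even if the handler raises.
	        IN_PROGRESS.dec()
	```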

	### Performance-Level Monitoring

	Model Staleness Indicators:
	- Prediction confidence trends over time
	- Drift detection alerts
	- Error rate monitoring

	### Drift Detection Strategy

	Design Decision: Implement statistical drift detection using Kolmogorov-Smirnov test with Bonferroni correction.

	\| Component \| Details \|
	\|-----------\|---------\|
	\| Algorithm \| KS Two-Sample Test \|
	\| Baseline \| 1000 samples from training data \|
	\| Threshold \| p-value < 0.05 / m, where m is the number of features tested (Bonferroni correction of α = 0.05) \|
	\| Execution \| Scheduled via cron or manual trigger \|
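
The strategy in the table above can be sketched without external dependencies. In practice a library routine such as `scipy.stats.ks_2samp` would typically be used; the version below computes the KS distance directly and uses the asymptotic Kolmogorov series for the p-value, which is an approximation. The function names, feature names, and sample data are assumptions made for illustration.

```python
import math
from bisect import bisect_right
from typing import Dict, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def ks_p_value(d: float, n: int, m: int) -> float:
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and the p-value is effectively 1.
        return 1.0
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2) for j in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


def detect_drift(
    baseline: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    alpha: float = 0.05,
) -> Dict[str, Dict[str, float]]:
    """Per-feature KS test with a Bonferroni-corrected threshold."""
    threshold = alpha / len(baseline)  # alpha divided by the number of tests
    report = {}
    for name, reference in baseline.items():
        d = ks_statistic(reference, current[name])
        p = ks_p_value(d, len(reference), len(current[name]))
        report[name] = {"distance": d, "p_value": p, "drift": p < threshold}
    return report


# Hypothetical data: feature_a is unchanged, feature_b is shifted by 50.
baseline = {
    "feature_a": [float(i) for i in range(100)],
    "feature_b": [float(i) for i in range(100)],
}
current = {
    "feature_a": [float(i) for i in range(100)],
    "feature_b": [float(i + 50) for i in range(100)],
}
report = detect_drift(baseline, current)
```

With two features, each p-value is compared against 0.05 / 2 = 0.025. Only `feature_b` is flagged, with a KS distance of 0.5. Dividing the threshold by the number of tests limits false alarms when many features are checked at once.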

	Drift Types Monitored:

	\| Type \| Definition \| Detection Method \|
	\|------\|------------\|------------------\|
	\| Data Drift \| Feature distribution shift \| KS test on input features \|
	\| Target Drift \| Label distribution shift \| Chi-square test on predictions \|
	\| Concept Drift \| Relationship change \| Performance degradation monitoring \|

	Metrics Published to Pushgateway:
	- `drift_detected`: Binary indicator (0/1)
	- `drift_p_value`: Statistical significance
	- `drift_distance`: KS distance metric
	- `drift_check_timestamp`: Last check time

	### Alerting Configuration

	Prometheus Alert Rules:

	\| Alert \| Condition \| Severity \|
	\|-------\|-----------\|----------\|
	\| `ServiceDown` \| Target down for 5m \| Critical \|
	\| `HighErrorRate` \| 5xx rate > 10% \| Warning \|
	\| `SlowRequests` \| P95 latency > 2s \| Warning \|
	\| `DriftDetected` \| drift_detected = 1 \| Warning \|

	Alertmanager Integration:
	- Severity-based routing
	- Email notifications
	- Inhibition rules to prevent alert storms

	### Grafana Visualization

	Dashboard Panels:
	1. Request Rate (gauge)
	2. Request Latency p50/p95 (time series)
	3. In-Progress Requests (stat panel)
	4. Error Rate 5xx (stat panel)
	5. Model Prediction Time (time series)
	6. Requests by Endpoint (bar chart)

	Data Sources:
	- Prometheus: Real-time metrics
	- Pushgateway: Batch job metrics (drift detection)

	### HF Spaces Deployment

	Both Prometheus and Grafana run inside the Hugging Face Space and are exposed through the Nginx reverse proxy:

	\| Service \| Production URL \|
	\|---------\|----------------\|
	\| Prometheus \| `https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/` \|
	\| Grafana \| `https://dacrow13-hopcroft-skill-classification.hf.space/grafana/` \|

	This enables real-time monitoring of the production deployment without additional infrastructure.