| # Design Choices | |
| This document gives the technical justification for the architectural and engineering decisions made during development of the Hopcroft project, following professional MLOps and software engineering standards. | |
| --- | |
| ## Table of Contents | |
| 1. [Inception (Requirements Engineering)](#1-inception-requirements-engineering) | |
| 2. [Reproducibility (Versioning & Pipelines)](#2-reproducibility-versioning--pipelines) | |
| 3. [Quality Assurance](#3-quality-assurance) | |
| 4. [API (Inference Service)](#4-api-inference-service) | |
| 5. [Deployment (Containerization & CI/CD)](#5-deployment-containerization--cicd) | |
| 6. [Monitoring](#6-monitoring) | |
| --- | |
| ## 1. Inception (Requirements Engineering) | |
| ### Machine Learning Canvas | |
| The project adopted the **Machine Learning Canvas** framework to systematically define the problem space before implementation. This structured approach ensures alignment between business objectives and technical solutions. | |
| | Canvas Section | Application | | |
| |----------------|-------------| | |
| | **Prediction Task** | Multi-label classification of 217 technical skills from GitHub issue text | | |
| | **Decisions** | Automated developer assignment based on predicted skill requirements | | |
| | **Value Proposition** | Reduced issue resolution time, optimized resource allocation | | |
| | **Data Sources** | SkillScope DB (7,245 PRs from 11 Java repositories) | | |
| | **Making Predictions** | Real-time classification upon issue creation | | |
| | **Building Models** | Iterative improvement over RF+TF-IDF baseline | | |
| | **Monitoring** | Continuous evaluation with drift detection | | |
| The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md). | |
| ### Functional vs Non-Functional Requirements | |
| #### Functional Requirements | |
| | Requirement | Target | Metric | | |
| |-------------|--------|--------| | |
| | **Precision** | ≥ Baseline | True positives / Predicted positives | |
| | **Recall** | ≥ Baseline | True positives / Actual positives | |
| | **Micro-F1** | > Baseline | Harmonic mean of precision and recall, computed from counts pooled across all labels | |
| | **Multi-label Support** | 217 skills | Simultaneous prediction of multiple labels | | |
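The three metrics in the table can be illustrated with a minimal sketch. It assumes predictions and ground truth are binary label matrices (one row per issue, one column per skill); the function name `micro_prf` and the toy data are hypothetical and not taken from the project code.

```python
from typing import List, Tuple


def micro_prf(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 for binary label matrices."""
    tp = fp = fn = 0
    # Pool the counts over every (issue, skill) cell before computing ratios.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 3 issues, 4 skills (the real label matrix has 217 columns).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
precision, recall, f1 = micro_prf(y_true, y_pred)
```

Because the counts are pooled before the ratios are taken, frequent skills weigh more than rare ones, which is the usual reason to report micro-averaged scores on imbalanced multi-label data.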
| #### Non-Functional Requirements | |
| | Category | Requirement | Implementation | | |
| |----------|-------------|----------------| | |
| | **Reproducibility** | Auditable experiments | MLflow tracking, DVC versioning | | |
| | **Explainability** | Interpretable predictions | Confidence scores per skill | | |
| | **Performance** | Low latency inference | FastAPI async, model caching | | |
| | **Scalability** | Batch processing | `/predict/batch` endpoint (max 100) | | |
| | **Maintainability** | Clean code | Ruff linting, type hints, docstrings | | |
| ### System-First vs Model-First Development | |
| The project adopted a **System-First** approach, prioritizing infrastructure and pipeline development before model optimization: | |
| ``` | |
| Timeline: | |
| ┌───────────────────────────┬───────────────────────────────┐ | |
| │ Phase 1: Infrastructure   │ Phase 2: Model Development    │ | |
| │ - DVC/MLflow setup        │ - Feature engineering         │ | |
| │ - CI/CD pipeline          │ - Hyperparameter tuning       │ | |
| │ - Docker containers       │ - SMOTE/ADASYN experiments    │ | |
| │ - API skeleton            │ - Performance optimization    │ | |
| └───────────────────────────┴───────────────────────────────┘ | |
| ``` | |
| **Rationale:** | |
| - Enables rapid iteration once infrastructure is stable | |
| - Ensures reproducibility from day one | |
| - Reduces technical debt during model development | |
| - Facilitates team collaboration with shared tooling | |
| --- | |
| ## 2. Reproducibility (Versioning & Pipelines) | |
| ### Code Versioning (Git) | |
| Standard Git workflow with branch protection: | |
| | Branch | Purpose | | |
| |--------|---------| | |
| | `main` | Production-ready code | | |
| | `feature/*` | New development | | |
| | `milestone/*` | Integration branch that groups completed features before they are merged into `main` | |
| ### Data & Model Versioning (DVC) | |
| **Design Decision:** Use DVC (Data Version Control) with DagsHub remote storage for large file management. | |
| ``` | |
| .dvc/config | |
| └── remote: origin | |
|     ├── url: https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc | |
|     └── auth: basic (credentials via environment) | |
| ``` | |
| **Tracked Artifacts:** | |
| | File | Purpose | | |
| |------|---------| | |
| | `data/raw/skillscope_data.db` | Original SQLite database | | |
| | `data/processed/*.npy` | TF-IDF and embedding features | | |
| | `models/*.pkl` | Trained models and vectorizers | | |
| **Versioning Workflow:** | |
| ```bash | |
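| # Track the new data file with DVC (writes new_dataset.db.dvc and updates .gitignore) | |
| dvc add data/raw/new_dataset.db | |
| # Commit the small pointer file to Git | |
| git add data/raw/.gitignore data/raw/new_dataset.db.dvc | |
| git commit -m "Add new dataset version" | |
| # Upload the data to the DVC remote, then publish the commit | |
| dvc push | |
| git push | |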
| ``` | |
| ### Experiment Tracking (MLflow) | |
| **Design Decision:** Remote MLflow instance on DagsHub for collaborative experiment tracking. | |
| | Configuration | Value | | |
| |---------------|-------| | |
| | Tracking URI | `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow` | | |
| | Experiments | `skill_classification`, `skill_prediction_api` | | |
| **Logged Metrics:** | |
| - Training: precision, recall, F1-score, training time | |
| - Inference: prediction latency, confidence scores, timestamps | |
| **Artifact Storage:** | |
| - Model binaries (`.pkl`) | |
| - Vectorizers and scalers | |
| - Hyperparameter configurations | |
| ### Auditable ML Pipeline | |
| The pipeline is designed for complete reproducibility: | |
| ``` | |
| ┌──────────────┐    ┌──────────────┐    ┌──────────────┐ | |
| │  dataset.py  │───▶│ features.py  │───▶│   train.py   │ | |
| │  (DVC pull)  │    │   (TF-IDF)   │    │   (MLflow)   │ | |
| └──────────────┘    └──────────────┘    └──────────────┘ | |
|        │                   │                   │ | |
|        ▼                   ▼                   ▼ | |
|    .dvc files          .dvc files          MLflow Run | |
| ``` | |
| --- | |
| ## 3. Quality Assurance | |
| ### Testing Strategy | |
| #### Static Analysis (Ruff) | |
| **Design Decision:** Use Ruff as the primary linter for speed and comprehensive rule coverage. | |
| | Configuration | Value | | |
| |---------------|-------| | |
| | Line Length | 88 (Black compatible) | | |
| | Target Python | 3.10+ | | |
| | Rule Sets | pycodestyle (PEP 8), isort, Pyflakes | |
| **CI Integration:** | |
| ```yaml | |
| - name: Lint with Ruff | |
|   run: make lint | |
| ``` | |
| #### Dynamic Testing (Pytest) | |
| **Test Organization:** | |
| ``` | |
| tests/ | |
| ├── unit/                # Isolated function tests | |
| ├── integration/         # Component interaction tests | |
| ├── system/              # End-to-end tests | |
| ├── behavioral/          # ML-specific tests | |
| ├── deepchecks/          # Data validation | |
| └── great expectations/  # Schema validation | |
| ``` | |
| **Markers for Selective Execution:** | |
| ```python | |
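| import pytest | |
| # Markers in use: unit, integration, system, slow (test names below are illustrative) | |
| @pytest.mark.unit  # select with: pytest -m unit | |
| def test_fast_path(): ... | |
| @pytest.mark.integration | |
| @pytest.mark.slow  # deselect with: pytest -m "not slow" | |
| def test_full_pipeline(): ... | |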
| ``` | |
| ### Model Validation vs Model Verification | |
| | Concept | Definition | Implementation | | |
| |---------|------------|----------------| | |
| | **Validation** | Does the model fit user needs? | Micro-F1 vs baseline comparison | | |
| | **Verification** | Is the model correctly built? | Unit tests, behavioral tests | | |
| ### Behavioral Testing | |
| **Design Decision:** Implement CheckList-inspired behavioral tests to evaluate model robustness beyond accuracy metrics. | |
| | Test Type | Count | Purpose | | |
| |-----------|-------|---------| | |
| | **Invariance** | 9 | Stability under perturbations (typos, case changes) | | |
| | **Directional** | 10 | Expected behavior with keyword additions | | |
| | **Minimum Functionality** | 17 | Basic sanity checks on clear examples | | |
| **Example Invariance Test:** | |
| ```python | |
| def test_case_insensitivity(): | |
|     """Model should predict same skills regardless of case.""" | |
|     assert predict("Fix BUG") == predict("fix bug") | |
| ``` | |
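The three test types in the table can be sketched side by side. This is only an illustration: `predict` here is a hypothetical keyword-based stand-in for the real model, and the skill names in `KEYWORD_SKILLS` are invented for the example.

```python
from typing import Dict, Set

# Hypothetical stand-in for the trained classifier: maps keywords to skills.
KEYWORD_SKILLS: Dict[str, str] = {
    "bug": "Debugging",
    "sql": "Databases",
    "thread": "Concurrency",
}


def predict(text: str) -> Set[str]:
    """Return the set of skills whose keyword appears in the issue text."""
    tokens = text.lower().split()
    return {skill for keyword, skill in KEYWORD_SKILLS.items() if keyword in tokens}


def test_case_invariance() -> None:
    # Invariance: a case change must not alter the predicted skills.
    assert predict("Fix BUG") == predict("fix bug")


def test_directional_keyword_addition() -> None:
    # Directional: adding a skill-specific keyword should add that skill
    # without removing the skills predicted before.
    before = predict("fix bug")
    after = predict("fix bug in sql query")
    assert before <= after
    assert "Databases" in after - before


def test_minimum_functionality() -> None:
    # Minimum functionality: an unambiguous example yields the obvious skill.
    assert predict("thread deadlock") == {"Concurrency"}
```

With a real model, exact set equality is often too strict for invariance tests; comparing the top-k skills or allowing a small confidence tolerance is a common relaxation.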
| ### Data Quality Checks | |
| #### Great Expectations (10 Tests) | |
| **Design Decision:** Validate data at pipeline boundaries to catch quality issues early. | |
| | Validation Point | Tests | | |
| |------------------|-------| | |
| | Raw Database | Schema, row count, required columns | | |
| | Feature Matrix | No NaN/Inf, sparsity, SMOTE compatibility | | |
| | Label Matrix | Binary format, distribution, consistency | | |
| | Train/Test Split | No leakage, stratification | | |
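The kind of property each validation point asserts can be shown in plain Python. The sketch below does not use the Great Expectations API; the function names are hypothetical and the inputs are assumed to be simple nested lists and integer record IDs.

```python
import math
from typing import List, Sequence


def all_finite(features: Sequence[Sequence[float]]) -> bool:
    """Feature matrix check: no NaN or Inf values anywhere."""
    return all(math.isfinite(value) for row in features for value in row)


def is_binary_matrix(labels: Sequence[Sequence[int]]) -> bool:
    """Label matrix check: every entry is 0 or 1."""
    return all(value in (0, 1) for row in labels for value in row)


def leaked_ids(train_ids: Sequence[int], test_ids: Sequence[int]) -> List[int]:
    """Train/test split check: IDs that appear in both partitions."""
    return sorted(set(train_ids) & set(test_ids))


# Toy inputs illustrating a passing and a failing case for each check.
clean_features = [[0.0, 1.5], [2.0, 0.3]]
broken_features = [[0.0, 1.5], [2.0, float("nan")]]
overlap = leaked_ids([1, 2, 3], [3, 4])
```

Running checks like these at each pipeline boundary lets a failure point to the stage that introduced the problem rather than surfacing later as a training error.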
| #### Deepchecks (24 Checks) | |
| **Suites:** | |
| - **Data Integrity Suite** (12 checks): Duplicates, nulls, correlations | |
| - **Train-Test Validation Suite** (12 checks): Leakage, drift, distribution | |
| **Status:** Production-ready (96% overall score) | |
| --- | |
| ## 4. API (Inference Service) | |
| ### FastAPI Implementation | |
| **Design Decision:** Use FastAPI for async request handling, automatic OpenAPI generation, and native Pydantic validation. | |
| **Key Features:** | |
| - Async lifespan management for model loading | |
| - Middleware for Prometheus metrics collection | |
| - Structured exception handling | |
| ### RESTful Principles | |
| **Design Decision:** Follow REST best practices for intuitive API design. | |
| | Principle | Implementation | | |
| |-----------|----------------| | |
| | **Nouns, not verbs** | `/predictions` instead of `/getPrediction` | | |
| | **Plural resources** | `/predictions`, `/issues` | | |
| | **HTTP methods** | GET (retrieve), POST (create) | | |
| | **Status codes** | 200 (OK), 201 (Created), 404 (Not Found), 500 (Internal Server Error) | |
| **Endpoint Design:** | |
| | Method | Endpoint | Action | | |
| |--------|----------|--------| | |
| | `POST` | `/predict` | Create new prediction | | |
| | `POST` | `/predict/batch` | Create batch predictions | | |
| | `GET` | `/predictions` | List predictions | | |
| | `GET` | `/predictions/{run_id}` | Get specific prediction | | |
| ### OpenAPI/Swagger Documentation | |
| **Auto-generated documentation at runtime:** | |
| - Swagger UI: `/docs` | |
| - ReDoc: `/redoc` | |
| - OpenAPI JSON: `/openapi.json` | |
| **Pydantic Models for Schema Enforcement:** | |
| ```python | |
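| from typing import List, Optional | |
| from pydantic import BaseModel | |
| class SkillPrediction(BaseModel): | |
|     # Assumed shape for illustration: one skill with its confidence score | |
|     skill_name: str | |
|     confidence: float | |
| class IssueInput(BaseModel): | |
|     issue_text: str | |
|     repo_name: Optional[str] = None | |
|     pr_number: Optional[int] = None | |
| class PredictionResponse(BaseModel): | |
|     run_id: str | |
|     predictions: List[SkillPrediction] | |
|     model_version: str | |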
| ``` | |
| --- | |
| ## 5. Deployment (Containerization & CI/CD) | |
| ### Docker Containerization | |
| **Design Decision:** Multi-stage Docker builds with security best practices. | |
| **Dockerfile Features:** | |
| - Python 3.10 slim base image (minimal footprint) | |
| - Non-root user for security | |
| - DVC integration for model pulling | |
| - Health check endpoint configuration | |
| **Multi-Service Architecture:** | |
| ``` | |
| docker-compose.yml | |
| ├── hopcroft-api (FastAPI) | |
| │   ├── Port: 8080 | |
| │   ├── Volumes: source code, logs | |
| │   └── Health check: /health | |
| │ | |
| ├── hopcroft-gui (Streamlit) | |
| │   ├── Port: 8501 | |
| │   ├── Depends on: hopcroft-api | |
| │   └── Environment: API_BASE_URL | |
| │ | |
| └── hopcroft-net (Bridge network) | |
| ``` | |
| **Design Rationale:** | |
| - Separation of concerns (API vs GUI) | |
| - Independent scaling | |
| - Health-based dependency management | |
| - Shared network for internal communication | |
| ### CI/CD Pipeline (GitHub Actions) | |
| **Design Decision:** Implement Continuous Delivery for ML (CD4ML) with automated testing and image builds. | |
| **Pipeline Stages:** | |
| ```yaml | |
| Jobs: | |
|   unit-tests: | |
|     - Checkout code | |
|     - Setup Python 3.10 | |
|     - Install dependencies | |
|     - Ruff linting | |
|     - Pytest unit tests | |
|     - Upload test report (on failure) | |
|   build-image: | |
|     - Needs: unit-tests | |
|     - Configure DVC credentials | |
|     - Pull models | |
|     - Build Docker image | |
| ``` | |
| **Triggers:** | |
| - Push to `main`, `feature/*` | |
| - Pull requests to `main` | |
| **Secrets Management:** | |
| - `DAGSHUB_USERNAME`: DagsHub authentication | |
| - `DAGSHUB_TOKEN`: DagsHub access token | |
| ### Hugging Face Spaces Hosting | |
| **Design Decision:** Deploy on HF Spaces for free hosting with Docker SDK support. The free tier runs on CPU hardware; GPU hardware is an optional paid upgrade. | |
| **Configuration:** | |
| ```yaml | |
| --- | |
| title: Hopcroft Skill Classification | |
| sdk: docker | |
| app_port: 7860 | |
| --- | |
| ``` | |
| **Startup Flow:** | |
| 1. `start_space.sh` configures DVC credentials | |
| 2. Pull models from DagsHub | |
| 3. Start FastAPI (port 8000) | |
| 4. Start Streamlit (port 8501) | |
| 5. Start Nginx (port 7860) for routing | |
| **Nginx Reverse Proxy:** | |
| - `/` → Streamlit GUI | |
| - `/docs`, `/predict`, `/predictions` → FastAPI | |
| - `/prometheus` → Prometheus metrics | |
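The routing rules above behave like longest-prefix matching, which is how Nginx resolves plain prefix `location` blocks. The sketch below models that behavior in Python; it assumes the Space uses plain prefix locations (the actual Nginx configuration is not shown here), and `ROUTES` and `resolve_upstream` are hypothetical names.

```python
from typing import Dict

# Path prefix -> upstream service, taken from the routing list above.
ROUTES: Dict[str, str] = {
    "/": "streamlit",
    "/docs": "fastapi",
    "/predict": "fastapi",
    "/predictions": "fastapi",
    "/prometheus": "prometheus",
}


def resolve_upstream(path: str) -> str:
    """Pick the upstream whose prefix is the longest match for the path."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    # "/" matches every absolute path, so it acts as the fallback route.
    return ROUTES[max(matches, key=len)]


batch_target = resolve_upstream("/predict/batch")
home_target = resolve_upstream("/")
```

Under this model the `/predict` prefix also covers `/predict/batch`, and any path without a more specific prefix falls through to the Streamlit GUI.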
| --- | |
| ## 6. Monitoring | |
| ### Resource-Level Monitoring | |
| **Design Decision:** Implement Prometheus metrics for real-time observability. | |
| | Metric | Type | Purpose | | |
| |--------|------|---------| | |
| | `hopcroft_requests_total` | Counter | Request volume by endpoint | | |
| | `hopcroft_request_duration_seconds` | Histogram | Latency distribution (P50, P90, P99) | | |
| | `hopcroft_in_progress_requests` | Gauge | Concurrent request load | | |
| | `hopcroft_prediction_processing_seconds` | Summary | Model inference time | | |
| **Middleware Implementation:** | |
| ```python | |
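| @app.middleware("http") | |
| async def monitor_requests(request, call_next): | |
|     # Label values come from the incoming request | |
|     method, endpoint = request.method, request.url.path | |
|     IN_PROGRESS.inc() | |
|     try: | |
|         with REQUEST_LATENCY.labels(method, endpoint).time(): | |
|             response = await call_next(request) | |
|         REQUESTS_TOTAL.labels(method, endpoint, response.status_code).inc() | |
|         return response | |
|     finally: | |
|         IN_PROGRESS.dec()  # always decrement, even if the handler raises | |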
| ``` | |
| ### Performance-Level Monitoring | |
| **Model Staleness Indicators:** | |
| - Prediction confidence trends over time | |
| - Drift detection alerts | |
| - Error rate monitoring | |
| ### Drift Detection Strategy | |
| **Design Decision:** Implement statistical drift detection using Kolmogorov-Smirnov test with Bonferroni correction. | |
| | Component | Details | | |
| |-----------|---------| | |
| | **Algorithm** | KS Two-Sample Test | | |
| | **Baseline** | 1000 samples from training data | | |
| | **Threshold** | p-value < 0.05 / m, i.e. the 0.05 significance level Bonferroni-corrected for the m features tested | |
| | **Execution** | Scheduled via cron or manual trigger | | |
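A minimal, dependency-free sketch of the drift check configured above. It assumes one KS test per feature with the 0.05 level divided by the number of features. The helper names (`ks_statistic`, `ks_p_value`, `detect_drift`) are hypothetical, and the p-value uses the asymptotic Kolmogorov series rather than an exact computation; a production job would more likely call `scipy.stats.ks_2samp`.

```python
import math
from bisect import bisect_right
from typing import Dict, Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Maximum distance between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    distance = 0.0
    # Both ECDFs are step functions, so the maximum occurs at a data point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


def ks_p_value(distance: float, n: int, m: int, terms: int = 100) -> float:
    """Asymptotic two-sided p-value (approximation, not exact)."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * distance
    if lam < 0.2:
        return 1.0  # the series is numerically 1 in this range
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_drift(
    baseline: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    alpha: float = 0.05,
) -> Dict[str, Tuple[float, float, bool]]:
    """Per feature: (KS distance, p-value, drift flag) with Bonferroni correction."""
    corrected_alpha = alpha / len(baseline)  # one test per feature
    results = {}
    for name, reference in baseline.items():
        d = ks_statistic(reference, current[name])
        p = ks_p_value(d, len(reference), len(current[name]))
        results[name] = (d, p, p < corrected_alpha)
    return results


# Toy data: "f0" is unchanged, "f1" is shifted entirely past the baseline range.
base = {"f0": [i / 100 for i in range(100)], "f1": [i / 100 for i in range(100)]}
live = {"f0": [i / 100 for i in range(100)], "f1": [1 + i / 100 for i in range(100)]}
results = detect_drift(base, live)
```

Dividing the significance level by the number of features keeps the family-wise false-alarm rate at 0.05, at the cost of lower sensitivity when many features are tested.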
| **Drift Types Monitored:** | |
| | Type | Definition | Detection Method | | |
| |------|------------|------------------| | |
| | **Data Drift** | Feature distribution shift | KS test on input features | | |
| | **Target Drift** | Label distribution shift | Chi-square test on predictions | | |
| | **Concept Drift** | Relationship change | Performance degradation monitoring | | |
| **Metrics Published to Pushgateway:** | |
| - `drift_detected`: Binary indicator (0/1) | |
| - `drift_p_value`: p-value of the KS test | |
| - `drift_distance`: KS statistic (maximum distance between the two empirical CDFs) | |
| - `drift_check_timestamp`: Last check time | |
| ### Alerting Configuration | |
| **Prometheus Alert Rules:** | |
| | Alert | Condition | Severity | | |
| |-------|-----------|----------| | |
| | `ServiceDown` | Target down for 5m | Critical | | |
| | `HighErrorRate` | 5xx rate > 10% | Warning | | |
| | `SlowRequests` | P95 latency > 2s | Warning | | |
| | `DriftDetected` | drift_detected = 1 | Warning | | |
| **Alertmanager Integration:** | |
| - Severity-based routing | |
| - Email notifications | |
| - Inhibition rules to prevent alert storms | |
| ### Grafana Visualization | |
| **Dashboard Panels:** | |
| 1. Request Rate (gauge) | |
| 2. Request Latency p50/p95 (time series) | |
| 3. In-Progress Requests (stat panel) | |
| 4. Error Rate 5xx (stat panel) | |
| 5. Model Prediction Time (time series) | |
| 6. Requests by Endpoint (bar chart) | |
| **Data Sources:** | |
| - Prometheus: Real-time metrics | |
| - Pushgateway: Batch job metrics (drift detection) | |
| ### HF Spaces Deployment | |
| Both Prometheus and Grafana run inside the Hugging Face Space and are exposed through the Nginx reverse proxy: | |
| | Service | Production URL | | |
| |---------|----------------| | |
| | Prometheus | `https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/` | | |
| | Grafana | `https://dacrow13-hopcroft-skill-classification.hf.space/grafana/` | | |
| This enables real-time monitoring of the production deployment without additional infrastructure. | |