# Milestone Summaries

This document summarizes the six project milestones and traces the evolution of the Hopcroft Skill Classification system from requirements engineering through production monitoring.

---

## Milestone 1: Requirements Engineering

**Objective:** Define the problem space, stakeholders, and success criteria using the Machine Learning Canvas framework.

### Key Deliverables

| Component | Description |
|-----------|-------------|
| **Prediction Task** | Multi-label classification of 217 technical skills from GitHub issue/PR text |
| **Stakeholders** | Project managers, team leads, developers |
| **Data Source** | SkillScope DB with 7,245 merged PRs from 11 Java repositories |
| **Success Metrics** | Micro-F1 score improvement over baseline, precision/recall balance |
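
Micro-F1 pools true positives, false positives, and false negatives across all labels before computing a single score, so frequent skills weigh more than rare ones. The following is a minimal sketch in plain Python, assuming labels arrive as 0/1 rows with one column per skill; the function name `micro_f1` and the zero-division convention are illustrative assumptions, not taken from the project code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary indicator rows (one column per label)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
score = micro_f1(y_true, y_pred)  # tp=2, fp=1, fn=1 -> 4 / 6
```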

### ML Canvas Framework

The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md), covering:

- **Value Proposition**: Automated task assignment optimization
- **Decisions**: Resource allocation for issue resolution
- **Data Collection**: Automated labeling via API call detection
- **Impact Simulation**: Outperform SkillScope RF + TF-IDF baseline
- **Monitoring**: Continuous evaluation with drift detection

### Identified Risks & Mitigations

| Risk | Mitigation Strategy |
|------|---------------------|
| Label imbalance (217 classes) | SMOTE, MLSMOTE, ADASYN oversampling |
| Text noise (URLs, HTML, code) | Custom preprocessing pipeline |
| Multi-label complexity | MultiOutputClassifier with stratified splits |

---

## Milestone 2: Data Management & Experiment Tracking

**Objective:** Establish end-to-end infrastructure for reproducible ML experiments.

### Data Pipeline

```
data/raw/           → dataset.py       → data/processed/
(SkillScope SQLite)   (HuggingFace)       (Clean CSV)
                           ↓
                      features.py
                           ↓
                    data/processed/
                    (TF-IDF/Embeddings)
```

### Key Components

1. **Data Management**
   - DVC setup with DagsHub remote storage
   - Git-ignored data and model directories
   - Version-controlled `.dvc` files for reproducibility

2. **Data Ingestion**
   - `dataset.py`: Downloads SkillScope from Hugging Face
   - Extracts SQLite database with cleanup

3. **Feature Engineering**
   - `features.py`: Text cleaning pipeline
     - URL/HTML/Markdown removal
     - Normalization and Porter stemming
     - TF-IDF vectorization (uni+bi-grams)
     - Sentence embedding generation

4. **Configuration**
   - `config.py`: Centralized paths, hyperparameters, MLflow URI

5. **Experiment Tracking**
   - MLflow with DagsHub remote
   - Logged metrics: precision, recall, F1-score
   - Artifact storage: models, vectorizers, scalers
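
The text cleaning step listed under Feature Engineering can be sketched with the standard library alone. This is an illustrative approximation, not the contents of `features.py`: the regular expressions, their order, and the name `clean_text` are assumptions. Porter stemming and TF-IDF vectorization are left out because they typically come from third-party libraries.

```python
import html
import re

MD_LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")  # [label](url)
URL_RE = re.compile(r"https?://\S+")               # bare URLs
HTML_TAG_RE = re.compile(r"<[^>]+>")               # <b>, </div>, ...
MD_SYMBOL_RE = re.compile(r"[#*`>~]+")             # heading/emphasis markers


def clean_text(text: str) -> str:
    """Strip Markdown links, URLs, HTML tags, and Markdown markers, then normalize."""
    text = MD_LINK_RE.sub(r"\1", text)   # keep the link label, drop the target
    text = URL_RE.sub(" ", text)
    text = HTML_TAG_RE.sub(" ", text)
    text = html.unescape(text)           # &amp; -> &
    text = MD_SYMBOL_RE.sub(" ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


raw = "## Fix <b>NPE</b> in [Parser](https://x.io/p) see https://a.b/c &amp; retry"
cleaned = clean_text(raw)
```

Underscores are deliberately kept so that identifiers such as `snake_case` names survive cleaning; a real pipeline may choose differently.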

### Training Actions

| Action | Description |
|--------|-------------|
| `baseline` | Random Forest with TF-IDF |
| `mlsmote` | Multi-label SMOTE oversampling |
| `ros` | Random Oversampling |
| `adasyn-pca` | ADASYN + PCA dimensionality reduction |
| `lightgbm` | LightGBM classifier |

---

## Milestone 3: Quality Assurance

**Objective:** Implement a comprehensive testing and validation framework for data quality and model robustness.

### Data Cleaning Pipeline

| Metric | Before | After | Resolution |
|--------|--------|-------|------------|
| Total Samples | 7,154 | 6,673 | -481 duplicates |
| Duplicates | 481 | 0 | Exact match removal |
| Label Conflicts | 640 | 0 | Majority voting |
| Data Leakage | Present | None | Train/test separation |
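
Duplicate removal and majority voting can be illustrated together. The sketch below assumes each sample is a `(text, labels)` pair and collapses all rows sharing the same text into one row whose labels are voted column by column. Whether the project merges the two steps this way, and how it breaks ties, is not stated in this document; both are assumptions of the example.

```python
from collections import defaultdict


def deduplicate(samples):
    """Collapse rows with identical text into one row with majority-voted labels."""
    groups = defaultdict(list)
    for text, labels in samples:
        groups[text].append(labels)

    cleaned = []
    for text, label_rows in groups.items():
        n = len(label_rows)
        # A label stays on when strictly more than half of the copies carry it
        # (ties resolve to 0; this tie rule is an assumption of the sketch).
        voted = tuple(int(2 * sum(col) > n) for col in zip(*label_rows))
        cleaned.append((text, voted))
    return cleaned


samples = [
    ("fix null pointer in parser", (1, 0, 0)),
    ("fix null pointer in parser", (1, 0, 0)),  # exact duplicate
    ("add retry to http client", (0, 1, 0)),
    ("add retry to http client", (0, 1, 1)),    # conflicting labels
    ("add retry to http client", (0, 1, 1)),
]
cleaned = deduplicate(samples)
```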

### Validation Frameworks

#### Great Expectations (10 Tests)

| Test | Purpose | Status |
|------|---------|--------|
| Database Schema | Validate SQLite structure | ✅ Pass |
| TF-IDF Matrix | No NaN/Inf, sparsity checks | ✅ Pass |
| Binary Labels | Values in {0,1} | ✅ Pass |
| Feature-Label Alignment | Row count consistency | ✅ Pass |
| Label Distribution | Min 5 occurrences per label | ✅ Pass |
| SMOTE Compatibility | Min 10 non-zero features | ✅ Pass |
| Multi-Output Format | >50% multi-label samples | ✅ Pass |
| Duplicate Detection | No duplicate features | ✅ Pass |
| Train-Test Separation | Zero intersection | ✅ Pass |
| Label Consistency | Same features → same labels | ✅ Pass |
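
The logic behind several of these checks is simple enough to show in plain Python. The sketch below does not use the Great Expectations API; it only illustrates what the Binary Labels, Feature-Label Alignment, Label Distribution, and Train-Test Separation tests assert. Function names and the toy data are hypothetical.

```python
def check_binary_labels(label_rows):
    """Every label value must be 0 or 1."""
    return all(v in (0, 1) for row in label_rows for v in row)


def check_alignment(features, label_rows):
    """Feature matrix and label matrix must have the same number of rows."""
    return len(features) == len(label_rows)


def check_min_label_count(label_rows, minimum=5):
    """Each label column must occur at least `minimum` times."""
    counts = [sum(col) for col in zip(*label_rows)]
    return all(c >= minimum for c in counts)


def check_train_test_disjoint(train_rows, test_rows):
    """No feature row may appear in both splits."""
    return not (set(map(tuple, train_rows)) & set(map(tuple, test_rows)))


features = [(0.1, 0.0), (0.0, 0.3), (0.2, 0.2)]
labels = [(1, 0), (1, 1), (0, 1)]
train, test = features[:2], features[2:]

report = {
    "binary_labels": check_binary_labels(labels),
    "alignment": check_alignment(features, labels),
    "min_label_count": check_min_label_count(labels, minimum=2),
    "train_test_disjoint": check_train_test_disjoint(train, test),
}
```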

#### Deepchecks (24 Checks)

- **Data Integrity Suite**: 92% score (12 checks)
- **Train-Test Validation Suite**: 100% score (12 checks)
- **Overall Status**: Production-ready (96% combined)

#### Behavioral Testing (36 Tests)

| Category | Tests | Description |
|----------|-------|-------------|
| Invariance | 9 | Typo, case, punctuation robustness |
| Directional | 10 | Keyword addition effects |
| Minimum Functionality | 17 | Basic skill predictions |
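
To make the three categories concrete, the sketch below runs one test of each kind against a toy keyword-based predictor. `predict_skills` and `KEYWORD_SKILLS` are hypothetical stand-ins for the trained model, used only so the example is self-contained; the real suite would call the actual classifier.

```python
import string

# Toy stand-in for the real model: maps keywords to skill labels.
KEYWORD_SKILLS = {"sql": "Database", "thread": "Concurrency", "http": "Networking"}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def predict_skills(text):
    tokens = text.lower().translate(_STRIP_PUNCT).split()
    return {
        skill
        for keyword, skill in KEYWORD_SKILLS.items()
        if any(keyword in token for token in tokens)
    }


def test_case_invariance():
    text = "Fix SQL query timeout"
    assert predict_skills(text) == predict_skills(text.upper())


def test_punctuation_invariance():
    assert predict_skills("Fix SQL query timeout") == predict_skills("Fix SQL query timeout!!!")


def test_directional_keyword_addition():
    base = predict_skills("Fix SQL query timeout")
    extended = predict_skills("Fix SQL query timeout in worker thread")
    # Adding a skill keyword should add a label without removing existing ones.
    assert base <= extended and "Concurrency" in extended


def test_minimum_functionality():
    assert predict_skills("HTTP client returns 500") == {"Networking"}
```

Functions named `test_*` like these are picked up by Pytest without further configuration.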

### Code Quality

- **Ruff Analysis**: 28 minor issues (100% fixable)
- **Standards**: PEP 8 compliant, Black compatible

Full details: [testing_and_validation.md](./testing_and_validation.md)

---

## Milestone 4: API Development

**Objective:** Implement a production-ready REST API for skill prediction with MLflow integration.

### Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/predict` | Single issue prediction |
| `POST` | `/predict/batch` | Batch predictions (max 100) |
| `GET` | `/predictions/{run_id}` | Retrieve by MLflow Run ID |
| `GET` | `/predictions` | List recent predictions |
| `GET` | `/health` | Service health check |
| `GET` | `/metrics` | Prometheus metrics |
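
Because `/predict/batch` accepts at most 100 items, a client with more issues has to split them across several requests. The helper below sketches that split. The payload field names `issues` and `issue_text` are hypothetical; the actual request schema is the one published in the OpenAPI docs.

```python
def chunk_batches(items, max_batch=100):
    """Split a list into consecutive chunks of at most `max_batch` items."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return [items[i:i + max_batch] for i in range(0, len(items), max_batch)]


def build_batch_payloads(issue_texts, max_batch=100):
    """Build one JSON-serializable payload per batch request."""
    return [
        # "issues" and "issue_text" are placeholder field names.
        {"issues": [{"issue_text": text} for text in chunk]}
        for chunk in chunk_batches(issue_texts, max_batch)
    ]


texts = [f"issue {i}" for i in range(250)]
payloads = build_batch_payloads(texts)
batch_sizes = [len(p["issues"]) for p in payloads]
```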

### Features

- **FastAPI Framework**: Async request handling, auto-generated OpenAPI docs
- **MLflow Integration**: All predictions logged with metadata
- **Pydantic Validation**: Request/response schema enforcement
- **Prometheus Metrics**: Request counters, latency histograms, gauges

### Documentation Access

- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`

---

## Milestone 5: Deployment & Containerization

**Objective:** Implement containerized deployment with a CI/CD pipeline for production delivery.

### Docker Architecture

```
docker/docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Port: 8080
│   ├── Health Check: /health
│   └── Volumes: source code, logs
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Port: 8501
│   └── Depends on: hopcroft-api
│
└── hopcroft-net (Bridge Network)
```

### Hugging Face Spaces Deployment

| Component | Configuration |
|-----------|---------------|
| SDK | Docker |
| Port | 7860 |
| Startup Script | `docker/scripts/start_space.sh` |
| Secrets | `DAGSHUB_USERNAME`, `DAGSHUB_TOKEN` |

**Startup Flow:**
1. Configure DVC with secrets
2. Pull models from DagsHub
3. Start FastAPI (port 8000)
4. Start Streamlit (port 8501)
5. Start Nginx reverse proxy (port 7860)

### CI/CD Pipeline (GitHub Actions)

```yaml
# Summary outline of the workflow, not the literal workflow file
triggers: push/PR to main, feature/*
jobs:
  unit-tests:
    - Ruff linting
    - Pytest unit tests
    - HTML report generation
  build-image:  # requires unit-tests
    - DVC model pull
    - Docker image build
```

---

## Milestone 6: Monitoring & Observability

**Objective:** Implement comprehensive monitoring infrastructure with drift detection.

### Prometheus Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `hopcroft_requests_total` | Counter | Total requests by method/endpoint |
| `hopcroft_request_duration_seconds` | Histogram | Request latency distribution |
| `hopcroft_in_progress_requests` | Gauge | Currently processing requests |
| `hopcroft_prediction_processing_seconds` | Summary | Model inference time |

### Grafana Dashboards

- **Request Rate**: Real-time requests per second
- **Request Latency (p50, p95)**: Response time percentiles
- **In-Progress Requests**: Currently processing requests
- **Error Rate (5xx)**: Failed request percentage
- **Model Prediction Time**: Inference latency
- **Requests by Endpoint**: Traffic distribution

### Data Drift Detection

| Component | Details |
|-----------|---------|
| Algorithm | Kolmogorov-Smirnov Two-Sample Test |
| Baseline | 1000 samples from training data |
| Threshold | p-value < 0.05 (Bonferroni-corrected) |
| Metrics | `drift_detected`, `drift_p_value`, `drift_distance` |
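
The drift check can be sketched in pure Python. This is an illustration under stated assumptions, not the project's implementation: the p-value uses the standard asymptotic Kolmogorov approximation, the Bonferroni correction is taken to mean dividing the 0.05 threshold by the number of features tested, and `drift_distance` is assumed to be the KS statistic. In practice, `scipy.stats.ks_2samp` is the usual way to obtain the statistic and p-value.

```python
import math


def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_p_value(d, n, m):
    """Asymptotic two-sided p-value for the two-sample KS statistic."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return 1.0  # the series below is numerically 1 in this range
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * total))


def detect_drift(baseline, current, n_features=1, alpha=0.05):
    """Compare one feature's baseline and live samples; n_features sets the Bonferroni divisor."""
    d = ks_statistic(baseline, current)
    p = ks_p_value(d, len(baseline), len(current))
    return {"drift_detected": p < alpha / n_features, "drift_p_value": p, "drift_distance": d}


baseline = [i / 1000 for i in range(1000)]   # 1000 baseline samples
shifted = [x + 0.5 for x in baseline]        # synthetic drifted feature
same = detect_drift(baseline, list(baseline), n_features=10)
drifted = detect_drift(baseline, shifted, n_features=10)
```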

### Alerting Rules

| Alert | Condition |
|-------|-----------|
| `ServiceDown` | Target unreachable for 5m |
| `HighErrorRate` | 5xx rate > 10% for 5m |
| `SlowRequests` | P95 latency > 2s |

### Load Testing (Locust)

| Task | Weight | Endpoint |
|------|--------|----------|
| Single Prediction | 60% | `POST /predict` |
| Batch Prediction | 20% | `POST /predict/batch` |
| Monitoring | 20% | `GET /health`, `GET /predictions` |

### HF Spaces Monitoring Access

Both Prometheus and Grafana are available on the production deployment:

| Service | URL |
|---------|-----|
| Prometheus | https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/ |
| Grafana | https://dacrow13-hopcroft-skill-classification.hf.space/grafana/ |

### Uptime Monitoring (Better Stack)

- External monitoring from multiple locations
- Email notifications on failures
- Tracked endpoints: `/health`, `/openapi.json`, `/docs`