	# Milestone Summaries

	This document summarizes the six project milestones and traces the evolution of the Hopcroft Skill Classification system from requirements engineering through production monitoring.

	---

	## Milestone 1: Requirements Engineering

	Objective: Define the problem space, stakeholders, and success criteria using the Machine Learning Canvas framework.

	### Key Deliverables

	\| Component \| Description \|
	\|-----------\|-------------\|
	\| Prediction Task \| Multi-label classification of 217 technical skills from GitHub issue/PR text \|
	\| Stakeholders \| Project managers, team leads, developers \|
	\| Data Source \| SkillScope DB with 7,245 merged PRs from 11 Java repositories \|
	\| Success Metrics \| Micro-F1 score improvement over baseline, precision/recall balance \|
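
As a reference for the success metric above, here is a minimal sketch of micro-averaged F1 on binary indicator matrices (rows are samples, columns are labels). The function name `micro_f1` and the toy data are illustrative assumptions, not the project's evaluation code.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Toy example with 2 samples and 3 labels (made-up values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
score = micro_f1(y_true, y_pred)
print(round(score, 4))
```

Because counts are pooled across labels, frequent labels dominate the score, which is worth keeping in mind given the label imbalance across 217 classes.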

	### ML Canvas Framework

	The complete ML Canvas is documented in [ML Canvas.md](./ML%20Canvas.md), covering:

	- Value Proposition: Automated task assignment optimization
	- Decisions: Resource allocation for issue resolution
	- Data Collection: Automated labeling via API call detection
	- Impact Simulation: Outperform SkillScope RF + TF-IDF baseline
	- Monitoring: Continuous evaluation with drift detection

	### Identified Risks & Mitigations

	\| Risk \| Mitigation Strategy \|
	\|------\|---------------------\|
	\| Label imbalance (217 classes) \| SMOTE, MLSMOTE, ADASYN oversampling \|
	\| Text noise (URLs, HTML, code) \| Custom preprocessing pipeline \|
	\| Multi-label complexity \| MultiOutputClassifier with stratified splits \|

	---

	## Milestone 2: Data Management & Experiment Tracking

	Objective: Establish end-to-end infrastructure for reproducible ML experiments.

	### Data Pipeline

	```
	data/raw/ → dataset.py → data/processed/
	(SkillScope SQLite) (HuggingFace) (Clean CSV)
	↓
	features.py
	↓
	data/processed/
	(TF-IDF/Embeddings)
	```

	### Key Components

	1. Data Management
	- DVC setup with DagsHub remote storage
	- Git-ignored data and model directories
	- Version-controlled `.dvc` files for reproducibility

	2. Data Ingestion
	- `dataset.py`: Downloads SkillScope from Hugging Face
	- Extracts SQLite database with cleanup

	3. Feature Engineering
	- `features.py`: Text cleaning pipeline
	- URL/HTML/Markdown removal
	- Normalization and Porter stemming
	- TF-IDF vectorization (uni+bi-grams)
	- Sentence embedding generation

	4. Configuration
	- `config.py`: Centralized paths, hyperparameters, MLflow URI

	5. Experiment Tracking
	- MLflow with DagsHub remote
	- Logged metrics: precision, recall, F1-score
	- Artifact storage: models, vectorizers, scalers
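
The text-cleaning steps listed under Feature Engineering can be sketched with the standard library alone. This is an assumption-laden illustration, not the contents of `features.py`: the regexes, the function names `clean_text` and `uni_bi_grams`, and the sample string are all hypothetical, and Porter stemming and the actual TF-IDF weighting are left out because they would need external libraries.

```python
import re

# Hypothetical patterns; the real pipeline may use different rules.
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")  # [text](target)
URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
INLINE_CODE_RE = re.compile(r"`[^`]*`")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text: str) -> str:
    """Strip Markdown links, URLs, HTML tags and inline code, then normalize."""
    text = MD_LINK_RE.sub(r"\1", text)  # keep the link text, drop the target
    text = URL_RE.sub(" ", text)
    text = HTML_TAG_RE.sub(" ", text)
    text = INLINE_CODE_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)  # removes punctuation and Markdown markers
    return " ".join(text.split())  # collapse runs of whitespace


def uni_bi_grams(tokens: list) -> list:
    """Return unigrams followed by bigrams, the term set a uni+bi-gram vectorizer sees."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


raw = "Fix <b>NPE</b> in [parser](https://example.com/x) see https://foo.bar/baz **now**!"
cleaned = clean_text(raw)
terms = uni_bi_grams(cleaned.split())
print(cleaned)
print(terms)
```

Markdown links are handled before bare URLs so that the link text survives while the target is dropped.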

	### Training Actions

	\| Action \| Description \|
	\|--------\|-------------\|
	\| `baseline` \| Random Forest with TF-IDF \|
	\| `mlsmote` \| Multi-label SMOTE oversampling \|
	\| `ros` \| Random Oversampling \|
	\| `adasyn-pca` \| ADASYN + PCA dimensionality reduction \|
	\| `lightgbm` \| LightGBM classifier \|

	---

	## Milestone 3: Quality Assurance

	Objective: Implement a comprehensive testing and validation framework for data quality and model robustness.

	### Data Cleaning Pipeline

	\| Metric \| Before \| After \| Resolution \|
	\|--------\|--------\|-------\|------------\|
	\| Total Samples \| 7,154 \| 6,673 \| -481 duplicates \|
	\| Duplicates \| 481 \| 0 \| Exact match removal \|
	\| Label Conflicts \| 640 \| 0 \| Majority voting \|
	\| Data Leakage \| Present \| 0 \| Train/test separation \|
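
The two resolution strategies in the table, exact-match removal and majority voting, can be illustrated as follows. This is one plausible reading only: the document does not say whether voting is done per label or per label vector, nor in which order the steps run. The sketch below votes per label, treats ties as 0, and uses hypothetical function names and toy data.

```python
from collections import defaultdict


def drop_exact_duplicates(samples):
    """Keep the first occurrence of each (text, labels) pair."""
    seen = set()
    unique = []
    for text, labels in samples:
        key = (text, tuple(labels))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def majority_vote_labels(samples):
    """For each text, set a label to 1 only if a strict majority of its rows have it."""
    groups = defaultdict(list)
    for text, labels in samples:
        groups[text].append(tuple(labels))
    resolved = {}
    for text, vectors in groups.items():
        n = len(vectors)
        # zip(*vectors) iterates label columns across all rows sharing this text.
        resolved[text] = tuple(1 if 2 * sum(col) > n else 0 for col in zip(*vectors))
    return resolved


# Toy data: one exact duplicate and one text with conflicting label vectors.
samples = [
    ("fix npe", (1, 0, 0)),
    ("fix npe", (1, 0, 0)),
    ("add cache", (0, 1, 0)),
    ("add cache", (0, 1, 1)),
    ("add cache", (0, 1, 1)),
]
deduped = drop_exact_duplicates(samples)
voted = majority_vote_labels(samples)
print(deduped)
print(voted)
```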

	### Validation Frameworks

	#### Great Expectations (10 Tests)

	\| Test \| Purpose \| Status \|
	\|------\|---------\|--------\|
	\| Database Schema \| Validate SQLite structure \| ✅ Pass \|
	\| TF-IDF Matrix \| No NaN/Inf, sparsity checks \| ✅ Pass \|
	\| Binary Labels \| Values in {0,1} \| ✅ Pass \|
	\| Feature-Label Alignment \| Row count consistency \| ✅ Pass \|
	\| Label Distribution \| Min 5 occurrences per label \| ✅ Pass \|
	\| SMOTE Compatibility \| Min 10 non-zero features \| ✅ Pass \|
	\| Multi-Output Format \| >50% multi-label samples \| ✅ Pass \|
	\| Duplicate Detection \| No duplicate features \| ✅ Pass \|
	\| Train-Test Separation \| Zero intersection \| ✅ Pass \|
	\| Label Consistency \| Same features → same labels \| ✅ Pass \|
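
Several of the checks in this table reduce to simple set and count operations. The sketch below shows plain-Python equivalents of four of them. It does not use the Great Expectations API, and the function names, thresholds in the example, and toy data are assumptions for illustration.

```python
def non_binary_labels(label_rows):
    """Binary Labels: return every value that is not 0 or 1."""
    return {v for row in label_rows for v in row if v not in (0, 1)}


def rare_labels(label_rows, min_count=5):
    """Label Distribution: indices of labels with fewer than min_count positives."""
    counts = [sum(col) for col in zip(*label_rows)]
    return [i for i, c in enumerate(counts) if c < min_count]


def train_test_overlap(train_items, test_items):
    """Train-Test Separation: items present in both splits (should be empty)."""
    return set(train_items) & set(test_items)


def inconsistent_features(rows):
    """Label Consistency: feature vectors that map to more than one label vector."""
    first_seen = {}
    bad = set()
    for features, labels in rows:
        key, labels = tuple(features), tuple(labels)
        if first_seen.setdefault(key, labels) != labels:
            bad.add(key)
    return bad


# Toy inputs chosen so that every check reports a violation.
labels = [(1, 0), (1, 0), (1, 1)]
rows = [((0.1, 0.2), (1, 0)), ((0.1, 0.2), (0, 1)), ((0.3, 0.0), (1, 0))]
rare = rare_labels(labels, min_count=2)
overlap = train_test_overlap(["a", "b", "c"], ["d", "b"])
conflicts = inconsistent_features(rows)
print(rare, overlap, conflicts)
```

In a passing run, as reported in the table, each helper would return an empty collection.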

	#### Deepchecks (24 Checks)

	- Data Integrity Suite: 92% score (12 checks)
	- Train-Test Validation Suite: 100% score (12 checks)
	- Overall Status: Production-ready (96% combined)

	#### Behavioral Testing (36 Tests)

	\| Category \| Tests \| Description \|
	\|----------\|-------\|-------------\|
	\| Invariance \| 9 \| Typo, case, punctuation robustness \|
	\| Directional \| 10 \| Keyword addition effects \|
	\| Minimum Functionality \| 17 \| Basic skill predictions \|
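
To make the invariance and directional categories concrete, here is a small sketch built around a keyword-based stand-in for the model. `toy_predict`, its keyword map, and the skill names are hypothetical placeholders; the real tests would call the trained classifier instead.

```python
import string


def toy_predict(text):
    """Hypothetical stand-in for the real model: simple keyword lookup."""
    keywords = {"sql": "Database", "thread": "Concurrency", "http": "Networking"}
    normalized = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {skill for kw, skill in keywords.items() if kw in normalized}


def check_invariance(predict, text, perturb):
    """Invariance: a label-preserving perturbation must not change the prediction."""
    return predict(text) == predict(perturb(text))


def check_directional(predict, text, addition, expected_skill):
    """Directional: adding a keyword should add the skill without removing others."""
    before = predict(text)
    after = predict(f"{text} {addition}")
    return expected_skill in after and before <= after


issue = "Fix SQL query timeout!"
case_ok = check_invariance(toy_predict, issue, str.upper)
punct_ok = check_invariance(toy_predict, issue, lambda t: t.replace("!", ""))
direction_ok = check_directional(toy_predict, issue, "in worker thread pool", "Concurrency")
print(case_ok, punct_ok, direction_ok)
```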

	### Code Quality

	- Ruff Analysis: 28 minor issues (100% fixable)
	- Standards: PEP 8 compliant, Black compatible

	Full details: [testing_and_validation.md](./testing_and_validation.md)

	---

	## Milestone 4: API Development

	Objective: Implement a production-ready REST API for skill prediction with MLflow integration.

	### Endpoints

	\| Method \| Endpoint \| Description \|
	\|--------\|----------\|-------------\|
	\| `POST` \| `/predict` \| Single issue prediction \|
	\| `POST` \| `/predict/batch` \| Batch predictions (max 100) \|
	\| `GET` \| `/predictions/{run_id}` \| Retrieve by MLflow Run ID \|
	\| `GET` \| `/predictions` \| List recent predictions \|
	\| `GET` \| `/health` \| Service health check \|
	\| `GET` \| `/metrics` \| Prometheus metrics \|
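
Since `/predict/batch` accepts at most 100 items, a client with more issues has to split them into several requests. The sketch below shows one way to do that. The payload shape (`issues`, `issue_text`) is a hypothetical placeholder; the actual request schema is the one published at `/openapi.json`.

```python
import json
from typing import Iterator, List

MAX_BATCH_SIZE = 100  # limit stated for /predict/batch


def chunk(items: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_payloads(issue_texts: List[str]) -> List[str]:
    """Serialize one JSON body per batch (field names are assumptions)."""
    return [
        json.dumps({"issues": [{"issue_text": text} for text in batch]})
        for batch in chunk(issue_texts)
    ]


texts = [f"issue {i}" for i in range(250)]
payloads = build_batch_payloads(texts)
sizes = [len(json.loads(body)["issues"]) for body in payloads]
print(sizes)
```

Each payload would then be sent as the body of a separate `POST /predict/batch` call.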

	### Features

	- FastAPI Framework: Async request handling, auto-generated OpenAPI docs
	- MLflow Integration: All predictions logged with metadata
	- Pydantic Validation: Request/response schema enforcement
	- Prometheus Metrics: Request counters, latency histograms, gauges

	### Documentation Access

	- Swagger UI: `/docs`
	- ReDoc: `/redoc`
	- OpenAPI JSON: `/openapi.json`

	---

	## Milestone 5: Deployment & Containerization

	Objective: Implement containerized deployment with a CI/CD pipeline for production delivery.

	### Docker Architecture

	```
	docker/docker-compose.yml
	├── hopcroft-api (FastAPI Backend)
	│   ├── Port: 8080
	│   ├── Health Check: /health
	│   └── Volumes: source code, logs
	│
	├── hopcroft-gui (Streamlit Frontend)
	│   ├── Port: 8501
	│   └── Depends on: hopcroft-api
	│
	└── hopcroft-net (Bridge Network)
	```

	### Hugging Face Spaces Deployment

	\| Component \| Configuration \|
	\|-----------\|---------------\|
	\| SDK \| Docker \|
	\| Port \| 7860 \|
	\| Startup Script \| `docker/scripts/start_space.sh` \|
	\| Secrets \| `DAGSHUB_USERNAME`, `DAGSHUB_TOKEN` \|

	Startup Flow:
	1. Configure DVC with secrets
	2. Pull models from DagsHub
	3. Start FastAPI (port 8000)
	4. Start Streamlit (port 8501)
	5. Start Nginx reverse proxy (port 7860)

	### CI/CD Pipeline (GitHub Actions)

	```yaml
	# Workflow outline (not the literal workflow file)
	triggers: push/PR to main, feature/*
	jobs:
	  unit-tests:
	    - Ruff linting
	    - Pytest unit tests
	    - HTML report generation
	  build-image:  # requires unit-tests
	    - DVC model pull
	    - Docker image build
	```

	---

	## Milestone 6: Monitoring & Observability

	Objective: Implement comprehensive monitoring infrastructure with drift detection.

	### Prometheus Metrics

	\| Metric \| Type \| Description \|
	\|--------\|------\|-------------\|
	\| `hopcroft_requests_total` \| Counter \| Total requests by method/endpoint \|
	\| `hopcroft_request_duration_seconds` \| Histogram \| Request latency distribution \|
	\| `hopcroft_in_progress_requests` \| Gauge \| Currently processing requests \|
	\| `hopcroft_prediction_processing_seconds` \| Summary \| Model inference time \|

	### Grafana Dashboards

	- Request Rate: Real-time requests per second
	- Request Latency (p50, p95): Response time percentiles
	- In-Progress Requests: Currently processing requests
	- Error Rate (5xx): Failed request percentage
	- Model Prediction Time: Inference latency
	- Requests by Endpoint: Traffic distribution
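
Latency percentiles such as p50 and p95 are typically estimated from histogram buckets rather than from raw samples. The sketch below estimates a quantile from cumulative bucket counts by linear interpolation, in the spirit of PromQL's `histogram_quantile`. Whether the dashboards use exactly this query is an assumption, and the bucket bounds and counts are made-up values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with an infinite bound.
    The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the overflow bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for latency buckets in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
print(p50, p95)
```

The estimate is only as fine-grained as the bucket layout, so bucket bounds near the 2s alert threshold matter more than buckets far from it.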

	### Data Drift Detection

	\| Component \| Details \|
	\|-----------\|---------\|
	\| Algorithm \| Kolmogorov-Smirnov Two-Sample Test \|
	\| Baseline \| 1000 samples from training data \|
	\| Threshold \| p-value < 0.05 (Bonferroni corrected) \|
	\| Metrics \| `drift_detected`, `drift_p_value`, `drift_distance` \|
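
The drift check described in the table can be sketched in pure Python. This is not the project's implementation: the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction, a library routine would normally be used instead, and the function names and feature data are hypothetical. The Bonferroni correction divides the 0.05 threshold by the number of features tested.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum distance between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n, m, max_terms=10_000):
    """Asymptotic two-sided p-value (an approximation, not an exact test)."""
    if d == 0:
        return 1.0
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    total, sign = 0.0, 1.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * (k * lam) ** 2)
        total += sign * term
        if term < 1e-12:
            return min(1.0, max(0.0, 2.0 * total))
        sign = -sign
    return 1.0  # series not converged: lam is tiny, so the p-value is effectively 1


def detect_drift(baseline, current, alpha=0.05):
    """Run one KS test per feature against a Bonferroni-corrected threshold."""
    corrected_alpha = alpha / len(baseline)
    report = {}
    for name, reference in baseline.items():
        d = ks_statistic(reference, current[name])
        p = ks_p_value(d, len(reference), len(current[name]))
        report[name] = {
            "drift_distance": d,
            "drift_p_value": p,
            "drift_detected": p < corrected_alpha,
        }
    return report


# Made-up features: f1 is unchanged, f2 is shifted by a full unit.
ramp = [i / 100 for i in range(100)]
report = detect_drift({"f1": ramp, "f2": ramp}, {"f1": ramp, "f2": [1 + x for x in ramp]})
print(report)
```

The report keys mirror the exported metric names `drift_detected`, `drift_p_value`, and `drift_distance`.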

	### Alerting Rules

	\| Alert \| Condition \|
	\|-------\|-----------\|
	\| `ServiceDown` \| Target unreachable for 5m \|
	\| `HighErrorRate` \| 5xx rate > 10% for 5m \|
	\| `SlowRequests` \| P95 latency > 2s \|

	### Load Testing (Locust)

	\| Task \| Weight \| Endpoint \|
	\|------\|--------\|----------\|
	\| Single Prediction \| 60% \| `POST /predict` \|
	\| Batch Prediction \| 20% \| `POST /predict/batch` \|
	\| Monitoring \| 20% \| `GET /health`, `/predictions` \|

	### HF Spaces Monitoring Access

	Both Prometheus and Grafana are available on the production deployment:

	\| Service \| URL \|
	\|---------\|-----\|
	\| Prometheus \| https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/ \|
	\| Grafana \| https://dacrow13-hopcroft-skill-classification.hf.space/grafana/ \|

	### Uptime Monitoring (Better Stack)

	- External monitoring from multiple locations
	- Email notifications on failures
	- Tracked endpoints: `/health`, `/openapi.json`, `/docs`