---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---
# Hopcroft_Skill-Classification-Tool-Competition
The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
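The multi-label setup described above can be sketched with scikit-learn's `OneVsRestClassifier` over TF-IDF features. The toy issues and skill names below are illustrative only, not the competition schema:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy issues and their (possibly multiple) required skills -- illustrative only.
issues = [
    "Fix null pointer crash in the login handler",
    "Add index to speed up the users table query",
    "Crash when query returns empty result set",
]
skills = [["backend"], ["database"], ["backend", "database"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(skills)          # one binary column per skill

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(issues)

# One binary classifier per skill; a sample can fire several of them at once.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(vec.transform(["login handler crashes on empty query"]))
predicted_skills = mlb.inverse_transform(pred)  # tuple of skill names per sample
```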
## Project Organization
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration
│                         for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
```
--------
## Setup
### MLflow Credentials Configuration
Set up DagsHub credentials for MLflow tracking.
**Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens
#### Option 1: Using `.env` file (Recommended for local development)
```bash
# Copy the template
cp .env.example .env
# Edit .env with your credentials
```
Your `.env` file should contain:
```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```
> [!NOTE]
> The `.env` file is git-ignored for security. Never commit credentials to version control.
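If you load these variables programmatically outside Docker Compose, a dependency-free parser for the `KEY=VALUE` lines above looks roughly like this (the `python-dotenv` package does the same more robustly; this is only a sketch):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines and export them, skipping comments/blanks.

    Existing environment variables win (setdefault), so values already set
    by the shell or by Docker Compose are not overwritten.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```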
#### Option 2: Using Docker Compose
When using Docker Compose, the `.env` file is loaded automatically via the `env_file` directive in `docker-compose.yml`.
```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```
--------
## CI Configuration
[![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)
This project uses GitHub Actions workflows, triggered automatically on pushes and pull requests, for Continuous Integration.
### Secrets
To enable DVC model pulling, configure these Repository Secrets:
- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.
--------
## Milestone Summary
### Milestone 1
We compiled the ML Canvas and defined:
- Problem: multi-label classification of skills for PR/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.
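Micro-F1, the headline metric above, pools true/false positives across all label columns before computing F1, which is why it behaves sensibly under label imbalance. A worked example with scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two samples, three skill labels (binary indicator matrix).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Pooled over all cells: TP=2, FP=0, FN=1 -> precision=1.0, recall=2/3, F1=0.8
micro = f1_score(y_true, y_pred, average="micro")
print(micro)  # 0.8
```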
### Milestone 2
We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:
1. Data Management
- DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.
2. Data Ingestion & EDA
- `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
- Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).
3. Feature Engineering
- `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).
4. Central Config
- `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
5. Modeling & Experiments
- Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
- GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
6. Imbalance Handling
- Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.
7. Tracking & Reproducibility
- Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
8. Tooling
- Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
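The GitHub text cleaning in `features.py` (item 3 above) can be sketched as follows. The regexes and their ordering are illustrative, and the Porter stemming step is omitted here to keep the sketch dependency-free; see the real module for the exact pipeline:

```python
import re

URL_RE = re.compile(r"https?://\S+")
HTML_RE = re.compile(r"<[^>]+>")
CODE_FENCE_RE = re.compile(r"```.*?```", re.DOTALL)   # fenced code blocks
MD_RE = re.compile(r"[*_`#>\[\]()]")                  # leftover markdown punctuation

def clean_issue_text(text: str) -> str:
    """Rough GitHub-text normalization (stemming omitted in this sketch)."""
    text = CODE_FENCE_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = HTML_RE.sub(" ", text)
    text = MD_RE.sub(" ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned strings would then feed the TF-IDF vectorizer (uni+bi-grams) before being saved as `features_tfidf.npy`.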
### Milestone 3 (QA)
We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:
1. **Data Cleaning Pipeline**
- `data_cleaning.py`: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
- Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
2. **Great Expectations Validation** (10 tests)
- Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
- Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
- All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.
3. **Deepchecks Validation** (24 checks across 2 suites)
- Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
- Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
- Cleaned data achieved production-ready status (96% overall score).
4. **Behavioral Testing** (36 tests)
- Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
- Directional tests (10): keyword addition effects, technical detail impact on predictions.
- Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
- All tests passed; comprehensive report in `reports/behavioral/`.
5. **Code Quality Analysis**
- Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
- PEP 8 compliant, Black compatible (line length 88).
6. **Documentation**
- Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
- Behavioral testing README with test categories, usage examples, and extension guide.
7. **Tooling**
- Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
- Automated test execution and report generation.
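The majority-voting conflict resolution described in step 1 can be sketched like this (simplified; the real `data_cleaning.py` operates on the SQLite tables and the full label matrix):

```python
from collections import Counter, defaultdict

def resolve_label_conflicts(rows):
    """rows: iterable of (sample_id, label_tuple) pairs.

    When the same sample appears several times with conflicting label sets,
    keep the label set that received the most votes.
    """
    votes = defaultdict(Counter)
    for sample_id, labels in rows:
        votes[sample_id][tuple(labels)] += 1
    # most_common(1) yields the winning label set for each sample
    return {sid: counter.most_common(1)[0][0] for sid, counter in votes.items()}
```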
### Milestone 4 (API)
We implemented a production-ready FastAPI service for skill prediction with MLflow integration:
#### Features
- **REST API Endpoints**:
- `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
- `GET /predictions/{run_id}` - Retrieve prediction by MLflow run ID
- `GET /predictions` - List recent predictions with pagination
- `GET /health` - Health check endpoint
- **Model Management**: Loads trained Random Forest + TF-IDF vectorizer from `models/`
- **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
- **Input Validation**: Pydantic models for request/response validation
- **Interactive Docs**: Auto-generated Swagger UI and ReDoc
#### API Usage
**1. Start the API Server**
```bash
# Development mode (auto-reload)
make api-dev
# Production mode
make api-run
```
Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)
**2. Test Endpoints**
**Option A: Swagger UI (Recommended)**
- Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Interactive interface to test all endpoints
- View request/response schemas
**Option B: Make Commands**
```bash
# Test all endpoints
make test-api-all
# Individual endpoints
make test-api-health # Health check
make test-api-predict # Single prediction
make test-api-list # List predictions
```
#### Prerequisites
- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)
#### MLflow Integration
- All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata
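What gets logged per prediction amounts to roughly the following. The field names here are assumptions for illustration; the API code is authoritative for the actual keys:

```python
import time

def build_prediction_record(text, skills, probabilities, model_version="rf-tfidf"):
    """Assemble one prediction's MLflow payload (field names are assumptions)."""
    return {
        "params": {"model_version": model_version, "input_length": len(text)},
        "metrics": {f"proba_{s}": p for s, p in zip(skills, probabilities)},
        "tags": {"logged_at": str(int(time.time()))},
    }

def log_prediction(record):
    """Push the record to the tracking server; needs MLFLOW_* credentials set."""
    import mlflow
    mlflow.set_experiment("skill_prediction_api")
    with mlflow.start_run():
        mlflow.log_params(record["params"])
        mlflow.log_metrics(record["metrics"])
        mlflow.set_tags(record["tags"])
```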
#### Docker
Build and run the API in a container:
```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```
Endpoints:
- Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
- Health check: [http://localhost:8080/health](http://localhost:8080/health)
--------
## Docker Compose Usage
Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.
### Prerequisites
1. **Create your environment file:**
```bash
cp .env.example .env
```
2. **Edit `.env`** with your actual credentials:
```
MLFLOW_TRACKING_USERNAME=your_dagshub_username
MLFLOW_TRACKING_PASSWORD=your_dagshub_token
```
Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)
### Quick Start
#### 1. Build and Start All Services
Build both images and start the containers:
```bash
docker-compose up -d --build
```
| Flag | Description |
|------|-------------|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |
**Available Services:**
- **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
- **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
- **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)
#### 2. Stop All Services
Stop and remove containers and networks:
```bash
docker-compose down
```
| Flag | Description |
|------|-------------|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |
#### 3. Restart Services
After updating `.env` or configuration files:
```bash
docker-compose restart
```
Or for a full restart with environment reload:
```bash
docker-compose down
docker-compose up -d
```
#### 4. Check Status
View the status of all running services:
```bash
docker-compose ps
```
Or use Docker commands:
```bash
docker ps
```
#### 5. View Logs
Tail logs from both services in real-time:
```bash
docker-compose logs -f
```
View logs from a specific service:
```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```
| Flag | Description |
|------|-------------|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |
#### 6. Execute Commands in Container
Open an interactive shell inside a running container:
```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```
Examples of useful commands inside the API container:
```bash
# Check installed packages
pip list
# Run Python interactively
python
# Check model file exists
ls -la /app/models/
# Verify environment variables
printenv | grep MLFLOW
```
### Architecture Overview
**Docker Compose orchestrates two services:**
```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```
**External Access:**
- API: http://localhost:8080
- GUI: http://localhost:8501
**Internal Communication:**
- GUI → API: http://hopcroft-api:8080 (via Docker network)
### Services Description
**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
- Random Forest model with embedding features
- MLflow experiment tracking
- Auto-reload in development mode
- Health check endpoint
**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
- User-friendly interface for skill prediction
- Real-time communication with API
- Automatic reconnection on API restart
- Depends on API health before starting
### Development vs Production
**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network
**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network
For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.
### Troubleshooting
#### Issue: GUI shows "API is not available"
**Solution:**
1. Wait 30-60 seconds for API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`
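The 30–60 second wait in step 1 can be automated with a small readiness poll. This is a sketch, assuming the `/health` endpoint answers HTTP 200 once the model is loaded:

```python
import time
from urllib import error, request

def wait_for_api(url="http://localhost:8080/health", retries=12, delay=5.0):
    """Poll /health until it answers 200, waiting up to retries * delay seconds."""
    for _ in range(retries):
        try:
            with request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return True
        except (error.URLError, OSError):
            pass  # not up yet; retry after the delay
        time.sleep(delay)
    return False
```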
#### Issue: "500 Internal Server Error" on predictions
**Solution:**
1. Verify MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`
#### Issue: Changes to code not reflected
**Solution:**
- For Python code changes: Auto-reload is enabled, wait a few seconds
- For Dockerfile changes: Rebuild with `docker-compose up -d --build`
- For `.env` changes: Restart with `docker-compose down && docker-compose up -d`
#### Issue: Port already in use
**Solution:**
```bash
# Check what's using the port (Windows; on Linux/macOS use `lsof -i :8080`)
netstat -ano | findstr :8080
netstat -ano | findstr :8501
# Stop existing containers
docker-compose down
# Or change ports in docker-compose.yml
```
--------
## Hugging Face Spaces Deployment
This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.
### 1. Setup Space
1. Create a new Space on Hugging Face.
2. Select **Docker** as the SDK.
3. Choose the **Blank** template or upload your code.
### 2. Configure Secrets
To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:
| Name | Type | Description |
|------|------|-------------|
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |
> [!IMPORTANT]
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.
### 3. Automated Startup
The deployment follows this automated flow:
1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
2. **scripts/start_space.sh**:
- Configures DVC with your secrets.
- Pulls models from the DagsHub remote.
- Starts the **FastAPI** backend (port 8000).
- Starts the **Streamlit** frontend (port 8501).
- Starts **Nginx** (port 7860) as a reverse proxy to route traffic.
### 4. Direct Access
Once deployed, your Space will be available at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`
The API documentation will be accessible at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
--------
## Demo UI (Streamlit)
The Streamlit GUI provides an interactive web interface for the skill classification API.
### Features
- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design
### Usage
1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5. View results in the predictions table
### Architecture
- **Frontend**: Streamlit (Python web framework)
- **Communication**: HTTP requests to FastAPI backend via Docker network
- **Independence**: GUI and API run in separate containers
- **Auto-reload**: GUI code changes are reflected immediately (bind mount)
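The GUI finds the backend through the `API_BASE_URL` environment variable set in `docker-compose.yml`; the resolution logic amounts to the following sketch (the local fallback URL is an assumption):

```python
import os

def resolve_api_base_url(default="http://localhost:8000"):
    """Inside Compose this resolves to http://hopcroft-api:8080 through the
    bridge network's DNS; locally it falls back to the dev server URL."""
    return os.environ.get("API_BASE_URL", default).rstrip("/")
```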
> The API and the GUI must run **simultaneously**, in different terminals (local development) or containers (Docker Compose).
### Quick Start
1. **Start the FastAPI backend:**
```bash
fastapi dev hopcroft_skill_classification_tool_competition/main.py
```
2. **In a new terminal, start Streamlit:**
```bash
streamlit run streamlit_app.py
```
3. **Open your browser:**
- Streamlit UI: http://localhost:8501
- FastAPI Docs: http://localhost:8000/docs
### Features
- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring
### Demo Walkthrough
#### Main Dashboard
![gui_main_dashboard](docs/img/gui_main_dashboard.png)
The main interface provides:
- **Sidebar**: API health status, confidence threshold slider, model info
- **Three input modes**: Quick Input, Detailed Input, Examples
#### Quick Input Mode
![gui_quick_input](docs/img/gui_quick_input.png)
Simply paste your GitHub issue text and click "Predict Skills"!
#### Prediction Results
![gui_detailed](docs/img/gui_detailed.png)
View:
- **Top predictions** with confidence scores
- **Full predictions table** with filtering
- **Processing metrics** (time, model version)
- **Raw JSON response** (expandable)
#### Detailed Input Mode
![gui_detailed_input](docs/img/gui_detailed_input.png)
Add optional metadata:
- Repository name
- PR number
- Detailed description
#### Example Gallery
![gui_ex](docs/img/gui_ex.png)
Test with pre-loaded examples:
- Authentication bugs
- ML features
- Database issues
- UI enhancements
### Usage
1. Enter GitHub issue/PR text in the input area
2. (Optional) Add description, repo name, PR number
3. Click "Predict Skills"
4. View results with confidence scores
5. Adjust threshold slider to filter predictions