---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---
# Hopcroft_Skill-Classification-Tool-Competition
The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
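The multi-label setup described above can be sketched with scikit-learn's `OneVsRestClassifier` over TF-IDF features. The toy issues and skill names below are illustrative only, not the competition schema:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy issues and their (possibly multiple) required skills -- illustrative only.
issues = [
    "Fix null pointer crash in the login handler",
    "Add index to speed up the users table query",
    "Crash when query returns empty result set",
]
skills = [["backend"], ["database"], ["backend", "database"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(skills)          # one binary column per skill

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(issues)

# One binary classifier per skill; a sample can fire several of them at once.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(vec.transform(["login handler crashes on empty query"]))
predicted_skills = mlb.inverse_transform(pred)  # tuple of skill names per sample
```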
## Project Organization
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration
│                         for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
```
--------
## Setup
### MLflow Credentials Configuration
Set up DagsHub credentials for MLflow tracking.
**Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens
#### Option 1: Using `.env` file (Recommended for local development)
```bash
# Copy the template
cp .env.example .env
# Edit .env with your credentials
```
Your `.env` file should contain:
```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```
> [!NOTE]
> The `.env` file is git-ignored for security. Never commit credentials to version control.
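If you load these variables programmatically outside Docker Compose, a dependency-free parser for the `KEY=VALUE` lines above looks roughly like this (the `python-dotenv` package does the same more robustly; this is only a sketch):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines and export them, skipping comments/blanks.

    Existing environment variables win (setdefault), so values already set
    by the shell or by Docker Compose are not overwritten.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```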
#### Option 2: Using Docker Compose
When using Docker Compose, the `.env` file is loaded automatically via the `env_file` directive in `docker-compose.yml`.
```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```
--------
## CI Configuration
[![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)
This project uses GitHub Actions workflows, triggered automatically on pushes and pull requests, for Continuous Integration.
### Secrets
To enable DVC model pulling, configure these Repository Secrets:
- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.
--------
## Milestone Summary
### Milestone 1
We compiled the ML Canvas and defined:
- Problem: multi-label classification of skills for PR/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.
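Micro-F1, the headline metric above, pools true/false positives across all label columns before computing F1, which is why it behaves sensibly under label imbalance. A worked example with scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two samples, three skill labels (binary indicator matrix).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Pooled over all cells: TP=2, FP=0, FN=1 -> precision=1.0, recall=2/3, F1=0.8
micro = f1_score(y_true, y_pred, average="micro")
print(micro)  # 0.8
```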
### Milestone 2
We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:
1. Data Management
- DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.
2. Data Ingestion & EDA
- `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
- Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).
3. Feature Engineering
- `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).
4. Central Config
- `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
5. Modeling & Experiments
- Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
- GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
6. Imbalance Handling
- Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.
7. Tracking & Reproducibility
- Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
8. Tooling
- Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
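The GitHub text cleaning in `features.py` (item 3 above) can be sketched as follows. The regexes and their ordering are illustrative, and the Porter stemming step is omitted here to keep the sketch dependency-free; see the real module for the exact pipeline:

```python
import re

URL_RE = re.compile(r"https?://\S+")
HTML_RE = re.compile(r"<[^>]+>")
CODE_FENCE_RE = re.compile(r"```.*?```", re.DOTALL)   # fenced code blocks
MD_RE = re.compile(r"[*_`#>\[\]()]")                  # leftover markdown punctuation

def clean_issue_text(text: str) -> str:
    """Rough GitHub-text normalization (stemming omitted in this sketch)."""
    text = CODE_FENCE_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = HTML_RE.sub(" ", text)
    text = MD_RE.sub(" ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned strings would then feed the TF-IDF vectorizer (uni+bi-grams) before being saved as `features_tfidf.npy`.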
### Milestone 3 (QA)
We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:
1. **Data Cleaning Pipeline**
- `data_cleaning.py`: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
- Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
2. **Great Expectations Validation** (10 tests)
- Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
- Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
- All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.
3. **Deepchecks Validation** (24 checks across 2 suites)
- Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
- Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
- Cleaned data achieved production-ready status (96% overall score).
4. **Behavioral Testing** (36 tests)
- Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
- Directional tests (10): keyword addition effects, technical detail impact on predictions.
- Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
- All tests passed; comprehensive report in `reports/behavioral/`.
5. **Code Quality Analysis**
- Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
- PEP 8 compliant, Black compatible (line length 88).
6. **Documentation**
- Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
- Behavioral testing README with test categories, usage examples, and extension guide.
7. **Tooling**
- Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
- Automated test execution and report generation.
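The majority-voting conflict resolution described in step 1 can be sketched like this (simplified; the real `data_cleaning.py` operates on the SQLite tables and the full label matrix):

```python
from collections import Counter, defaultdict

def resolve_label_conflicts(rows):
    """rows: iterable of (sample_id, label_tuple) pairs.

    When the same sample appears several times with conflicting label sets,
    keep the label set that received the most votes.
    """
    votes = defaultdict(Counter)
    for sample_id, labels in rows:
        votes[sample_id][tuple(labels)] += 1
    # most_common(1) yields the winning label set for each sample
    return {sid: counter.most_common(1)[0][0] for sid, counter in votes.items()}
```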
### Milestone 4 (API)
We implemented a production-ready FastAPI service for skill prediction with MLflow integration:
#### Features
- **REST API Endpoints**:
- `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
- `GET /predictions/{run_id}` - Retrieve prediction by MLflow run ID
- `GET /predictions` - List recent predictions with pagination
- `GET /health` - Health check endpoint
- **Model Management**: Loads trained Random Forest + TF-IDF vectorizer from `models/`
- **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
- **Input Validation**: Pydantic models for request/response validation
- **Interactive Docs**: Auto-generated Swagger UI and ReDoc
#### API Usage
**1. Start the API Server**
```bash
# Development mode (auto-reload)
make api-dev
# Production mode
make api-run
```
Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)
**2. Test Endpoints**
**Option A: Swagger UI (Recommended)**
- Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Interactive interface to test all endpoints
- View request/response schemas
**Option B: Make Commands**
```bash
# Test all endpoints
make test-api-all
# Individual endpoints
make test-api-health # Health check
make test-api-predict # Single prediction
make test-api-list # List predictions
```
#### Prerequisites
- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)
#### MLflow Integration
- All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata
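What gets logged per prediction amounts to roughly the following. The field names here are assumptions for illustration; the API code is authoritative for the actual keys:

```python
import time

def build_prediction_record(text, skills, probabilities, model_version="rf-tfidf"):
    """Assemble one prediction's MLflow payload (field names are assumptions)."""
    return {
        "params": {"model_version": model_version, "input_length": len(text)},
        "metrics": {f"proba_{s}": p for s, p in zip(skills, probabilities)},
        "tags": {"logged_at": str(int(time.time()))},
    }

def log_prediction(record):
    """Push the record to the tracking server; needs MLFLOW_* credentials set."""
    import mlflow
    mlflow.set_experiment("skill_prediction_api")
    with mlflow.start_run():
        mlflow.log_params(record["params"])
        mlflow.log_metrics(record["metrics"])
        mlflow.set_tags(record["tags"])
```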
#### Docker
Build and run the API in a container:
```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```
Endpoints:
- Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
- Health check: [http://localhost:8080/health](http://localhost:8080/health)
--------
## Docker Compose Usage
Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.
### Prerequisites
1. **Create your environment file:**
```bash
cp .env.example .env
```
2. **Edit `.env`** with your actual credentials:
```
MLFLOW_TRACKING_USERNAME=your_dagshub_username
MLFLOW_TRACKING_PASSWORD=your_dagshub_token
```
Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)
### Quick Start
#### 1. Build and Start All Services
Build both images and start the containers:
```bash
docker-compose up -d --build
```
| Flag | Description |
|------|-------------|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |
**Available Services:**
- **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
- **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
- **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)
#### 2. Stop All Services
Stop and remove containers and networks:
```bash
docker-compose down
```
| Flag | Description |
|------|-------------|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |
#### 3. Restart Services
After updating `.env` or configuration files:
```bash
docker-compose restart
```
Or for a full restart with environment reload:
```bash
docker-compose down
docker-compose up -d
```
#### 4. Check Status
View the status of all running services:
```bash
docker-compose ps
```
Or use Docker commands:
```bash
docker ps
```
#### 5. View Logs
Tail logs from both services in real-time:
```bash
docker-compose logs -f
```
View logs from a specific service:
```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```
| Flag | Description |
|------|-------------|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |
#### 6. Execute Commands in Container
Open an interactive shell inside a running container:
```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```
Examples of useful commands inside the API container:
```bash
# Check installed packages
pip list
# Run Python interactively
python
# Check model file exists
ls -la /app/models/
# Verify environment variables
printenv | grep MLFLOW
```
### Architecture Overview
**Docker Compose orchestrates two services:**
```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```
**External Access:**
- API: http://localhost:8080
- GUI: http://localhost:8501
**Internal Communication:**
- GUI → API: http://hopcroft-api:8080 (via Docker network)
### Services Description
**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
- Random Forest model with embedding features
- MLflow experiment tracking
- Auto-reload in development mode
- Health check endpoint
**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
- User-friendly interface for skill prediction
- Real-time communication with API
- Automatic reconnection on API restart
- Depends on API health before starting
### Development vs Production
**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network
**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network
For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.
### Troubleshooting
#### Issue: GUI shows "API is not available"
**Solution:**
1. Wait 30-60 seconds for API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`
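The 30–60 second wait in step 1 can be automated with a small readiness poll. This is a sketch, assuming the `/health` endpoint answers HTTP 200 once the model is loaded:

```python
import time
from urllib import error, request

def wait_for_api(url="http://localhost:8080/health", retries=12, delay=5.0):
    """Poll /health until it answers 200, waiting up to retries * delay seconds."""
    for _ in range(retries):
        try:
            with request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return True
        except (error.URLError, OSError):
            pass  # not up yet; retry after the delay
        time.sleep(delay)
    return False
```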
#### Issue: "500 Internal Server Error" on predictions
**Solution:**
1. Verify MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`
#### Issue: Changes to code not reflected
**Solution:**
- For Python code changes: Auto-reload is enabled, wait a few seconds
- For Dockerfile changes: Rebuild with `docker-compose up -d --build`
- For `.env` changes: Restart with `docker-compose down && docker-compose up -d`
#### Issue: Port already in use
**Solution:**
```bash
# Check what's using the port (Windows; on Linux/macOS use `lsof -i :8080`)
netstat -ano | findstr :8080
netstat -ano | findstr :8501
# Stop existing containers
docker-compose down
# Or change ports in docker-compose.yml
```
--------
## Hugging Face Spaces Deployment
This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.
### 1. Setup Space
1. Create a new Space on Hugging Face.
2. Select **Docker** as the SDK.
3. Choose the **Blank** template or upload your code.
### 2. Configure Secrets
To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:
| Name | Type | Description |
|------|------|-------------|
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |
> [!IMPORTANT]
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.
### 3. Automated Startup
The deployment follows this automated flow:
1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
2. **scripts/start_space.sh**:
- Configures DVC with your secrets.
- Pulls models from the DagsHub remote.
- Starts the **FastAPI** backend (port 8000).
- Starts the **Streamlit** frontend (port 8501).
- Starts **Nginx** (port 7860) as a reverse proxy to route traffic.
### 4. Direct Access
Once deployed, your Space will be available at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`
The API documentation will be accessible at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
--------
## Demo UI (Streamlit)
The Streamlit GUI provides an interactive web interface for the skill classification API.
### Features
- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design
### Usage
1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5. View results in the predictions table
### Architecture
- **Frontend**: Streamlit (Python web framework)
- **Communication**: HTTP requests to FastAPI backend via Docker network
- **Independence**: GUI and API run in separate containers
- **Auto-reload**: GUI code changes are reflected immediately (bind mount)
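The GUI finds the backend through the `API_BASE_URL` environment variable set in `docker-compose.yml`; the resolution logic amounts to the following sketch (the local fallback URL is an assumption):

```python
import os

def resolve_api_base_url(default="http://localhost:8000"):
    """Inside Compose this resolves to http://hopcroft-api:8080 through the
    bridge network's DNS; locally it falls back to the dev server URL."""
    return os.environ.get("API_BASE_URL", default).rstrip("/")
```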
> The API and the GUI must run **simultaneously**, in different terminals (local development) or containers (Docker Compose).
### Quick Start
1. **Start the FastAPI backend:**
```bash
fastapi dev hopcroft_skill_classification_tool_competition/main.py
```
2. **In a new terminal, start Streamlit:**
```bash
streamlit run streamlit_app.py
```
3. **Open your browser:**
- Streamlit UI: http://localhost:8501
- FastAPI Docs: http://localhost:8000/docs
### Features
- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring
### Demo Walkthrough
#### Main Dashboard
![gui_main_dashboard](docs/img/gui_main_dashboard.png)
The main interface provides:
- **Sidebar**: API health status, confidence threshold slider, model info
- **Three input modes**: Quick Input, Detailed Input, Examples
#### Quick Input Mode
![gui_quick_input](docs/img/gui_quick_input.png)
Simply paste your GitHub issue text and click "Predict Skills"!
#### Prediction Results
![gui_detailed](docs/img/gui_detailed.png)
View:
- **Top predictions** with confidence scores
- **Full predictions table** with filtering
- **Processing metrics** (time, model version)
- **Raw JSON response** (expandable)
#### Detailed Input Mode
![gui_detailed_input](docs/img/gui_detailed_input.png)
Add optional metadata:
- Repository name
- PR number
- Detailed description
#### Example Gallery
![gui_ex](docs/img/gui_ex.png)
Test with pre-loaded examples:
- Authentication bugs
- ML features
- Database issues
- UI enhancements
### Usage
1. Enter GitHub issue/PR text in the input area
2. (Optional) Add description, repo name, PR number
3. Click "Predict Skills"
4. View results with confidence scores
5. Adjust threshold slider to filter predictions