---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---

# Hopcroft_Skill-Classification-Tool-Competition

The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.

## Project Organization

```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for 
│                         hopcroft_skill_classification_tool_competition and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition   <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py               <- Store useful variables and configuration
    │
    ├── dataset.py              <- Scripts to download or generate data
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling                
    │   ├── __init__.py 
    │   ├── predict.py          <- Code to run model inference with trained models          
    │   └── train.py            <- Code to train models
    │
    └── plots.py                <- Code to create visualizations
```

--------

## Setup

### MLflow Credentials Configuration

Set up DagsHub credentials for MLflow tracking.

**Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens

#### Option 1: Using `.env` file (Recommended for local development)

```bash
# Copy the template
cp .env.example .env

# Edit .env with your credentials
```

Your `.env` file should contain:
```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```

> [!NOTE]
> The `.env` file is git-ignored for security. Never commit credentials to version control.
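
Before starting a run, it can help to confirm that the three variables are actually visible to the process. The following is a minimal sketch using only the standard library; the helper name `missing_mlflow_vars` is hypothetical and not part of the project. It inspects the process environment, so the `.env` values must already have been loaded (for example by Docker Compose or a dotenv loader).

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_mlflow_vars(env=None):
    """Return the required variables that are unset or empty in `env`.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_mlflow_vars()
    if missing:
        print("Missing MLflow settings:", ", ".join(missing))
    else:
        print("All MLflow settings are present.")
```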

#### Option 2: Using Docker Compose

When using Docker Compose, the `.env` file is loaded automatically via the `env_file` directive in `docker-compose.yml`.

```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```

--------

## CI Configuration

[![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)

This project uses GitHub Actions workflows, triggered automatically, for Continuous Integration.

### Secrets

To enable DVC model pulling, configure these Repository Secrets:

- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.

--------

## Milestone Summary

### Milestone 1
We compiled the ML Canvas and defined:
- Problem: multi-label classification of skills for PR/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.
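
Micro-F1 pools true positives, false positives, and false negatives across all labels before computing a single score, so frequent labels weigh more than rare ones. The sketch below computes it in plain Python on a toy example; the three label columns are hypothetical and the function is illustrative, not the project's evaluation code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over rows of binary multi-label indicators."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # With no positives anywhere the score is undefined; return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


# Three issues, three hypothetical skill labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
print(micro_f1(y_true, y_pred))
```

Here the pooled counts are 3 true positives, 1 false positive, and 2 false negatives, so the score is 6 / 9, roughly 0.667.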

### Milestone 2
We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:

1. Data Management
    - DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.

2. Data Ingestion & EDA
    - `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
    - Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).

3. Feature Engineering
    - `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).

4. Central Config
    - `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.

5. Modeling & Experiments
    - Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
    - GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).

6. Imbalance Handling
    - Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.

7. Tracking & Reproducibility
    - Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).

8. Tooling
    - Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
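
The text-cleaning step listed under Feature Engineering can be illustrated with a few regular expressions. This is a simplified sketch, not the contents of `features.py`: the function name and patterns are assumptions, and Porter stemming is left out because it needs NLTK.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")  # keep link text, drop target
MD_SYMBOLS_RE = re.compile(r"[`*_#>~]+")  # note: this also splits snake_case names
WHITESPACE_RE = re.compile(r"\s+")


def clean_issue_text(text):
    """Strip Markdown links, URLs, HTML tags and Markdown symbols, then normalize.

    Illustrative only; the project's real cleaning rules may differ.
    """
    text = MD_LINK_RE.sub(r"\1", text)   # [docs](url) -> docs
    text = URL_RE.sub(" ", text)         # bare URLs
    text = HTML_TAG_RE.sub(" ", text)    # <b>, </b>, ...
    text = MD_SYMBOLS_RE.sub(" ", text)  # headings, emphasis, code ticks
    return WHITESPACE_RE.sub(" ", text).strip().lower()


sample = "## Fix <b>login</b> bug, see [docs](https://example.com/a) or https://example.com/b **now**"
print(clean_issue_text(sample))
```

Markdown links are rewritten before URLs are removed so that the visible link text survives while the target is dropped.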

### Milestone 3 (QA)
We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:

1. **Data Cleaning Pipeline**
    - `data_cleaning.py`: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
    - Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.

2. **Great Expectations Validation** (10 tests)
    - Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
    - Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
    - All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.

3. **Deepchecks Validation** (24 checks across 2 suites)
    - Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
    - Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
    - Cleaned data achieved production-ready status (96% overall score).

4. **Behavioral Testing** (36 tests)
    - Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
    - Directional tests (10): keyword addition effects, technical detail impact on predictions.
    - Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
    - All tests passed; comprehensive report in `reports/behavioral/`.

5. **Code Quality Analysis**
    - Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
    - PEP 8 compliant, Black compatible (line length 88).

6. **Documentation**
    - Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
    - Behavioral testing README with test categories, usage examples, and extension guide.

7. **Tooling**
    - Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
    - Automated test execution and report generation.
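
The duplicate removal and majority voting described in the Data Cleaning Pipeline item can be sketched as follows. This is an illustration under stated assumptions, not the code in `data_cleaning.py`: duplicates are taken to be samples with identical text, and a label is kept only when a strict majority of the duplicates carry it (ties resolve to 0).

```python
from collections import defaultdict


def dedupe_with_majority_vote(texts, labels):
    """Collapse identical texts; each label bit is decided by strict majority.

    Hypothetical helper: tie-breaking toward 0 is an assumption.
    """
    groups = defaultdict(list)
    for text, row in zip(texts, labels):
        groups[text].append(row)

    cleaned_texts, cleaned_labels = [], []
    for text, rows in groups.items():  # dicts preserve first-seen order
        n = len(rows)
        # zip(*rows) iterates over label columns within the duplicate group.
        voted = [1 if 2 * sum(col) > n else 0 for col in zip(*rows)]
        cleaned_texts.append(text)
        cleaned_labels.append(voted)
    return cleaned_texts, cleaned_labels


texts = ["a", "b", "a", "a"]
labels = [[1, 0], [0, 1], [1, 1], [0, 1]]
print(dedupe_with_majority_vote(texts, labels))
```

In the toy input, the three copies of `"a"` vote 2 to 1 for both labels, so the merged sample keeps both.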

### Milestone 4 (API)
We implemented a production-ready FastAPI service for skill prediction with MLflow integration:

#### Features
- **REST API Endpoints**:
  - `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
  - `GET /predictions/{run_id}` - Retrieve prediction by MLflow run ID
  - `GET /predictions` - List recent predictions with pagination
  - `GET /health` - Health check endpoint
- **Model Management**: Loads trained Random Forest + TF-IDF vectorizer from `models/`
- **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
- **Input Validation**: Pydantic models for request/response validation
- **Interactive Docs**: Auto-generated Swagger UI and ReDoc
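
As an illustration of calling `POST /predict` from Python, the sketch below builds the request with the standard library without sending it. The body field name `issue_text` is an assumption, as is the helper name; the actual request schema is the one shown in the Swagger UI.

```python
import json
import urllib.request


def build_predict_request(base_url, issue_text):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    # ASSUMPTION: the request body uses a field called "issue_text".
    # Check the real schema in the interactive docs before relying on this.
    payload = json.dumps({"issue_text": issue_text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("http://127.0.0.1:8000", "Login fails with a 500 error")
print(req.get_method(), req.full_url)
```

With the server running, passing `req` to `urllib.request.urlopen` sends the request; the response can then be parsed with `json.load`.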

#### API Usage

**1. Start the API Server**
```bash
# Development mode (auto-reload)
make api-dev

# Production mode
make api-run
```
Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)

**2. Test Endpoints**

**Option A: Swagger UI (Recommended)**
- Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Interactive interface to test all endpoints
- View request/response schemas

**Option B: Make Commands**
```bash
# Test all endpoints
make test-api-all

# Individual endpoints
make test-api-health        # Health check
make test-api-predict       # Single prediction
make test-api-list          # List predictions
```

#### Prerequisites
- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)

#### MLflow Integration
- All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata

#### Docker
Build and run the API in a container:
```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```

Endpoints:
- Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
- Health check: [http://localhost:8080/health](http://localhost:8080/health)

---

## Docker Compose Usage

Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.

### Prerequisites

1. **Create your environment file:**
   ```bash
   cp .env.example .env
   ```

2. **Edit `.env`** with your actual credentials:
   ```
   MLFLOW_TRACKING_USERNAME=your_dagshub_username
   MLFLOW_TRACKING_PASSWORD=your_dagshub_token
   ```
   
   Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)

### Quick Start

#### 1. Build and Start All Services
Build both images and start the containers:
```bash
docker-compose up -d --build
```

| Flag | Description |
|------|-------------|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |

**Available Services:**
- **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
- **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
- **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)

#### 2. Stop All Services
Stop and remove containers and networks:
```bash
docker-compose down
```

| Flag | Description |
|------|-------------|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |

#### 3. Restart Services
After updating `.env` or configuration files:
```bash
docker-compose restart
```

Or for a full restart with environment reload:
```bash
docker-compose down
docker-compose up -d
```

#### 4. Check Status
View the status of all running services:
```bash
docker-compose ps
```

Or use Docker commands:
```bash
docker ps
```

#### 5. View Logs
Tail logs from both services in real-time:
```bash
docker-compose logs -f
```

View logs from a specific service:
```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```

| Flag | Description |
|------|-------------|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |

#### 6. Execute Commands in Container
Open an interactive shell inside a running container:
```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```

Examples of useful commands inside the API container:
```bash
# Check installed packages
pip list

# Run Python interactively
python

# Check model file exists
ls -la /app/models/

# Verify environment variables
printenv | grep MLFLOW
```

### Architecture Overview

**Docker Compose orchestrates two services:**

```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```

**External Access:**
- API: http://localhost:8080
- GUI: http://localhost:8501

**Internal Communication:**
- GUI → API: http://hopcroft-api:8080 (via Docker network)
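
The two addresses differ only in the base URL, which the GUI container receives through `API_BASE_URL`. A client could resolve it as sketched below; the helper name and the fallback to the host-side address are assumptions, not the GUI's actual code.

```python
import os

# Host-side address from the list above, used when API_BASE_URL is not set.
DEFAULT_API_BASE_URL = "http://localhost:8080"


def resolve_api_url(path, env=None):
    """Join the configured API base URL with an endpoint path.

    Hypothetical helper; reads API_BASE_URL from `env` (default: os.environ).
    """
    env = os.environ if env is None else env
    base = env.get("API_BASE_URL") or DEFAULT_API_BASE_URL
    return base.rstrip("/") + "/" + path.lstrip("/")


# Inside the Compose network the service name replaces localhost.
print(resolve_api_url("/health", {"API_BASE_URL": "http://hopcroft-api:8080"}))
```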

### Services Description

**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
  - Random Forest model with embedding features
  - MLflow experiment tracking
  - Auto-reload in development mode
  - Health check endpoint

**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
  - User-friendly interface for skill prediction
  - Real-time communication with API
  - Automatic reconnection on API restart
  - Depends on API health before starting

### Development vs Production

**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network

**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network

For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.

### Troubleshooting

#### Issue: GUI shows "API is not available"
**Solution:**
1. Wait 30-60 seconds for the API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`
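
Instead of waiting a fixed time, the health endpoint can be polled until it responds. The following is a minimal standard-library sketch; the function names, the 12 attempts, and the 5-second delay are illustrative choices, not project settings.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(url, attempts=12, delay=5.0, probe=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; return True on success, False otherwise."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:8080/health")` with the defaults gives the API roughly a minute to come up. The `probe` and `sleep` parameters exist so the loop can be exercised without a network.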

#### Issue: "500 Internal Server Error" on predictions
**Solution:**
1. Verify MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`

#### Issue: Changes to code not reflected
**Solution:**
- For Python code changes: Auto-reload is enabled; wait a few seconds
- For Dockerfile changes: Rebuild with `docker-compose up -d --build`
- For `.env` changes: Restart with `docker-compose down && docker-compose up -d`

#### Issue: Port already in use
**Solution:**
```bash
# Check what's using the port (Windows)
netstat -ano | findstr :8080
netstat -ano | findstr :8501

# Linux/macOS equivalent
lsof -i :8080
lsof -i :8501

# Stop existing containers
docker-compose down

# Or change ports in docker-compose.yml
```
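
A platform-independent way to see whether the two ports are taken is to try connecting to them. This sketch uses only the standard library; `port_in_use` is a hypothetical helper, and it only reports that something is listening on the loopback interface, not which process owns the port.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (8080, 8501):  # API and GUI ports from docker-compose.yml
        print(port, "in use" if port_in_use(port) else "free")
```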


--------

## Hugging Face Spaces Deployment

This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.

### 1. Setup Space
1. Create a new Space on Hugging Face.
2. Select **Docker** as the SDK.
3. Choose the **Blank** template or upload your code.

### 2. Configure Secrets
To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:

| Name | Type | Description |
|------|------|-------------|
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |

> [!IMPORTANT]
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.

### 3. Automated Startup
The deployment follows this automated flow:
1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
2. **scripts/start_space.sh**: 
   - Configures DVC with your secrets.
   - Pulls models from the DagsHub remote.
   - Starts the **FastAPI** backend (port 8000).
   - Starts the **Streamlit** frontend (port 8501).
   - Starts **Nginx** (port 7860) as a reverse proxy to route traffic.

### 4. Direct Access
Once deployed, your Space will be available at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`

The API documentation will be accessible at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`

--------

## Demo UI (Streamlit)

The Streamlit GUI provides an interactive web interface for the skill classification API.

### Features
- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design

### Usage
1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5. View results in the predictions table

### Architecture
- **Frontend**: Streamlit (Python web framework)
- **Communication**: HTTP requests to FastAPI backend via Docker network
- **Independence**: GUI and API run in separate containers
- **Auto-reload**: GUI code changes are reflected immediately (bind mount)

> The API and GUI must run **simultaneously**, in separate terminals or containers.

### Quick Start (without Docker)

1. **Start the FastAPI backend:**
   ```bash
   fastapi dev hopcroft_skill_classification_tool_competition/main.py
   ```

2. **In a new terminal, start Streamlit:**
   ```bash
   streamlit run streamlit_app.py
   ```

3. **Open your browser:**
   - Streamlit UI: http://localhost:8501
   - FastAPI Docs: http://localhost:8000/docs

### Features

- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring

### Demo Walkthrough

#### Main Dashboard

![gui_main_dashboard](docs/img/gui_main_dashboard.png)

The main interface provides:
- **Sidebar**: API health status, confidence threshold slider, model info
- **Three input modes**: Quick Input, Detailed Input, Examples

#### Quick Input Mode

![gui_quick_input](docs/img/gui_quick_input.png)

Simply paste your GitHub issue text and click "Predict Skills"!

#### Prediction Results

![gui_detailed](docs/img/gui_detailed.png)

View:
- **Top predictions** with confidence scores
- **Full predictions table** with filtering
- **Processing metrics** (time, model version)
- **Raw JSON response** (expandable)

#### Detailed Input Mode

![gui_detailed_input](docs/img/gui_detailed_input.png)

Add optional metadata:
- Repository name
- PR number
- Detailed description

#### Example Gallery
![gui_ex](docs/img/gui_ex.png)

Test with pre-loaded examples:
- Authentication bugs
- ML features
- Database issues
- UI enhancements
  

### Usage

1. Enter GitHub issue/PR text in the input area
2. (Optional) Add description, repo name, PR number
3. Click "Predict Skills"
4. View results with confidence scores
5. Adjust threshold slider to filter predictions
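
The effect of the threshold slider can be pictured as a filter over per-skill confidence scores. The sketch below assumes scores between 0 and 1 and an inclusive comparison; the function name, the default values, and the example skills are hypothetical, and the GUI's own logic may differ.

```python
def select_skills(scores, threshold=0.5, top_k=5):
    """Keep skills scoring at least `threshold`, best first, at most `top_k`."""
    kept = [(skill, score) for skill, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical confidence scores; real label names come from the model.
scores = {"Database": 0.82, "API": 0.64, "Testing": 0.31, "DevOps": 0.12}
print(select_skills(scores, threshold=0.3))
```

Raising the threshold to 0.7 in this example would leave only the first entry.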