---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---

Hopcroft_Skill-Classification-Tool-Competition

The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
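To make the multi-label nature concrete, here is a minimal sketch (using hypothetical skill names, not the competition's actual label set) of how per-issue skill sets become a binary indicator matrix:

```python
# Build a binary indicator matrix for multi-label targets.
# The skill names below are illustrative placeholders only.
SKILLS = ["backend", "database", "testing"]

def to_indicator(issue_skills):
    """Map a list of skill sets (one per issue) to rows of 0/1 indicators."""
    return [[1 if s in skills else 0 for s in SKILLS] for skills in issue_skills]

rows = to_indicator([{"backend", "testing"}, {"database"}])
# First issue activates two labels at once -> [1, 0, 1]; second -> [0, 1, 0]
```

Classifiers for this setting predict one such 0/1 row per issue, rather than a single class.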

Project Organization

```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition   <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py               <- Store useful variables and configuration
    │
    ├── dataset.py              <- Scripts to download or generate data
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py          <- Code to run model inference with trained models
    │   └── train.py            <- Code to train models
    │
    └── plots.py                <- Code to create visualizations
```

Setup

MLflow Credentials Configuration

Set up DagsHub credentials for MLflow tracking.

Get your token: DagsHub β†’ Profile β†’ Settings β†’ Tokens

Option 1: Using .env file (Recommended for local development)

# Copy the template
cp .env.example .env

# Edit .env with your credentials

Your .env file should contain:

MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token

The .env file is git-ignored for security. Never commit credentials to version control.
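Under the hood, loading `.env` just means exporting `KEY=VALUE` pairs into the process environment, which MLflow then reads. In practice the `python-dotenv` package does this; the stdlib sketch below only illustrates the idea:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: export KEY=VALUE pairs into os.environ.
    Illustrative only; python-dotenv is the usual tool for this."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without a KEY=VALUE shape.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: never clobber variables already set in the shell.
            os.environ.setdefault(key.strip(), value.strip())
```

MLflow picks up `MLFLOW_TRACKING_URI`, `MLFLOW_TRACKING_USERNAME`, and `MLFLOW_TRACKING_PASSWORD` from the environment once they are set this way.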

Option 2: Using Docker Compose

When using Docker Compose, the .env file is automatically loaded via the env_file directive in docker-compose.yml.

# Start the service (credentials loaded from .env)
docker compose up --build

CI Configuration

CI Pipeline

This project uses automatically triggered GitHub Actions workflows for Continuous Integration.

Secrets

To enable DVC model pulling, configure these Repository Secrets:

  • DAGSHUB_USERNAME: DagsHub username.
  • DAGSHUB_TOKEN: DagsHub access token.

Milestone Summary

Milestone 1

We compiled the ML Canvas and defined:

  • Problem: multi-label classification of skills for PR/issues.
  • Stakeholders and business/research goals.
  • Data sources (SkillScope DB) and constraints (no external classifiers).
  • Success metrics (micro-F1, imbalance handling, experiment tracking).
  • Risks (label imbalance, text noise, multi-label complexity) and mitigations.
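The micro-F1 success metric pools true/false positives and false negatives across every label before computing F1 (equivalent to scikit-learn's `f1_score(..., average="micro")`). A self-contained sketch of the computation:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label 0/1 indicator matrices.
    Counts are pooled across all labels, so frequent labels dominate."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            tp += t and p            # predicted 1, truly 1
            fp += (not t) and p      # predicted 1, truly 0
            fn += t and (not p)      # predicted 0, truly 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

This pooling is what makes micro-F1 a sensible single number under label imbalance, one of the risks listed above.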

Milestone 2

We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:

  1. Data Management

    • DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.
  2. Data Ingestion & EDA

    • dataset.py to download/extract SkillScope from Hugging Face (zip β†’ SQLite) with cleanup.
    • Initial exploration notebook notebooks/1.0-initial-data-exploration.ipynb (schema, text stats, label distribution).
  3. Feature Engineering

    • features.py: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (features_tfidf.npy, labels_tfidf.npy).
  4. Central Config

    • config.py with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
  5. Modeling & Experiments

    • Unified modeling/train.py with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
    • GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
  6. Imbalance Handling

    • Local mlsmote.py (multi-label oversampling) with fallback to RandomOverSampler; dedicated ADASYN+PCA pipeline.
  7. Tracking & Reproducibility

    • Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
  8. Tooling

    • Updated requirements.txt (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (data, features).

Milestone 3 (QA)

We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:

  1. Data Cleaning Pipeline

    • data_cleaning.py: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
    • Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
  2. Great Expectations Validation (10 tests)

    • Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
    • Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
    • All 10 tests pass on cleaned data; comprehensive JSON reports in reports/great_expectations/.
  3. Deepchecks Validation (24 checks across 2 suites)

    • Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
    • Train-Test Validation Suite (100% score): zero data leakage, proper train/test split, feature/label drift analysis.
    • Cleaned data achieved production-ready status (96% overall score).
  4. Behavioral Testing (36 tests)

    • Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
    • Directional tests (10): keyword addition effects, technical detail impact on predictions.
    • Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
    • All tests passed; comprehensive report in reports/behavioral/.
  5. Code Quality Analysis

    • Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
    • PEP 8 compliant, Black compatible (line length 88).
  6. Documentation

    • Comprehensive docs/testing_and_validation.md with detailed test descriptions, execution commands, and analysis results.
    • Behavioral testing README with test categories, usage examples, and extension guide.
  7. Tooling

    • Makefile targets: validate-gx, validate-deepchecks, test-behavioral, test-complete.
    • Automated test execution and report generation.
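The majority-vote conflict resolution from step 1 might look roughly like this (function and data shapes are illustrative, not the actual data_cleaning.py implementation):

```python
from collections import defaultdict

def resolve_label_conflicts(samples):
    """Collapse duplicate texts to one sample each; each label column is
    decided by majority vote across the duplicates (illustrative sketch)."""
    groups = defaultdict(list)
    for text, labels in samples:
        groups[text].append(tuple(labels))
    resolved = {}
    for text, label_rows in groups.items():
        n = len(label_rows)
        # A label survives if strictly more than half of the duplicates carry it.
        resolved[text] = [int(sum(col) * 2 > n) for col in zip(*label_rows)]
    return resolved
```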

Milestone 4 (API)

We implemented a production-ready FastAPI service for skill prediction with MLflow integration:

Features

  • REST API Endpoints:
    • POST /predict - Predict skills for a GitHub issue (logs to MLflow)
    • GET /predictions/{run_id} - Retrieve prediction by MLflow run ID
    • GET /predictions - List recent predictions with pagination
    • GET /health - Health check endpoint
  • Model Management: Loads trained Random Forest + TF-IDF vectorizer from models/
  • MLflow Tracking: All predictions logged with metadata, probabilities, and timestamps
  • Input Validation: Pydantic models for request/response validation
  • Interactive Docs: Auto-generated Swagger UI and ReDoc

API Usage

1. Start the API Server

# Development mode (auto-reload)
make api-dev

# Production mode
make api-run

Server starts at: http://127.0.0.1:8000

2. Test Endpoints

Option A: Swagger UI (Recommended)

Open http://127.0.0.1:8000/docs in your browser and try the endpoints interactively.

Option B: Make Commands

# Test all endpoints
make test-api-all

# Individual endpoints
make test-api-health        # Health check
make test-api-predict       # Single prediction
make test-api-list          # List predictions
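For scripting outside make, a stdlib-only sketch of calling POST /predict (the `text` field name is an assumption here; check the actual schema in Swagger UI). The network call is guarded so it only runs against a live server:

```python
import json
from urllib import request

API_URL = "http://127.0.0.1:8000"  # dev server address from `make api-dev`

def build_predict_request(text):
    """Prepare the POST /predict call (payload field name is illustrative)."""
    payload = json.dumps({"text": text}).encode()
    return request.Request(
        f"{API_URL}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":  # requires the API server to be running
    with request.urlopen(build_predict_request("Fix crash in login API")) as resp:
        print(json.load(resp))
```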

Prerequisites

  • Trained model: models/random_forest_tfidf_gridsearch.pkl
  • TF-IDF vectorizer: models/tfidf_vectorizer.pkl (auto-saved during feature creation)
  • Label names: models/label_names.pkl (auto-saved during feature creation)

MLflow Integration

  • All predictions logged to: https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
  • Experiment: skill_prediction_api
  • Tracked: input text, predictions, probabilities, metadata

Docker

Build and run the API in a container:

docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api

Endpoints:

  • API: http://localhost:8080
  • Swagger UI: http://localhost:8080/docs


Docker Compose Usage

Docker Compose orchestrates both the API backend and Streamlit GUI services with proper networking and configuration.

Prerequisites

  1. Create your environment file:

    cp .env.example .env
    
  2. Edit .env with your actual credentials:

    MLFLOW_TRACKING_USERNAME=your_dagshub_username
    MLFLOW_TRACKING_PASSWORD=your_dagshub_token
    

    Get your token from: https://dagshub.com/user/settings/tokens

Quick Start

1. Build and Start All Services

Build both images and start the containers:

docker-compose up -d --build
| Flag | Description |
| --- | --- |
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |

Available Services:

  • API: http://localhost:8080
  • GUI: http://localhost:8501

2. Stop All Services

Stop and remove containers and networks:

docker-compose down
| Flag | Description |
| --- | --- |
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |

3. Restart Services

After updating .env or configuration files:

docker-compose restart

Or for a full restart with environment reload:

docker-compose down
docker-compose up -d

4. Check Status

View the status of all running services:

docker-compose ps

Or use Docker commands:

docker ps

5. View Logs

Tail logs from both services in real-time:

docker-compose logs -f

View logs from a specific service:

docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
| Flag | Description |
| --- | --- |
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only the last 100 lines: `docker-compose logs --tail 100` |

6. Execute Commands in Container

Open an interactive shell inside a running container:

docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash

Examples of useful commands inside the API container:

# Check installed packages
pip list

# Run Python interactively
python

# Check model file exists
ls -la /app/models/

# Verify environment variables
printenv | grep MLFLOW

### Architecture Overview

**Docker Compose orchestrates two services:**

```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```


**External Access:**
- API: http://localhost:8080
- GUI: http://localhost:8501

**Internal Communication:**
- GUI β†’ API: http://hopcroft-api:8080 (via Docker network)

### Services Description

**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
  - Random Forest model with TF-IDF features
  - MLflow experiment tracking
  - Auto-reload in development mode
  - Health check endpoint

**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
  - User-friendly interface for skill prediction
  - Real-time communication with API
  - Automatic reconnection on API restart
  - Depends on API health before starting

### Development vs Production

**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI β†’ API via Docker network

**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI β†’ API via Docker network

For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.

### Troubleshooting

#### Issue: GUI shows "API is not available"
**Solution:**
1. Wait 30-60 seconds for API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`

#### Issue: "500 Internal Server Error" on predictions
**Solution:**
1. Verify MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`

#### Issue: Changes to code not reflected
**Solution:**
- For Python code changes: Auto-reload is enabled, wait a few seconds
- For Dockerfile changes: Rebuild with `docker-compose up -d --build`
- For `.env` changes: Restart with `docker-compose down && docker-compose up -d`

#### Issue: Port already in use
**Solution:**
```bash
# Check what's using the port (Windows; on macOS/Linux use: lsof -i :8080)
netstat -ano | findstr :8080
netstat -ano | findstr :8501

# Stop existing containers
docker-compose down

# Or change ports in docker-compose.yml
```
Demo UI (Streamlit)

The Streamlit GUI provides an interactive web interface for the skill classification API.

Features

  • Real-time skill prediction from GitHub issue text
  • Top-5 predicted skills with confidence scores
  • Full predictions table with all skills
  • API connection status indicator
  • Responsive design

Usage

  1. Ensure both services are running: docker-compose up -d
  2. Open the GUI in your browser: http://localhost:8501
  3. Enter a GitHub issue description in the text area
  4. Click "Predict Skills" to get predictions
  5. View results in the predictions table

Architecture

  • Frontend: Streamlit (Python web framework)
  • Communication: HTTP requests to FastAPI backend via Docker network
  • Independence: GUI and API run in separate containers
  • Auto-reload: GUI code changes are reflected immediately (bind mount)

    Note: both services must be running at the same time, in separate terminals or containers.

Quick Start

  1. Start the FastAPI backend:

    fastapi dev hopcroft_skill_classification_tool_competition/main.py
    
  2. In a new terminal, start Streamlit:

    streamlit run streamlit_app.py
    
  3. Open your browser at http://localhost:8501

Features

  • Interactive web interface for skill prediction
  • Real-time predictions with confidence scores
  • Adjustable confidence threshold
  • Multiple input modes (quick/detailed/examples)
  • Visual result display
  • API health monitoring

Demo Walkthrough

Main Dashboard

(Screenshot: gui_main_dashboard)

The main interface provides:

  • Sidebar: API health status, confidence threshold slider, model info
  • Three input modes: Quick Input, Detailed Input, Examples

Quick Input Mode

(Screenshot: gui_quick_input)

Simply paste your GitHub issue text and click "Predict Skills"!

Prediction Results

(Screenshot: gui_detailed)

View:

  • Top predictions with confidence scores
  • Full predictions table with filtering
  • Processing metrics (time, model version)
  • Raw JSON response (expandable)

Detailed Input Mode

(Screenshot: gui_detailed_input)

Add optional metadata:

  • Repository name
  • PR number
  • Detailed description

Example Gallery

(Screenshot: gui_ex)

Test with pre-loaded examples:

  • Authentication bugs
  • ML features
  • Database issues
  • UI enhancements

Usage

  1. Enter GitHub issue/PR text in the input area
  2. (Optional) Add description, repo name, PR number
  3. Click "Predict Skills"
  4. View results with confidence scores
  5. Adjust threshold slider to filter predictions
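The threshold slider in step 5 can be sketched as a simple filter over the returned probabilities (illustrative, not the GUI's actual code):

```python
def filter_by_threshold(probabilities, threshold):
    """Keep skills whose confidence meets the slider threshold,
    highest-confidence first."""
    kept = [(s, p) for s, p in probabilities.items() if p >= threshold]
    return sorted(kept, key=lambda sp: sp[1], reverse=True)

probs = {"backend": 0.82, "testing": 0.40, "database": 0.65}
filter_by_threshold(probs, 0.5)  # → [("backend", 0.82), ("database", 0.65)]
```

Raising the threshold trims low-confidence skills from the table; lowering it shows the full multi-label output.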