---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---
# Hopcroft_Skill-Classification-Tool-Competition
The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
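To make the multi-label framing concrete, each issue's skill set can be encoded as a binary indicator vector over the full skill vocabulary. The sketch below uses hypothetical skill names, not the actual SkillScope label set:

```python
# Encode per-issue skill sets as multi-hot vectors (hypothetical labels).
SKILLS = ["backend", "database", "devops", "testing", "ui"]

def to_multi_hot(issue_skills: set[str]) -> list[int]:
    """One row of the label matrix: 1 if the issue requires that skill."""
    return [1 if s in issue_skills else 0 for s in SKILLS]

issues = [
    {"database", "backend"},           # needs two skills
    {"testing"},                       # single-skill issue
    {"devops", "testing", "backend"},  # three skills
]
Y = [to_multi_hot(s) for s in issues]
print(Y)
```

Each row may contain several 1s, which is what distinguishes multi-label classification from ordinary multi-class classification, where exactly one label is active per sample.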
## Project Organization

```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration
│                         for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition  <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
```
## Setup

### MLflow Credentials Configuration

Set up DagsHub credentials for MLflow tracking.

Get your token: DagsHub → Profile → Settings → Tokens

**Option 1: Using a `.env` file (recommended for local development)**
```bash
# Copy the template
cp .env.example .env

# Edit .env with your credentials
```
Your `.env` file should contain:

```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```
The `.env` file is git-ignored for security. Never commit credentials to version control.
**Option 2: Using Docker Compose**

When using Docker Compose, the `.env` file is loaded automatically via the `env_file` directive in `docker-compose.yml`.

```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```
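For scripts run outside Docker Compose, the same `.env` values need to reach the process environment. A minimal stdlib sketch (no `python-dotenv` dependency; it handles only plain `KEY=VALUE` lines, which is all the file above contains):

```python
import os

def load_env(path: str = ".env") -> None:
    """Export KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):  # skip blanks and comments
                continue
            key, _, value = line.partition("=")
            # Existing environment variables take precedence
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
# MLflow reads MLFLOW_TRACKING_URI / USERNAME / PASSWORD from the
# environment, so no further wiring is needed after this point.
```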
## CI Configuration

This project uses GitHub Actions for Continuous Integration.

### Secrets

To enable DVC model pulling, configure these repository secrets:

- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.
## Milestone Summary

### Milestone 1

We compiled the ML Canvas and defined:

- Problem: multi-label classification of skills for PRs/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.
### Milestone 2

We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:

**Data Management**

- DVC setup (raw dataset and TF-IDF features tracked) with a DagsHub remote; dedicated gitignores for data/models.

**Data Ingestion & EDA**

- `dataset.py` to download and extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
- Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).

**Feature Engineering**

- `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy arrays (`features_tfidf.npy`, `labels_tfidf.npy`).
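The cleaning step in `features.py` can be approximated with plain regexes. This sketch covers only the URL/HTML/markdown stripping and normalization stages (the Porter stemming and TF-IDF stages need `nltk` and `scikit-learn` and are omitted here):

```python
import re

def clean_github_text(text: str) -> str:
    """Normalize GitHub issue text before vectorization (simplified sketch)."""
    text = text.lower()
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"https?://\S+", " ", text)                # URLs
    text = re.sub(r"<[^>]+>", " ", text)                     # HTML tags
    text = re.sub(r"[#*_`>\[\]()]", " ", text)               # markdown punctuation
    text = re.sub(r"\s+", " ", text)                         # collapse whitespace
    return text.strip()

print(clean_github_text("Fix **login** bug, see https://github.com/x/y/issues/1 <br>"))
# -> "fix login bug, see"
```

The cleaned strings would then feed a `TfidfVectorizer(ngram_range=(1, 2))` to produce the uni+bi-gram features described above.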
**Central Config**

- `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
**Modeling & Experiments**

- Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
- GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
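Micro-F1, the metric used for model selection above, pools true positives, false positives, and false negatives across all labels before computing F1. A stdlib sketch on toy multi-hot labels, equivalent to `sklearn.metrics.f1_score(..., average="micro")`:

```python
def micro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Micro-averaged F1 over a multi-hot label matrix."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t and p              # predicted 1, actually 1
            fp += (not t) and p        # predicted 1, actually 0
            fn += t and (not p)        # predicted 0, actually 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_f1(y_true, y_pred))  # 2 TP, 1 FP, 1 FN -> 0.666...
```

Because counts are pooled before averaging, frequent labels dominate the score, which is why imbalance handling (MLSMOTE, ROS, ADASYN) matters alongside it.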
**Imbalance Handling**

- Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.
**Tracking & Reproducibility**

- Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
**Tooling**

- Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
### Milestone 3 (QA)

We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:

**Data Cleaning Pipeline**

- `data_cleaning.py`: removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
- Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
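The majority-voting step can be illustrated with a small sketch: when the same issue appears with different label vectors, each label is kept only if it is set in a strict majority of the duplicates. The helper below is hypothetical; the real logic lives in `data_cleaning.py`:

```python
def resolve_conflicts(label_rows: list[list[int]]) -> list[int]:
    """Collapse duplicate samples into one label vector by majority vote."""
    n = len(label_rows)
    merged = []
    for votes in zip(*label_rows):  # one tuple of votes per label column
        merged.append(1 if sum(votes) * 2 > n else 0)
    return merged

# Three copies of the same issue, annotated inconsistently:
rows = [[1, 0, 1],
        [1, 1, 0],
        [1, 0, 0]]
print(resolve_conflicts(rows))  # [1, 0, 0]: only label 0 has a majority (3/3)
```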
**Great Expectations Validation (10 tests)**

- Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
- Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
- All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.
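The spirit of these expectations can be shown without Great Expectations itself. A stdlib sketch of four of the checks listed above (finite features, binary labels, minimum label frequency for stratification, minimum non-zero features for SMOTE), with default thresholds matching the text:

```python
import math

def validate(features, labels, min_label_count=5, min_nonzero=10):
    """Return a dict of check-name -> bool, mirroring a few of the expectations."""
    results = {}
    # No NaN/Inf anywhere in the feature matrix
    results["finite_features"] = all(
        math.isfinite(v) for row in features for v in row
    )
    # Labels are strictly binary {0, 1}
    results["binary_labels"] = all(v in (0, 1) for row in labels for v in row)
    # Each label occurs often enough to stratify on
    counts = [sum(col) for col in zip(*labels)]
    results["stratifiable"] = all(c >= min_label_count for c in counts)
    # Each sample has enough non-zero features for SMOTE neighbors
    results["smote_compatible"] = all(
        sum(1 for v in row if v != 0) >= min_nonzero for row in features
    )
    return results
```

The actual suite runs against the DVC-tracked arrays and serializes its results to the JSON reports mentioned above.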
**Deepchecks Validation (24 checks across 2 suites)**

- Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
- Train-Test Validation Suite (100% score): zero data leakage, proper train/test split, feature/label drift analysis.
- Cleaned data achieved production-ready status (96% overall score).
**Behavioral Testing (36 tests)**

- Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
- Directional tests (10): keyword addition effects, technical detail impact on predictions.
- Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
- All tests passed; comprehensive report in `reports/behavioral/`.
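An invariance test asserts that small, meaning-preserving perturbations leave predictions unchanged. A self-contained sketch, with a stub keyword predictor standing in for the trained model (the real suite loads the Random Forest from `models/`):

```python
def predict_skills(text: str) -> set[str]:
    """Stub predictor: keyword lookup standing in for the trained model."""
    text = text.lower()
    skills = set()
    if "database" in text or "sql" in text:
        skills.add("database")
    if "test" in text:
        skills.add("testing")
    return skills

def test_typo_invariance():
    base = predict_skills("SQL query fails on the test database")
    # Perturbations that should NOT change the predicted skill set
    perturbed = [
        "SQL query fails on teh test database",    # typo in a non-keyword
        "sql QUERY fails on the TEST database",    # case changes
        "SQL query fails on the test database!!",  # punctuation noise
    ]
    for variant in perturbed:
        assert predict_skills(variant) == base, variant

test_typo_invariance()
print("invariance tests passed")
```

Directional tests follow the same pattern in reverse: adding a strong keyword (e.g. "Dockerfile") should move the corresponding skill's probability up, not leave it unchanged.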
**Code Quality Analysis**

- Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
- PEP 8 compliant, Black compatible (line length 88).
**Documentation**

- Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
- Behavioral testing README with test categories, usage examples, and an extension guide.
**Tooling**

- Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
- Automated test execution and report generation.
### Milestone 4 (API)

We implemented a production-ready FastAPI service for skill prediction with MLflow integration:

**Features**

- REST API endpoints:
  - `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
  - `GET /predictions/{run_id}` - Retrieve a prediction by MLflow run ID
  - `GET /predictions` - List recent predictions with pagination
  - `GET /health` - Health check endpoint
- Model Management: loads the trained Random Forest + TF-IDF vectorizer from `models/`
- MLflow Tracking: all predictions logged with metadata, probabilities, and timestamps
- Input Validation: Pydantic models for request/response validation
- Interactive Docs: auto-generated Swagger UI and ReDoc
### API Usage

**1. Start the API Server**

```bash
# Development mode (auto-reload)
make api-dev

# Production mode
make api-run
```

The server starts at http://127.0.0.1:8000.
**2. Test Endpoints**

Option A: Swagger UI (recommended)

- Navigate to http://127.0.0.1:8000/docs
- Interactive interface to test all endpoints
- View request/response schemas

Option B: Make commands

```bash
# Test all endpoints
make test-api-all

# Individual endpoints
make test-api-health   # Health check
make test-api-predict  # Single prediction
make test-api-list     # List predictions
```
### Prerequisites

- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)
### MLflow Integration

- All predictions logged to https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata
### Docker

Build and run the API in a container:

```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```

Endpoints:

- Swagger UI: http://localhost:8080/docs
- Health check: http://localhost:8080/health
## Docker Compose Usage

Docker Compose orchestrates both the API backend and the Streamlit GUI with proper networking and configuration.

### Prerequisites

Create your environment file:

```bash
cp .env.example .env
```

Edit `.env` with your actual credentials:

```
MLFLOW_TRACKING_USERNAME=your_dagshub_username
MLFLOW_TRACKING_PASSWORD=your_dagshub_token
```

Get your token from https://dagshub.com/user/settings/tokens.
### Quick Start

**1. Build and Start All Services**

Build both images and start the containers:

```bash
docker-compose up -d --build
```

| Flag | Description |
|---|---|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |

Available services:

- API (FastAPI): http://localhost:8080/docs
- GUI (Streamlit): http://localhost:8501
- Health check: http://localhost:8080/health
**2. Stop All Services**

Stop and remove containers and networks:

```bash
docker-compose down
```

| Flag | Description |
|---|---|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |
**3. Restart Services**

After updating `.env` or configuration files:

```bash
docker-compose restart
```

Or for a full restart with environment reload:

```bash
docker-compose down
docker-compose up -d
```
**4. Check Status**

View the status of all running services:

```bash
docker-compose ps
```

Or use plain Docker commands:

```bash
docker ps
```
**5. View Logs**

Tail logs from both services in real time:

```bash
docker-compose logs -f
```

View logs from a specific service:

```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```

| Flag | Description |
|---|---|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only the last 100 lines: `docker-compose logs --tail 100` |
**6. Execute Commands in a Container**

Open an interactive shell inside a running container:

```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```

Examples of useful commands inside the API container:

```bash
# Check installed packages
pip list

# Run Python interactively
python

# Check the model file exists
ls -la /app/models/

# Verify environment variables
printenv | grep MLFLOW
```
### Architecture Overview
**Docker Compose orchestrates two services:**
```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```
**External Access:**

- API: http://localhost:8080
- GUI: http://localhost:8501

**Internal Communication:**

- GUI → API: http://hopcroft-api:8080 (via the Docker network)
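Inside the GUI container the API is reached by service name rather than localhost. A sketch of how the Streamlit app can resolve the base URL from the `API_BASE_URL` environment variable set in `docker-compose.yml` (the helper name is hypothetical):

```python
import os

def api_url(path: str) -> str:
    """Build an API URL from API_BASE_URL, defaulting to local dev."""
    base = os.environ.get("API_BASE_URL", "http://127.0.0.1:8000")
    return base.rstrip("/") + "/" + path.lstrip("/")

# Inside the hopcroft-gui container this resolves via the Docker network:
#   api_url("predict") -> "http://hopcroft-api:8080/predict"
# Outside Compose it falls back to the local dev server.
```

In the Streamlit app this URL would then feed a call like `requests.post(api_url("predict"), json={"text": ...})`.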
### Services Description
**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
  - Random Forest model with TF-IDF features
  - MLflow experiment tracking
  - Auto-reload in development mode
  - Health check endpoint
**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
  - User-friendly interface for skill prediction
  - Real-time communication with the API
  - Automatic reconnection on API restart
  - Depends on API health before starting
### Development vs Production
**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network
**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network
For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.
### Troubleshooting
#### Issue: GUI shows "API is not available"
**Solution:**
1. Wait 30-60 seconds for API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`
#### Issue: "500 Internal Server Error" on predictions
**Solution:**
1. Verify MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`
#### Issue: Changes to code not reflected
**Solution:**
- For Python code changes: Auto-reload is enabled, wait a few seconds
- For Dockerfile changes: Rebuild with `docker-compose up -d --build`
- For `.env` changes: Restart with `docker-compose down && docker-compose up -d`
#### Issue: Port already in use
**Solution:**
```bash
# Check what's using the port (Windows; use `lsof -i :8080` on macOS/Linux)
netstat -ano | findstr :8080
netstat -ano | findstr :8501

# Stop existing containers
docker-compose down
```

Alternatively, change the published ports in `docker-compose.yml`.
## Demo UI (Streamlit)

The Streamlit GUI provides an interactive web interface for the skill classification API.

### Features

- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design

### Usage

1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: http://localhost:8501
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5. View results in the predictions table
### Architecture

- Frontend: Streamlit (Python web framework)
- Communication: HTTP requests to the FastAPI backend via the Docker network
- Independence: GUI and API run in separate containers
- Auto-reload: GUI code changes are reflected immediately (bind mount)

Both services must run simultaneously, in different terminals or containers.

### Quick Start

Start the FastAPI backend:

```bash
fastapi dev hopcroft_skill_classification_tool_competition/main.py
```

In a new terminal, start Streamlit:

```bash
streamlit run streamlit_app.py
```

Open your browser:

- Streamlit UI: http://localhost:8501
- FastAPI Docs: http://localhost:8000/docs
### Features

- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring
### Demo Walkthrough

**Main Dashboard**

The main interface provides:

- Sidebar: API health status, confidence threshold slider, model info
- Three input modes: Quick Input, Detailed Input, Examples

**Quick Input Mode**

Simply paste your GitHub issue text and click "Predict Skills".

**Prediction Results**

- Top predictions with confidence scores
- Full predictions table with filtering
- Processing metrics (time, model version)
- Raw JSON response (expandable)

**Detailed Input Mode**

- Repository name
- PR number
- Detailed description

**Example Gallery**

Test with pre-loaded examples:

- Authentication bugs
- ML features
- Database issues
- UI enhancements

**Usage**

1. Enter GitHub issue/PR text in the input area
2. (Optional) Add a description, repo name, and PR number
3. Click "Predict Skills"
4. View results with confidence scores
5. Adjust the threshold slider to filter predictions



