---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---

# Hopcroft_Skill-Classification-Tool-Competition

The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.

## Project Organization

```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration
│                         for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition  <- Source code for use in this project.
    ├── __init__.py    <- Makes hopcroft_skill_classification_tool_competition a Python module
    ├── config.py      <- Store useful variables and configuration
    ├── dataset.py     <- Scripts to download or generate data
    ├── features.py    <- Code to create features for modeling
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    └── plots.py       <- Code to create visualizations
```

--------

## Setup

### MLflow Credentials Configuration

Set up DagsHub credentials for MLflow tracking.

**Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens

#### Option 1: Using `.env` file (Recommended for local development)

```bash
# Copy the template
cp .env.example .env

# Edit .env with your credentials
```

Your `.env` file should contain:

```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```

> [!NOTE]
> The `.env` file is git-ignored for security. Never commit credentials to version control.

#### Option 2: Using Docker Compose

When using Docker Compose, the `.env` file is automatically loaded via the `env_file` directive in `docker-compose.yml`.

```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```

--------

## CI Configuration

[![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)

This project uses GitHub Actions for Continuous Integration; the pipeline is triggered automatically.
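The same `MLFLOW_TRACKING_*` variables are consumed both locally (read from `.env`) and in CI (injected from repository secrets). As a minimal, dependency-free sketch of that pickup, the `load_env` helper below is hypothetical and not part of this repo; in practice `python-dotenv` does the same job:

```python
# Sketch: load the KEY=VALUE pairs from the .env template above into the
# process environment. `load_env` is a hypothetical helper for illustration;
# real environment variables (e.g. CI secrets) always take precedence.
import os
from pathlib import Path


def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


for key, value in load_env().items():
    os.environ.setdefault(key, value)  # existing env (e.g. CI secrets) wins
```

MLflow itself reads `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` from the environment when contacting the remote, so nothing further is needed once the variables are set.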
### Secrets

To enable DVC model pulling, configure these Repository Secrets:

- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.

--------

## Milestone Summary

### Milestone 1

We compiled the ML Canvas and defined:

- Problem: multi-label classification of skills for PR/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.

### Milestone 2

We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:

1. Data Management
   - DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.
2. Data Ingestion & EDA
   - `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
   - Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).
3. Feature Engineering
   - `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).
4. Central Config
   - `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
5. Modeling & Experiments
   - Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
   - GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
6. Imbalance Handling
   - Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.
7. Tracking & Reproducibility
   - Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
8.
Tooling
   - Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).

### Milestone 3 (QA)

We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:

1. **Data Cleaning Pipeline**
   - `data_cleaning.py`: removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
   - Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
2. **Great Expectations Validation** (10 tests)
   - Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
   - Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
   - All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.
3. **Deepchecks Validation** (24 checks across 2 suites)
   - Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
   - Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
   - Cleaned data achieved production-ready status (96% overall score).
4. **Behavioral Testing** (36 tests)
   - Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
   - Directional tests (10): keyword addition effects, technical detail impact on predictions.
   - Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
   - All tests passed; comprehensive report in `reports/behavioral/`.
5.
**Code Quality Analysis**
   - Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
   - PEP 8 compliant, Black compatible (line length 88).
6. **Documentation**
   - Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
   - Behavioral testing README with test categories, usage examples, and extension guide.
7. **Tooling**
   - Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
   - Automated test execution and report generation.

### Milestone 4 (API)

We implemented a production-ready FastAPI service for skill prediction with MLflow integration:

#### Features

- **REST API Endpoints**:
  - `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
  - `GET /predictions/{run_id}` - Retrieve a prediction by MLflow run ID
  - `GET /predictions` - List recent predictions with pagination
  - `GET /health` - Health check endpoint
- **Model Management**: Loads the trained Random Forest + TF-IDF vectorizer from `models/`
- **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
- **Input Validation**: Pydantic models for request/response validation
- **Interactive Docs**: Auto-generated Swagger UI and ReDoc

#### API Usage

**1. Start the API Server**

```bash
# Development mode (auto-reload)
make api-dev

# Production mode
make api-run
```

Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)

**2.
Test Endpoints**

**Option A: Swagger UI (Recommended)**

- Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Interactive interface to test all endpoints
- View request/response schemas

**Option B: Make Commands**

```bash
# Test all endpoints
make test-api-all

# Individual endpoints
make test-api-health   # Health check
make test-api-predict  # Single prediction
make test-api-list     # List predictions
```

#### Prerequisites

- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)

#### MLflow Integration

- All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata

#### Docker

Build and run the API in a container:

```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```

Endpoints:

- Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
- Health check: [http://localhost:8080/health](http://localhost:8080/health)

---

## Docker Compose Usage

Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.

### Prerequisites

1. **Create your environment file:**

   ```bash
   cp .env.example .env
   ```

2. **Edit `.env`** with your actual credentials:

   ```
   MLFLOW_TRACKING_USERNAME=your_dagshub_username
   MLFLOW_TRACKING_PASSWORD=your_dagshub_token
   ```

   Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)

### Quick Start

#### 1. Build and Start All Services

Build both images and start the containers:

```bash
docker-compose up -d --build
```

| Flag | Description |
|------|-------------|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |

**Available Services:**

- **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
- **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
- **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)

#### 2. Stop All Services

Stop and remove containers and networks:

```bash
docker-compose down
```

| Flag | Description |
|------|-------------|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |

#### 3. Restart Services

After updating `.env` or configuration files:

```bash
docker-compose restart
```

Or for a full restart with environment reload:

```bash
docker-compose down
docker-compose up -d
```

#### 4. Check Status

View the status of all running services:

```bash
docker-compose ps
```

Or use Docker commands:

```bash
docker ps
```

#### 5. View Logs

Tail logs from both services in real-time:

```bash
docker-compose logs -f
```

View logs from a specific service:

```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```

| Flag | Description |
|------|-------------|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |

#### 6. Execute Commands in Container

Open an interactive shell inside a running container:

```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```

Examples of useful commands inside the API container:

```bash
# Check installed packages
pip list

# Run Python interactively
python

# Check model file exists
ls -la /app/models/

# Verify environment variables
printenv | grep MLFLOW
```

### Architecture Overview

**Docker Compose orchestrates two services:**

```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```

**External Access:**

- API: http://localhost:8080
- GUI: http://localhost:8501

**Internal Communication:**

- GUI → API: http://hopcroft-api:8080 (via Docker network)

### Services Description

**hopcroft-api (FastAPI Backend)**

- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
  - Random Forest model with TF-IDF features
  - MLflow experiment tracking
  - Auto-reload in development mode
  - Health check endpoint

**hopcroft-gui (Streamlit Frontend)**

- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
  - User-friendly interface for skill prediction
  - Real-time communication
with API
  - Automatic reconnection on API restart
  - Depends on API health before starting

### Development vs Production

**Development (default):**

- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network

**Production:**

- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network

For **production deployment**, modify `docker-compose.yml` to remove the bind mounts and disable reload.

### Troubleshooting

#### Issue: GUI shows "API is not available"

**Solution:**

1. Wait 30-60 seconds for the API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`

#### Issue: "500 Internal Server Error" on predictions

**Solution:**

1. Verify the MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`

#### Issue: Changes to code not reflected

**Solution:**

- For Python code changes: auto-reload is enabled, wait a few seconds
- For Dockerfile changes: rebuild with `docker-compose up -d --build`
- For `.env` changes: restart with `docker-compose down && docker-compose up -d`

#### Issue: Port already in use

**Solution:**

```bash
# Check what's using the port
netstat -ano | findstr :8080
netstat -ano | findstr :8501

# Stop existing containers
docker-compose down

# Or change ports in docker-compose.yml
```

--------

## Hugging Face Spaces Deployment

This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.

### 1. Setup Space

1. Create a new Space on Hugging Face.
2. Select **Docker** as the SDK.
3. Choose the **Blank** template or upload your code.

### 2. Configure Secrets

To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:

| Name | Type | Description |
|------|------|-------------|
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |

> [!IMPORTANT]
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.

### 3. Automated Startup

The deployment follows this automated flow:

1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
2. **scripts/start_space.sh**:
   - Configures DVC with your secrets.
   - Pulls models from the DagsHub remote.
   - Starts the **FastAPI** backend (port 8000).
   - Starts the **Streamlit** frontend (port 8501).
   - Starts **Nginx** (port 7860) as a reverse proxy to route traffic.

### 4. Direct Access

Once deployed, your Space will be available at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`

The API documentation will be accessible at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`

--------

## Demo UI (Streamlit)

The Streamlit GUI provides an interactive web interface for the skill classification API.

### Features

- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design

### Usage

1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5.
View results in the predictions table

### Architecture

- **Frontend**: Streamlit (Python web framework)
- **Communication**: HTTP requests to the FastAPI backend via the Docker network
- **Independence**: GUI and API run in separate containers
- **Auto-reload**: GUI code changes are reflected immediately (bind mount)

> Both must run **simultaneously** in different terminals/containers.

### Quick Start

1. **Start the FastAPI backend:**

   ```bash
   fastapi dev hopcroft_skill_classification_tool_competition/main.py
   ```

2. **In a new terminal, start Streamlit:**

   ```bash
   streamlit run streamlit_app.py
   ```

3. **Open your browser:**
   - Streamlit UI: http://localhost:8501
   - FastAPI Docs: http://localhost:8000/docs

### Features

- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring

### Demo Walkthrough

#### Main Dashboard

![gui_main_dashboard](docs/img/gui_main_dashboard.png)

The main interface provides:

- **Sidebar**: API health status, confidence threshold slider, model info
- **Three input modes**: Quick Input, Detailed Input, Examples

#### Quick Input Mode

![gui_quick_input](docs/img/gui_quick_input.png)

Simply paste your GitHub issue text and click "Predict Skills"!

#### Prediction Results

![gui_detailed](docs/img/gui_detailed.png)

View:

- **Top predictions** with confidence scores
- **Full predictions table** with filtering
- **Processing metrics** (time, model version)
- **Raw JSON response** (expandable)

#### Detailed Input Mode

![gui_detailed_input](docs/img/gui_detailed_input.png)

Add optional metadata:

- Repository name
- PR number
- Detailed description

#### Example Gallery

![gui_ex](docs/img/gui_ex.png)

Test with pre-loaded examples:

- Authentication bugs
- ML features
- Database issues
- UI enhancements

### Usage

1. Enter GitHub issue/PR text in the input area
2.
(Optional) Add description, repo name, PR number
3. Click "Predict Skills"
4. View results with confidence scores
5. Adjust threshold slider to filter predictions
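
Programmatic access follows the same flow as the GUI. Below is a minimal client sketch, assuming the Docker Compose setup (API on port 8080) and a JSON body with a `text` field plus a `predictions` list in the response; those field names are illustrative assumptions, and the authoritative request/response schema is the Swagger UI at `/docs`:

```python
# Minimal client sketch for the prediction API. The request field `text`
# and the response fields `skill` / `probability` are assumptions for
# illustration -- verify them against the Swagger UI at /docs.
import json
import urllib.request

API_URL = "http://localhost:8080"  # Docker Compose mapping; dev mode uses 8000


def filter_predictions(body: dict, threshold: float) -> list[tuple[str, float]]:
    """Keep (skill, probability) pairs at or above the confidence threshold."""
    return [
        (p["skill"], p["probability"])
        for p in body.get("predictions", [])
        if p["probability"] >= threshold
    ]


def predict_skills(text: str, threshold: float = 0.5) -> list[tuple[str, float]]:
    """POST an issue text to /predict and filter the returned predictions."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{API_URL}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return filter_predictions(json.load(response), threshold)
```

Calling `predict_skills("Fix SQL injection in the login endpoint")` against a running stack would return the skills the model flags above the threshold, mirroring the GUI's threshold slider.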