---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---
# Hopcroft_Skill-Classification-Tool-Competition

The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.

## Project Organization
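For readers new to the setting, a multi-label target is usually encoded as a binary indicator vector over the full skill vocabulary. The sketch below (plain Python, with made-up skill names) illustrates the idea:

```python
# Minimal multi-label encoding sketch: each issue maps to a multi-hot
# vector over the skill vocabulary (skill names here are illustrative).
SKILLS = ["backend", "database", "devops", "frontend"]

def encode(issue_skills):
    """Return a multi-hot vector: 1 if the skill is required, else 0."""
    required = set(issue_skills)
    return [1 if s in required else 0 for s in SKILLS]

# One issue may need several skills at once -- that is what makes the
# task multi-label rather than multi-class.
print(encode(["database", "backend"]))  # -> [1, 1, 0, 0]
print(encode(["frontend"]))             # -> [0, 0, 0, 1]
```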
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration
│                         for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
```
--------
## Setup

### MLflow Credentials Configuration

Set up DagsHub credentials for MLflow tracking.

**Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens

#### Option 1: Using `.env` file (Recommended for local development)

```bash
# Copy the template
cp .env.example .env

# Edit .env with your credentials
```

Your `.env` file should contain:

```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```

> [!NOTE]
> The `.env` file is git-ignored for security. Never commit credentials to version control.
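If you want to load these variables from Python without extra tooling, a minimal stdlib sketch (mirroring what a library like `python-dotenv` does; the file name and parsing rules are simplified) is:

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into os.environ (no quoting rules)."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# After loading, MLflow reads the MLFLOW_* variables automatically, e.g.:
# mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
```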
#### Option 2: Using Docker Compose

When using Docker Compose, the `.env` file is automatically loaded via the `env_file` directive in `docker-compose.yml`.

```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```

--------
## CI Configuration
[![CI](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)

This project uses automatically triggered GitHub Actions workflows for Continuous Integration.
### Secrets

To enable DVC model pulling, configure these Repository Secrets:

- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.

--------
## Milestone Summary

### Milestone 1

We compiled the ML Canvas and defined:

- Problem: multi-label classification of skills for PRs/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.

### Milestone 2

We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:

1. Data Management
   - DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.
2. Data Ingestion & EDA
   - `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
   - Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).
3. Feature Engineering
   - `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).
4. Central Config
   - `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
5. Modeling & Experiments
   - Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
   - GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
6. Imbalance Handling
   - Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.
7. Tracking & Reproducibility
   - Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
8. Tooling
   - Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
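The project's `features.py` is not reproduced here, but a simplified stand-alone sketch of the same kind of GitHub-text normalization gives the flavor (the regexes are illustrative and Porter stemming is omitted):

```python
import re

def clean_github_text(text: str) -> str:
    """Rough normalization of GitHub issue text before vectorization."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"https?://\S+", " ", text)                # URLs
    text = re.sub(r"<[^>]+>", " ", text)                     # HTML tags
    text = re.sub(r"[#*_`>\[\]()]", " ", text)               # markdown symbols
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                 # non-letters
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_github_text("Fix **login** bug, see https://example.com <br> #42"))
# -> "fix login bug see"
```

The cleaned strings would then feed a TF-IDF vectorizer over unigrams and bigrams, as described above.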
### Milestone 3 (QA)

We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:

1. **Data Cleaning Pipeline**
   - `data_cleaning.py`: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
   - Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
2. **Great Expectations Validation** (10 tests)
   - Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
   - Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
   - All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.
3. **Deepchecks Validation** (24 checks across 2 suites)
   - Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
   - Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
   - Cleaned data achieved production-ready status (96% overall score).
4. **Behavioral Testing** (36 tests)
   - Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
   - Directional tests (10): keyword addition effects, technical detail impact on predictions.
   - Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
   - All tests passed; comprehensive report in `reports/behavioral/`.
5. **Code Quality Analysis**
   - Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
   - PEP 8 compliant, Black compatible (line length 88).
6. **Documentation**
   - Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
   - Behavioral testing README with test categories, usage examples, and extension guide.
7. **Tooling**
   - Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
   - Automated test execution and report generation.
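As an illustration of the invariance category, a typo-robustness check boils down to asserting that a small input perturbation leaves the predicted skill set unchanged. The `predict_skills` function below is a toy stand-in, not the project's real model:

```python
# Invariance-test sketch: small perturbations of the input should not
# change the predicted skill set. `predict_skills` is a toy stand-in.
def predict_skills(text: str) -> set:
    text = text.lower()
    skills = set()
    if "database" in text or "databse" in text:  # tolerate a common typo
        skills.add("database")
    if "api" in text:
        skills.add("backend")
    return skills

def test_typo_invariance():
    original = predict_skills("The database API returns a 500 error")
    perturbed = predict_skills("The databse API returns a 500 error")
    assert original == perturbed, "prediction changed under a typo"

test_typo_invariance()
```

The real suite applies the same pattern with the trained model and many perturbation generators (synonyms, casing, punctuation/URL noise).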
### Milestone 4 (API)

We implemented a production-ready FastAPI service for skill prediction with MLflow integration:

#### Features

- **REST API Endpoints**:
  - `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
  - `GET /predictions/{run_id}` - Retrieve prediction by MLflow run ID
  - `GET /predictions` - List recent predictions with pagination
  - `GET /health` - Health check endpoint
- **Model Management**: Loads trained Random Forest + TF-IDF vectorizer from `models/`
- **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
- **Input Validation**: Pydantic models for request/response validation
- **Interactive Docs**: Auto-generated Swagger UI and ReDoc
#### API Usage

**1. Start the API Server**

```bash
# Development mode (auto-reload)
make api-dev

# Production mode
make api-run
```

Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)

**2. Test Endpoints**

**Option A: Swagger UI (Recommended)**
- Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Interactive interface to test all endpoints
- View request/response schemas

**Option B: Make Commands**

```bash
# Test all endpoints
make test-api-all

# Individual endpoints
make test-api-health   # Health check
make test-api-predict  # Single prediction
make test-api-list     # List predictions
```

#### Prerequisites

- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)

#### MLflow Integration

- All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata

#### Docker

Build and run the API in a container:

```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```

Endpoints:
- Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
- Health check: [http://localhost:8080/health](http://localhost:8080/health)
---
## Docker Compose Usage

Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.

### Prerequisites

1. **Create your environment file:**

   ```bash
   cp .env.example .env
   ```

2. **Edit `.env`** with your actual credentials:

   ```
   MLFLOW_TRACKING_USERNAME=your_dagshub_username
   MLFLOW_TRACKING_PASSWORD=your_dagshub_token
   ```

   Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)
### Quick Start

#### 1. Build and Start All Services

Build both images and start the containers:

```bash
docker-compose up -d --build
```

| Flag | Description |
|------|-------------|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |

**Available Services:**
- **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
- **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
- **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)

#### 2. Stop All Services

Stop and remove containers and networks:

```bash
docker-compose down
```

| Flag | Description |
|------|-------------|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |

#### 3. Restart Services

After updating `.env` or configuration files:

```bash
docker-compose restart
```

Or for a full restart with environment reload:

```bash
docker-compose down
docker-compose up -d
```

#### 4. Check Status

View the status of all running services:

```bash
docker-compose ps
```

Or use Docker commands:

```bash
docker ps
```

#### 5. View Logs

Tail logs from both services in real-time:

```bash
docker-compose logs -f
```

View logs from a specific service:

```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```

| Flag | Description |
|------|-------------|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |

#### 6. Execute Commands in Container

Open an interactive shell inside a running container:

```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```

Examples of useful commands inside the API container:

```bash
# Check installed packages
pip list

# Run Python interactively
python

# Check model files exist
ls -la /app/models/

# Verify environment variables
printenv | grep MLFLOW
```
### Architecture Overview

**Docker Compose orchestrates two services:**

```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```

**External Access:**
- API: http://localhost:8080
- GUI: http://localhost:8501

**Internal Communication:**
- GUI → API: http://hopcroft-api:8080 (via Docker network)
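Inside the GUI container the API address therefore comes from the `API_BASE_URL` environment variable rather than `localhost`. A minimal stdlib sketch of resolving it (the helper name is our own) is:

```python
import os

def api_url(path: str) -> str:
    """Build an API URL from the environment, defaulting to the Docker
    service name used in docker-compose.yml."""
    base = os.environ.get("API_BASE_URL", "http://hopcroft-api:8080")
    return f"{base.rstrip('/')}/{path.lstrip('/')}"

print(api_url("/predict"))  # -> http://hopcroft-api:8080/predict (when unset)
# Outside Docker you would export API_BASE_URL=http://localhost:8080
```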
### Services Description

**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
  - Random Forest model with embedding features
  - MLflow experiment tracking
  - Auto-reload in development mode
  - Health check endpoint

**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
  - User-friendly interface for skill prediction
  - Real-time communication with API
  - Automatic reconnection on API restart
  - Depends on API health before starting

### Development vs Production

**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network

**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network

For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.
### Troubleshooting

#### Issue: GUI shows "API is not available"

**Solution:**
1. Wait 30-60 seconds for the API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`

#### Issue: "500 Internal Server Error" on predictions

**Solution:**
1. Verify the MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`

#### Issue: Changes to code not reflected

**Solution:**
- For Python code changes: auto-reload is enabled, wait a few seconds
- For Dockerfile changes: rebuild with `docker-compose up -d --build`
- For `.env` changes: restart with `docker-compose down && docker-compose up -d`

#### Issue: Port already in use

**Solution:**

```bash
# Check what's using the port
netstat -ano | findstr :8080
netstat -ano | findstr :8501

# Stop existing containers
docker-compose down

# Or change ports in docker-compose.yml
```
--------
## Hugging Face Spaces Deployment

This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.

### 1. Setup Space

1. Create a new Space on Hugging Face.
2. Select **Docker** as the SDK.
3. Choose the **Blank** template or upload your code.

### 2. Configure Secrets

To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:

| Name | Type | Description |
|------|------|-------------|
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings → Tokens). |

> [!IMPORTANT]
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.

### 3. Automated Startup

The deployment follows this automated flow:

1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
2. **scripts/start_space.sh**:
   - Configures DVC with your secrets.
   - Pulls models from the DagsHub remote.
   - Starts the **FastAPI** backend (port 8000).
   - Starts the **Streamlit** frontend (port 8501).
   - Starts **Nginx** (port 7860) as a reverse proxy to route traffic.

### 4. Direct Access

Once deployed, your Space will be available at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`

The API documentation will be accessible at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
--------
## Demo UI (Streamlit)

The Streamlit GUI provides an interactive web interface for the skill classification API.

### Features

- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design

### Usage

1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5. View results in the predictions table

### Architecture

- **Frontend**: Streamlit (Python web framework)
- **Communication**: HTTP requests to FastAPI backend via Docker network
- **Independence**: GUI and API run in separate containers
- **Auto-reload**: GUI code changes are reflected immediately (bind mount)

> The GUI and the API must run **simultaneously**, in separate terminals or containers.
### Quick Start (without Docker)

1. **Start the FastAPI backend:**

   ```bash
   fastapi dev hopcroft_skill_classification_tool_competition/main.py
   ```

2. **In a new terminal, start Streamlit:**

   ```bash
   streamlit run streamlit_app.py
   ```

3. **Open your browser:**
   - Streamlit UI: http://localhost:8501
   - FastAPI Docs: http://localhost:8000/docs

### GUI Features

- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring
### Demo Walkthrough

#### Main Dashboard

![Main Dashboard]()

The main interface provides:
- **Sidebar**: API health status, confidence threshold slider, model info
- **Three input modes**: Quick Input, Detailed Input, Examples

#### Quick Input Mode

![Quick Input]()

Simply paste your GitHub issue text and click "Predict Skills"!

#### Prediction Results

![Prediction Results]()

View:
- **Top predictions** with confidence scores
- **Full predictions table** with filtering
- **Processing metrics** (time, model version)
- **Raw JSON response** (expandable)

#### Detailed Input Mode

![Detailed Input]()

Add optional metadata:
- Repository name
- PR number
- Detailed description

#### Example Gallery

![Examples]()

Test with pre-loaded examples:
- Authentication bugs
- ML features
- Database issues
- UI enhancements

### Usage

1. Enter GitHub issue/PR text in the input area
2. (Optional) Add description, repo name, PR number
3. Click "Predict Skills"
4. View results with confidence scores
5. Adjust the threshold slider to filter predictions
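The threshold slider in step 5 maps to a simple filter over per-skill probabilities; conceptually (skill names and scores here are invented):

```python
def filter_predictions(probabilities: dict, threshold: float) -> dict:
    """Keep only skills whose confidence meets the slider threshold."""
    return {s: p for s, p in probabilities.items() if p >= threshold}

probs = {"backend": 0.82, "database": 0.55, "frontend": 0.12}
print(filter_predictions(probs, 0.5))   # -> {'backend': 0.82, 'database': 0.55}
print(filter_predictions(probs, 0.75))  # -> {'backend': 0.82}
```

Raising the threshold trades recall for precision: fewer skills are shown, but each shown skill carries higher model confidence.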