DaCrow13 commited on
Commit
39d224b
·
0 Parent(s):

Deploy to HF Spaces (Clean)

Browse files
This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50) hide show
  1. .dockerignore +31 -0
  2. .dvc/.gitignore +3 -0
  3. .dvc/config +6 -0
  4. .dvcignore +3 -0
  5. .env.example +19 -0
  6. .github/workflows/ci.yml +68 -0
  7. .gitignore +192 -0
  8. Dockerfile +40 -0
  9. Dockerfile.streamlit +29 -0
  10. Makefile +189 -0
  11. README.md +576 -0
  12. data/.gitignore +1 -0
  13. data/README.md +83 -0
  14. data/processed/.gitignore +2 -0
  15. data/processed/embedding.dvc +6 -0
  16. data/processed/tfidf.dvc +6 -0
  17. data/raw.dvc +6 -0
  18. docker-compose.yml +56 -0
  19. docs/.gitkeep +0 -0
  20. docs/ML Canvas.md +39 -0
  21. docs/README.md +12 -0
  22. docs/docs/getting-started.md +6 -0
  23. docs/docs/index.md +10 -0
  24. docs/mkdocs.yml +4 -0
  25. docs/testing_and_validation.md +208 -0
  26. hopcroft_skill_classification_tool_competition/__init__.py +0 -0
  27. hopcroft_skill_classification_tool_competition/api_models.py +221 -0
  28. hopcroft_skill_classification_tool_competition/config.py +137 -0
  29. hopcroft_skill_classification_tool_competition/data_cleaning.py +559 -0
  30. hopcroft_skill_classification_tool_competition/dataset.py +99 -0
  31. hopcroft_skill_classification_tool_competition/features.py +492 -0
  32. hopcroft_skill_classification_tool_competition/main.py +434 -0
  33. hopcroft_skill_classification_tool_competition/mlsmote.py +157 -0
  34. hopcroft_skill_classification_tool_competition/modeling/predict.py +198 -0
  35. hopcroft_skill_classification_tool_competition/modeling/train.py +858 -0
  36. hopcroft_skill_classification_tool_competition/streamlit_app.py +322 -0
  37. hopcroft_skill_classification_tool_competition/threshold_optimization.py +295 -0
  38. models/.gitignore +11 -0
  39. models/.gitkeep +0 -0
  40. models/README.md +206 -0
  41. models/kept_label_indices.npy +0 -0
  42. models/label_names.pkl.dvc +5 -0
  43. models/random_forest_embedding_gridsearch.pkl.dvc +5 -0
  44. models/random_forest_embedding_gridsearch_smote.pkl.dvc +5 -0
  45. models/random_forest_tfidf_gridsearch.pkl.dvc +5 -0
  46. models/random_forest_tfidf_gridsearch_adasyn_pca.pkl.dvc +5 -0
  47. models/random_forest_tfidf_gridsearch_ros.pkl.dvc +5 -0
  48. models/random_forest_tfidf_gridsearch_smote.pkl.dvc +5 -0
  49. models/tfidf_vectorizer.pkl.dvc +5 -0
  50. notebooks/.gitkeep +0 -0
.dockerignore ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ __pycache__
2
+ *.pyc
3
+ *.pyo
4
+ *.pyd
5
+ .Python
6
+ env/
7
+ venv/
8
+ pip-log.txt
9
+ pip-delete-this-directory.txt
10
+ .tox/
11
+ .coverage
12
+ .coverage.*
13
+ .cache
14
+ nosetests.xml
15
+ coverage.xml
16
+ *.cover
17
+ *.log
18
+ .git
19
+ .gitignore
20
+ .mypy_cache
21
+ .pytest_cache
22
+ .hydra
23
+ .dvc/
24
+ data/
25
+ mlruns/
26
+ notebooks/
27
+ reports/
28
+ docs/
29
+ tests/
30
+ scripts/
31
+ !scripts/start_space.sh
.dvc/.gitignore ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ /config.local
2
+ /tmp
3
+ /cache
.dvc/config ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ [cache]
2
+ type = copy
3
+ [core]
4
+ remote = origin
5
+ ['remote "origin"']
6
+ url = https://dagshub.com/se4ai2526-uniba/Hopcroft.dvc
.dvcignore ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ # Add patterns of files dvc should ignore, which could improve
2
+ # the performance. Learn more at
3
+ # https://dvc.org/doc/user-guide/dvcignore
.env.example ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ============================================
2
+ # Hopcroft API Environment Configuration
3
+ # ============================================
4
+ # Copy this file to .env and update with your values
5
+ # Command: cp .env.example .env
6
+ # IMPORTANT: Never commit .env to version control!
7
+
8
+ # MLflow Configuration
9
+ MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
10
+ MLFLOW_TRACKING_USERNAME=your_username
11
+ MLFLOW_TRACKING_PASSWORD=your_token
12
+
13
+ # API Configuration
14
+ API_HOST=0.0.0.0
15
+ API_PORT=8080
16
+ LOG_LEVEL=info
17
+
18
+ # Model Configuration
19
+ MODEL_PATH=/app/models/random_forest_embedding_gridsearch.pkl
.github/workflows/ci.yml ADDED
@@ -0,0 +1,68 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: CI Pipeline
2
+
3
+ on:
4
+ push:
5
+ branches: [ "main", "feature/*" ]
6
+ pull_request:
7
+ branches: [ "main" ]
8
+
9
+ jobs:
10
+ build-and-test:
11
+ runs-on: ubuntu-latest
12
+
13
+ steps:
14
+ - name: Checkout code
15
+ uses: actions/checkout@v3
16
+
17
+ - name: Free Disk Space
18
+ run: |
19
+ sudo rm -rf /usr/share/dotnet
20
+ sudo rm -rf /usr/local/lib/android
21
+ sudo rm -rf /opt/ghc
22
+ sudo rm -rf /opt/hostedtoolcache/CodeQL
23
+ sudo docker image prune --all --force
24
+
25
+ - name: Set up Python 3.10
26
+ uses: actions/setup-python@v4
27
+ with:
28
+ python-version: "3.10"
29
+ cache: 'pip' # Enable caching for pip
30
+
31
+ - name: Install dependencies
32
+ run: |
33
+ python -m pip install --upgrade pip
34
+ # Install CPU-only PyTorch to save space (we don't need CUDA for tests)
35
+ pip install torch --index-url https://download.pytorch.org/whl/cpu
36
+ # Install other dependencies
37
+ pip install -r requirements.txt --no-cache-dir
38
+
39
+ - name: Lint with Ruff
40
+ run: |
41
+ # Using make lint as defined in Makefile
42
+ make lint
43
+
44
+ - name: Run Unit Tests
45
+ run: |
46
+ # Run tests and generate HTML report
47
+ pytest tests/unit/ -v -m unit --html=report.html --self-contained-html
48
+
49
+ - name: Upload Test Report
50
+ if: always() # Upload report even if tests fail
51
+ uses: actions/upload-artifact@v4
52
+ with:
53
+ name: test-report
54
+ path: report.html
55
+
56
+ - name: Configure DVC
57
+ run: |
58
+ dvc remote modify origin --local auth basic
59
+ dvc remote modify origin --local user ${{ secrets.DAGSHUB_USERNAME }}
60
+ dvc remote modify origin --local password ${{ secrets.DAGSHUB_TOKEN }}
61
+
62
+ - name: Pull Models with DVC
63
+ run: |
64
+ dvc pull models/random_forest_embedding_gridsearch.pkl models/label_names.pkl
65
+
66
+ - name: Build Docker Image
67
+ run: |
68
+ docker build . -t hopcroft-app:latest
.gitignore ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ # Mac OS-specific storage files
3
+ .DS_Store
4
+
5
+ # vim
6
+ *.swp
7
+ *.swo
8
+
9
+ ## https://github.com/github/gitignore/blob/e8554d85bf62e38d6db966a50d2064ac025fd82a/Python.gitignore
10
+
11
+ # Byte-compiled / optimized / DLL files
12
+ __pycache__/
13
+ *.py[cod]
14
+ *$py.class
15
+
16
+ # C extensions
17
+ *.so
18
+
19
+ # Distribution / packaging
20
+ .Python
21
+ build/
22
+ develop-eggs/
23
+ dist/
24
+ downloads/
25
+ eggs/
26
+ .eggs/
27
+ lib/
28
+ lib64/
29
+ parts/
30
+ sdist/
31
+ var/
32
+ wheels/
33
+ share/python-wheels/
34
+ *.egg-info/
35
+ .installed.cfg
36
+ *.egg
37
+ MANIFEST
38
+
39
+ # PyInstaller
40
+ # Usually these files are written by a python script from a template
41
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
42
+ *.manifest
43
+ *.spec
44
+
45
+ # Installer logs
46
+ pip-log.txt
47
+ pip-delete-this-directory.txt
48
+
49
+ # Unit test / coverage reports
50
+ htmlcov/
51
+ .tox/
52
+ .nox/
53
+ .coverage
54
+ .coverage.*
55
+ .cache
56
+ nosetests.xml
57
+ coverage.xml
58
+ *.cover
59
+ *.py,cover
60
+ .hypothesis/
61
+ .pytest_cache/
62
+ cover/
63
+
64
+ # Translations
65
+ *.mo
66
+ *.pot
67
+
68
+ # Django stuff:
69
+ *.log
70
+ local_settings.py
71
+ db.sqlite3
72
+ db.sqlite3-journal
73
+
74
+ # Flask stuff:
75
+ instance/
76
+ .webassets-cache
77
+
78
+ # Scrapy stuff:
79
+ .scrapy
80
+
81
+ # MkDocs documentation
82
+ docs/site/
83
+
84
+ # PyBuilder
85
+ .pybuilder/
86
+ target/
87
+
88
+ # Jupyter Notebook
89
+ .ipynb_checkpoints
90
+
91
+ # IPython
92
+ profile_default/
93
+ ipython_config.py
94
+
95
+ # pyenv
96
+ # For a library or package, you might want to ignore these files since the code is
97
+ # intended to run in multiple environments; otherwise, check them in:
98
+ # .python-version
99
+
100
+ # pipenv
101
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
102
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
103
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
104
+ # install all needed dependencies.
105
+ #Pipfile.lock
106
+
107
+ # UV
108
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
109
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
110
+ # commonly ignored for libraries.
111
+ #uv.lock
112
+
113
+ # poetry
114
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
115
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
116
+ # commonly ignored for libraries.
117
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
118
+ #poetry.lock
119
+
120
+ # pdm
121
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
122
+ #pdm.lock
123
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
124
+ # in version control.
125
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
126
+ .pdm.toml
127
+ .pdm-python
128
+ .pdm-build/
129
+
130
+ # pixi
131
+ # pixi.lock should be committed to version control for reproducibility
132
+ # .pixi/ contains the environments and should not be committed
133
+ .pixi/
134
+
135
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
136
+ __pypackages__/
137
+
138
+ # Celery stuff
139
+ celerybeat-schedule
140
+ celerybeat.pid
141
+
142
+ # SageMath parsed files
143
+ *.sage.py
144
+
145
+ # Environments
146
+ .env
147
+ .venv
148
+ env/
149
+ venv/
150
+ ENV/
151
+ env.bak/
152
+ venv.bak/
153
+
154
+ # Spyder project settings
155
+ .spyderproject
156
+ .spyproject
157
+
158
+ # Rope project settings
159
+ .ropeproject
160
+
161
+ # mkdocs documentation
162
+ /site
163
+
164
+ # mypy
165
+ .mypy_cache/
166
+ .dmypy.json
167
+ dmypy.json
168
+
169
+ # Pyre type checker
170
+ .pyre/
171
+
172
+ # pytype static type analyzer
173
+ .pytype/
174
+
175
+ # Cython debug symbols
176
+ cython_debug/
177
+
178
+ # PyCharm
179
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
180
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
181
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
182
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
183
+ #.idea/
184
+
185
+ # Ruff stuff:
186
+ .ruff_cache/
187
+
188
+ # PyPI configuration file
189
+ .pypirc
190
+ .github/copilot-instructions.md
191
+
192
+ docs/img/
Dockerfile ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.10-slim
2
+
3
+ # Set environment variables
4
+ ENV PYTHONDONTWRITEBYTECODE=1 \
5
+ PYTHONUNBUFFERED=1 \
6
+ PIP_NO_CACHE_DIR=off \
7
+ PIP_DISABLE_PIP_VERSION_CHECK=on \
8
+ PIP_DEFAULT_TIMEOUT=100
9
+
10
+ # Install system dependencies
11
+ RUN apt-get update && apt-get install -y \
12
+ git \
13
+ && rm -rf /var/lib/apt/lists/*
14
+
15
+ # Create a non-root user
16
+ RUN useradd -m -u 1000 user
17
+
18
+ # Set working directory
19
+ WORKDIR /app
20
+
21
+ # Copy requirements first for caching
22
+ COPY requirements.txt .
23
+
24
+ # Install dependencies
25
+ RUN pip install --no-cache-dir -r requirements.txt
26
+
27
+ # Copy the rest of the application
28
+ COPY --chown=user:user . .
29
+
30
+ # Make start script executable
31
+ RUN chmod +x scripts/start_space.sh
32
+
33
+ # Switch to non-root user
34
+ USER user
35
+
36
+ # Expose the port
37
+ EXPOSE 7860
38
+
39
+ # Command to run the application
40
+ CMD ["./scripts/start_space.sh"]
Dockerfile.streamlit ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.10-slim
2
+
3
+ WORKDIR /app
4
+
5
+ ENV PYTHONUNBUFFERED=1 \
6
+ PYTHONDONTWRITEBYTECODE=1 \
7
+ PIP_NO_CACHE_DIR=1 \
8
+ PIP_DISABLE_PIP_VERSION_CHECK=1
9
+
10
+ # Create non-root user
11
+ RUN useradd -m -u 1000 appuser
12
+
13
+ # Install only Streamlit dependencies
14
+ RUN pip install --no-cache-dir \
15
+ streamlit>=1.28.0 \
16
+ requests>=2.31.0 \
17
+ pandas>=2.0.0
18
+
19
+ # Copy only the Streamlit app
20
+ COPY --chown=appuser:appuser hopcroft_skill_classification_tool_competition/streamlit_app.py ./
21
+
22
+ EXPOSE 8501
23
+
24
+ USER appuser
25
+
26
+ # Set API URL to point to the API service
27
+ ENV API_BASE_URL=http://hopcroft-api:8080
28
+
29
+ CMD ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
Makefile ADDED
@@ -0,0 +1,189 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #################################################################################
2
+ # GLOBALS #
3
+ #################################################################################
4
+
5
+ PROJECT_NAME = Hopcroft
6
+ PYTHON_VERSION = 3.10
7
+ PYTHON_INTERPRETER = python
8
+
9
+ #################################################################################
10
+ # COMMANDS #
11
+ #################################################################################
12
+
13
+ ## Install Python dependencies
14
+ .PHONY: requirements
15
+ requirements:
16
+ $(PYTHON_INTERPRETER) -m pip install -U pip
17
+ $(PYTHON_INTERPRETER) -m pip install -r requirements.txt
18
+
19
+ ## Delete all compiled Python files
20
+ .PHONY: clean
21
+ clean:
22
+ find . -type f -name "*.py[co]" -delete
23
+ find . -type d -name "__pycache__" -delete
24
+
25
+ ## Lint using ruff
26
+ .PHONY: lint
27
+ lint:
28
+ ruff format --check
29
+ ruff check
30
+
31
+ ## Format source code with ruff
32
+ .PHONY: format
33
+ format:
34
+ ruff check --fix
35
+ ruff format
36
+
37
+ #################################################################################
38
+ # PROJECT RULES #
39
+ #################################################################################
40
+
41
+ ## Download dataset from Hugging Face
42
+ .PHONY: data
43
+ data:
44
+ $(PYTHON_INTERPRETER) -m hopcroft_skill_classification_tool_competition.dataset
45
+
46
+ ## Extract features from raw data
47
+ .PHONY: features
48
+ features:
49
+ $(PYTHON_INTERPRETER) -m hopcroft_skill_classification_tool_competition.features
50
+
51
+ #################################################################################
52
+ # TRAINING RULES #
53
+ #################################################################################
54
+
55
+ ## Train Random Forest baseline with TF-IDF features (cleaned data)
56
+ .PHONY: train-baseline-tfidf
57
+ train-baseline-tfidf:
58
+ $(PYTHON_INTERPRETER) -m hopcroft_skill_classification_tool_competition.modeling.train baseline
59
+
60
+ ## Train Random Forest baseline with Embedding features (cleaned data)
61
+ .PHONY: train-baseline-embeddings
62
+ train-baseline-embeddings:
63
+ $(PYTHON_INTERPRETER) -c "from hopcroft_skill_classification_tool_competition.modeling.train import run_baseline_train; run_baseline_train(feature_type='embedding', use_cleaned=True)"
64
+
65
+ ## Train Random Forest with SMOTE and TF-IDF features (cleaned data)
66
+ .PHONY: train-smote-tfidf
67
+ train-smote-tfidf:
68
+ $(PYTHON_INTERPRETER) -c "from hopcroft_skill_classification_tool_competition.modeling.train import run_smote_experiment, load_data; X, Y = load_data(feature_type='tfidf', use_cleaned=True); run_smote_experiment(X, Y, feature_type='tfidf')"
69
+
70
+ ## Train Random Forest with SMOTE and Embedding features (cleaned data)
71
+ .PHONY: train-smote-embeddings
72
+ train-smote-embeddings:
73
+ $(PYTHON_INTERPRETER) -c "from hopcroft_skill_classification_tool_competition.modeling.train import run_smote_experiment, load_data; X, Y = load_data(feature_type='embedding', use_cleaned=True); run_smote_experiment(X, Y, feature_type='embedding')"
74
+
75
+ #################################################################################
76
+ # TESTING RULES #
77
+ #################################################################################
78
+
79
+ ## Run all unit tests
80
+ .PHONY: test-unit
81
+ test-unit:
82
+ pytest tests/unit/ -v -m unit
83
+
84
+ ## Run all integration tests
85
+ .PHONY: test-integration
86
+ test-integration:
87
+ pytest tests/integration/ -v -m integration
88
+
89
+ ## Run all system tests
90
+ .PHONY: test-system
91
+ test-system:
92
+ pytest tests/system/ -v -m system
93
+
94
+ ## Run all tests (unit, integration, system)
95
+ .PHONY: test-all
96
+ test-all:
97
+ pytest tests/ -v --ignore=tests/behavioral --ignore=tests/deepchecks
98
+
99
+ ## Run tests with coverage report
100
+ .PHONY: test-coverage
101
+ test-coverage:
102
+ pytest tests/ --cov=hopcroft_skill_classification_tool_competition --cov-report=html --cov-report=term
103
+
104
+ ## Run fast tests only (exclude slow tests)
105
+ .PHONY: test-fast
106
+ test-fast:
107
+ pytest tests/ -v -m "not slow" --ignore=tests/behavioral --ignore=tests/deepchecks
108
+
109
+ ## Run behavioral tests
110
+ .PHONY: test-behavioral
111
+ test-behavioral:
112
+ pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
113
+
114
+ ## Run Great Expectations validation
115
+ .PHONY: validate-gx
116
+ validate-gx:
117
+ $(PYTHON_INTERPRETER) -m hopcroft_skill_classification_tool_competition.tests.test_gx
118
+
119
+ ## Run Deepchecks validation
120
+ .PHONY: validate-deepchecks
121
+ validate-deepchecks:
122
+ $(PYTHON_INTERPRETER) tests/deepchecks/run_all_deepchecks.py
123
+
124
+ ## Run all validation and tests
125
+ .PHONY: test-complete
126
+ test-complete: test-all validate-gx validate-deepchecks test-behavioral
127
+
128
+ #################################################################################
129
+ # Self Documenting Commands #
130
+ #################################################################################
131
+
132
+ .DEFAULT_GOAL := help
133
+
134
+ define PRINT_HELP_PYSCRIPT
135
+ import re, sys; \
136
+ lines = '\n'.join([line for line in sys.stdin]); \
137
+ matches = re.findall(r'\n## (.*)\n[\s\S]+?\n([a-zA-Z_-]+):', lines); \
138
+ print('Available rules:\n'); \
139
+ print('\n'.join(['{:25}{}'.format(*reversed(match)) for match in matches]))
140
+ endef
141
+ export PRINT_HELP_PYSCRIPT
142
+
143
+ help:
144
+ @$(PYTHON_INTERPRETER) -c "${PRINT_HELP_PYSCRIPT}" < $(MAKEFILE_LIST)
145
+
146
+ ################################################################################
147
+ # API COMMANDS #
148
+ ################################################################################
149
+
150
+ ## Run API in development mode
151
+ .PHONY: api-dev
152
+ api-dev:
153
+ fastapi dev hopcroft_skill_classification_tool_competition/main.py
154
+
155
+ ## Run API in production mode
156
+ .PHONY: api-run
157
+ api-run:
158
+ fastapi run hopcroft_skill_classification_tool_competition/main.py
159
+
160
+ ## Test API health check (requires running API)
161
+ .PHONY: test-api-health
162
+ test-api-health:
163
+ @echo "Testing API health endpoint..."
164
+ curl -X GET "http://127.0.0.1:8000/health"
165
+
166
+ ## Test API POST /predict (requires running API)
167
+ .PHONY: test-api-predict
168
+ test-api-predict:
169
+ @echo "Testing prediction endpoint..."
170
+ curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"issue_text": "Fix critical bug in authentication and login flow with OAuth2", "repo_name": "my-repo"}'
171
+
172
+ ## Test API GET /predictions (requires running API)
173
+ .PHONY: test-api-list
174
+ test-api-list:
175
+ @echo "Testing list predictions endpoint..."
176
+ curl "http://127.0.0.1:8000/predictions?limit=5"
177
+
178
+ ## Test API GET /predictions/{run_id} (requires running API and valid run_id)
179
+ .PHONY: test-api-get-prediction
180
+ test-api-get-prediction:
181
+ @echo "Testing get specific prediction endpoint..."
182
+ @echo "Usage: make test-api-get-prediction RUN_ID=<your_run_id>"
183
+ @if [ -z "$(RUN_ID)" ]; then echo "Error: RUN_ID not set. Example: make test-api-get-prediction RUN_ID=abc123"; exit 1; fi
184
+ curl "http://127.0.0.1:8000/predictions/$(RUN_ID)"
185
+
186
+ ## Run all API tests (requires running API)
187
+ .PHONY: test-api-all
188
+ test-api-all: test-api-health test-api-predict test-api-list
189
+ @echo "\n All API tests completed!"
README.md ADDED
@@ -0,0 +1,576 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Hopcroft Skill Classification
3
+ emoji: 🧠
4
+ colorFrom: blue
5
+ colorTo: green
6
+ sdk: docker
7
+ app_port: 7860
8
+ ---
9
+
10
+ # Hopcroft_Skill-Classification-Tool-Competition
11
+
12
+ The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
13
+
14
+ ## Project Organization
15
+
16
+ ```
17
+ ├── LICENSE <- Open-source license if one is chosen
18
+ ├── Makefile <- Makefile with convenience commands like `make data` or `make train`
19
+ ├── README.md <- The top-level README for developers using this project.
20
+ ├── data
21
+ │ ├── external <- Data from third party sources.
22
+ │ ├── interim <- Intermediate data that has been transformed.
23
+ │ ├── processed <- The final, canonical data sets for modeling.
24
+ │ └── raw <- The original, immutable data dump.
25
+
26
+ ├── docs <- A default mkdocs project; see www.mkdocs.org for details
27
+
28
+ ├── models <- Trained and serialized models, model predictions, or model summaries
29
+
30
+ ├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
31
+ │ the creator's initials, and a short `-` delimited description, e.g.
32
+ │ `1.0-jqp-initial-data-exploration`.
33
+
34
+ ├── pyproject.toml <- Project configuration file with package metadata for
35
+ │ hopcroft_skill_classification_tool_competition and configuration for tools like black
36
+
37
+ ├── references <- Data dictionaries, manuals, and all other explanatory materials.
38
+
39
+ ├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
40
+ │ └── figures <- Generated graphics and figures to be used in reporting
41
+
42
+ ├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
43
+ │ generated with `pip freeze > requirements.txt`
44
+
45
+ ├── setup.cfg <- Configuration file for flake8
46
+
47
+ └── hopcroft_skill_classification_tool_competition <- Source code for use in this project.
48
+
49
+ ├── __init__.py <- Makes hopcroft_skill_classification_tool_competition a Python module
50
+
51
+ ├── config.py <- Store useful variables and configuration
52
+
53
+ ├── dataset.py <- Scripts to download or generate data
54
+
55
+ ├── features.py <- Code to create features for modeling
56
+
57
+ ├── modeling
58
+ │ ├── __init__.py
59
+ │ ├── predict.py <- Code to run model inference with trained models
60
+ │ └── train.py <- Code to train models
61
+
62
+ └── plots.py <- Code to create visualizations
63
+ ```
64
+
65
+ --------
66
+
67
+ ## Setup
68
+
69
+ ### MLflow Credentials Configuration
70
+
71
+ Set up DagsHub credentials for MLflow tracking.
72
+
73
+ **Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens
74
+
75
+ #### Option 1: Using `.env` file (Recommended for local development)
76
+
77
+ ```bash
78
+ # Copy the template
79
+ cp .env.example .env
80
+
81
+ # Edit .env with your credentials
82
+ ```
83
+
84
+ Your `.env` file should contain:
85
+ ```
86
+ MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
87
+ MLFLOW_TRACKING_USERNAME=your_username
88
+ MLFLOW_TRACKING_PASSWORD=your_token
89
+ ```
90
+
91
+ > [!NOTE]
92
+ > The `.env` file is git-ignored for security. Never commit credentials to version control.
93
+
94
+ #### Option 2: Using Docker Compose
95
+
96
+ When using Docker Compose, the `.env` file is automatically loaded via `env_file` directive in `docker-compose.yml`.
97
+
98
+ ```bash
99
+ # Start the service (credentials loaded from .env)
100
+ docker compose up --build
101
+ ```
102
+
103
+ --------
104
+
105
+ ## CI Configuration
106
+
107
+ [![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)
108
+
109
+ This project uses GitHub Actions workflows, triggered automatically on pushes and pull requests, for Continuous Integration.
110
+
111
+ ### Secrets
112
+
113
+ To enable DVC model pulling, configure these Repository Secrets:
114
+
115
+ - `DAGSHUB_USERNAME`: DagsHub username.
116
+ - `DAGSHUB_TOKEN`: DagsHub access token.
117
+
118
+ --------
119
+
120
+ ## Milestone Summary
121
+
122
+ ### Milestone 1
123
+ We compiled the ML Canvas and defined:
124
+ - Problem: multi-label classification of skills for PR/issues.
125
+ - Stakeholders and business/research goals.
126
+ - Data sources (SkillScope DB) and constraints (no external classifiers).
127
+ - Success metrics (micro-F1, imbalance handling, experiment tracking).
128
+ - Risks (label imbalance, text noise, multi-label complexity) and mitigations.
129
+
130
+ ### Milestone 2
131
+ We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:
132
+
133
+ 1. Data Management
134
+ - DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.
135
+
136
+ 2. Data Ingestion & EDA
137
+ - `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
138
+ - Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).
139
+
140
+ 3. Feature Engineering
141
+ - `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).
142
+
143
+ 4. Central Config
144
+ - `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.
145
+
146
+ 5. Modeling & Experiments
147
+ - Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
148
+ - GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).
149
+
150
+ 6. Imbalance Handling
151
+ - Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.
152
+
153
+ 7. Tracking & Reproducibility
154
+ - Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).
155
+
156
+ 8. Tooling
157
+ - Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
158
+
159
+ ### Milestone 3 (QA)
160
+ We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:
161
+
162
+ 1. **Data Cleaning Pipeline**
163
+ - `data_cleaning.py`: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
164
+ - Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.
165
+
166
+ 2. **Great Expectations Validation** (10 tests)
167
+ - Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
168
+ - Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
169
+ - All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.
170
+
171
+ 3. **Deepchecks Validation** (24 checks across 2 suites)
172
+ - Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
173
+ - Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
174
+ - Cleaned data achieved production-ready status (96% overall score).
175
+
176
+ 4. **Behavioral Testing** (36 tests)
177
+ - Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
178
+ - Directional tests (10): keyword addition effects, technical detail impact on predictions.
179
+ - Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
180
+ - All tests passed; comprehensive report in `reports/behavioral/`.
181
+
182
+ 5. **Code Quality Analysis**
183
+ - Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
184
+ - PEP 8 compliant, Black compatible (line length 88).
185
+
186
+ 6. **Documentation**
187
+ - Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
188
+ - Behavioral testing README with test categories, usage examples, and extension guide.
189
+
190
+ 7. **Tooling**
191
+ - Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
192
+ - Automated test execution and report generation.
193
+
194
+ ### Milestone 4 (API)
195
+ We implemented a production-ready FastAPI service for skill prediction with MLflow integration:
196
+
197
+ #### Features
198
+ - **REST API Endpoints**:
199
+ - `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
200
+ - `GET /predictions/{run_id}` - Retrieve prediction by MLflow run ID
201
+ - `GET /predictions` - List recent predictions with pagination
202
+ - `GET /health` - Health check endpoint
203
+ - **Model Management**: Loads trained Random Forest + TF-IDF vectorizer from `models/`
204
+ - **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
205
+ - **Input Validation**: Pydantic models for request/response validation
206
+ - **Interactive Docs**: Auto-generated Swagger UI and ReDoc
207
+
208
+ #### API Usage
209
+
210
+ **1. Start the API Server**
211
+ ```bash
212
+ # Development mode (auto-reload)
213
+ make api-dev
214
+
215
+ # Production mode
216
+ make api-run
217
+ ```
218
+ Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)
219
+
220
+ **2. Test Endpoints**
221
+
222
+ **Option A: Swagger UI (Recommended)**
223
+ - Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
224
+ - Interactive interface to test all endpoints
225
+ - View request/response schemas
226
+
227
+ **Option B: Make Commands**
228
+ ```bash
229
+ # Test all endpoints
230
+ make test-api-all
231
+
232
+ # Individual endpoints
233
+ make test-api-health # Health check
234
+ make test-api-predict # Single prediction
235
+ make test-api-list # List predictions
236
+ ```
237
+
238
+ #### Prerequisites
239
+ - Trained model: `models/random_forest_tfidf_gridsearch.pkl`
240
+ - TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
241
+ - Label names: `models/label_names.pkl` (auto-saved during feature creation)
242
+
243
+ #### MLflow Integration
244
+ - All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
245
+ - Experiment: `skill_prediction_api`
246
+ - Tracked: input text, predictions, probabilities, metadata
247
+
248
+ #### Docker
249
+ Build and run the API in a container:
250
+ ```bash
251
+ docker build -t hopcroft-api .
252
+ docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
253
+ ```
254
+
255
+ Endpoints:
256
+ - Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
257
+ - Health check: [http://localhost:8080/health](http://localhost:8080/health)
258
+
259
+ ---
260
+
261
+ ## Docker Compose Usage
262
+
263
+ Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.
264
+
265
+ ### Prerequisites
266
+
267
+ 1. **Create your environment file:**
268
+ ```bash
269
+ cp .env.example .env
270
+ ```
271
+
272
+ 2. **Edit `.env`** with your actual credentials:
273
+ ```
274
+ MLFLOW_TRACKING_USERNAME=your_dagshub_username
275
+ MLFLOW_TRACKING_PASSWORD=your_dagshub_token
276
+ ```
277
+
278
+ Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)
279
+
280
+ ### Quick Start
281
+
282
+ #### 1. Build and Start All Services
283
+ Build both images and start the containers:
284
+ ```bash
285
+ docker-compose up -d --build
286
+ ```
287
+
288
+ | Flag | Description |
289
+ |------|-------------|
290
+ | `-d` | Run in detached mode (background) |
291
+ | `--build` | Rebuild images before starting (use when code/Dockerfile changes) |
292
+
293
+ **Available Services:**
294
+ - **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
295
+ - **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
296
+ - **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)
297
+
298
+ #### 2. Stop All Services
299
+ Stop and remove containers and networks:
300
+ ```bash
301
+ docker-compose down
302
+ ```
303
+
304
+ | Flag | Description |
305
+ |------|-------------|
306
+ | `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
307
+ | `--rmi all` | Also remove images: `docker-compose down --rmi all` |
308
+
309
+ #### 3. Restart Services
310
+ After updating `.env` or configuration files:
311
+ ```bash
312
+ docker-compose restart
313
+ ```
314
+
315
+ Or for a full restart with environment reload:
316
+ ```bash
317
+ docker-compose down
318
+ docker-compose up -d
319
+ ```
320
+
321
+ #### 4. Check Status
322
+ View the status of all running services:
323
+ ```bash
324
+ docker-compose ps
325
+ ```
326
+
327
+ Or use Docker commands:
328
+ ```bash
329
+ docker ps
330
+ ```
331
+
332
+ #### 5. View Logs
333
+ Tail logs from both services in real-time:
334
+ ```bash
335
+ docker-compose logs -f
336
+ ```
337
+
338
+ View logs from a specific service:
339
+ ```bash
340
+ docker-compose logs -f hopcroft-api
341
+ docker-compose logs -f hopcroft-gui
342
+ ```
343
+
344
+ | Flag | Description |
345
+ |------|-------------|
346
+ | `-f` | Follow log output (stream new logs) |
347
+ | `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |
348
+
349
+ #### 6. Execute Commands in Container
350
+ Open an interactive shell inside a running container:
351
+ ```bash
352
+ docker-compose exec hopcroft-api /bin/bash
353
+ docker-compose exec hopcroft-gui /bin/bash
354
+ ```
355
+
356
+ Examples of useful commands inside the API container:
357
+ ```bash
358
+ # Check installed packages
359
+ pip list
360
+
361
+ # Run Python interactively
362
+ python
363
+
364
+ # Check model file exists
365
+ ls -la /app/models/
366
+
367
+ # Verify environment variables
368
+ printenv | grep MLFLOW
369
+ ```
370
+
371
+
372
+ ### Architecture Overview
373
+
374
+ **Docker Compose orchestrates two services:**
375
+
376
+ ```
377
+ docker-compose.yml
378
+ ├── hopcroft-api (FastAPI Backend)
379
+ │ ├── Build: ./Dockerfile
380
+ │ ├── Port: 8080:8080
381
+ │ ├── Network: hopcroft-net
382
+ │ ├── Environment: .env (MLflow credentials)
383
+ │ ├── Volumes:
384
+ │ │ ├── ./hopcroft_skill_classification_tool_competition (hot reload)
385
+ │ │ └── hopcroft-logs:/app/logs (persistent logs)
386
+ │ └── Health Check: /health endpoint
387
+
388
+ ├── hopcroft-gui (Streamlit Frontend)
389
+ │ ├── Build: ./Dockerfile.streamlit
390
+ │ ├── Port: 8501:8501
391
+ │ ├── Network: hopcroft-net
392
+ │ ├── Environment: API_BASE_URL=http://hopcroft-api:8080
393
+ │ ├── Volumes:
394
+ │ │ └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
395
+ │ └── Depends on: hopcroft-api (waits for health check)
396
+
397
+ └── hopcroft-net (bridge network)
398
+ ```
399
+
400
+ **External Access:**
401
+ - API: http://localhost:8080
402
+ - GUI: http://localhost:8501
403
+
404
+ **Internal Communication:**
405
+ - GUI → API: http://hopcroft-api:8080 (via Docker network)
406
+
407
+ ### Services Description
408
+
409
+ **hopcroft-api (FastAPI Backend)**
410
+ - Purpose: FastAPI backend serving the ML model for skill classification
411
+ - Image: Built from `Dockerfile`
412
+ - Port: 8080 (maps to host 8080)
413
+ - Features:
414
+ - Random Forest model with embedding features
415
+ - MLflow experiment tracking
416
+ - Auto-reload in development mode
417
+ - Health check endpoint
418
+
419
+ **hopcroft-gui (Streamlit Frontend)**
420
+ - Purpose: Streamlit web interface for interactive predictions
421
+ - Image: Built from `Dockerfile.streamlit`
422
+ - Port: 8501 (maps to host 8501)
423
+ - Features:
424
+ - User-friendly interface for skill prediction
425
+ - Real-time communication with API
426
+ - Automatic reconnection on API restart
427
+ - Depends on API health before starting
428
+
429
+ ### Development vs Production
430
+
431
+ **Development (default):**
432
+ - Auto-reload enabled (`--reload`)
433
+ - Source code mounted with bind mounts
434
+ - Custom command with hot reload
435
+ - GUI → API via Docker network
436
+
437
+ **Production:**
438
+ - Auto-reload disabled
439
+ - Use built image only
440
+ - Use Dockerfile's CMD
441
+ - GUI → API via Docker network
442
+
443
+ For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.
444
+
445
+ ### Troubleshooting
446
+
447
+ #### Issue: GUI shows "API is not available"
448
+ **Solution:**
449
+ 1. Wait 30-60 seconds for API to fully initialize and become healthy
450
+ 2. Refresh the GUI page (F5)
451
+ 3. Check API health: `curl http://localhost:8080/health`
452
+ 4. Check logs: `docker-compose logs hopcroft-api`
453
+
454
+ #### Issue: "500 Internal Server Error" on predictions
455
+ **Solution:**
456
+ 1. Verify MLflow credentials in `.env` are correct
457
+ 2. Restart services: `docker-compose down && docker-compose up -d`
458
+ 3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`
459
+
460
+ #### Issue: Changes to code not reflected
461
+ **Solution:**
462
+ - For Python code changes: Auto-reload is enabled, wait a few seconds
463
+ - For Dockerfile changes: Rebuild with `docker-compose up -d --build`
464
+ - For `.env` changes: Restart with `docker-compose down && docker-compose up -d`
465
+
466
+ #### Issue: Port already in use
467
+ **Solution:**
468
+ ```bash
469
+ # Check what's using the port
470
+ netstat -ano | findstr :8080
471
+ netstat -ano | findstr :8501
472
+
473
+ # Stop existing containers
474
+ docker-compose down
475
+
476
+ # Or change ports in docker-compose.yml
477
+ ```
478
+
479
+
480
+ ## Demo UI (Streamlit)
481
+
482
+ The Streamlit GUI provides an interactive web interface for the skill classification API.
483
+
484
+ ### Features
485
+ - Real-time skill prediction from GitHub issue text
486
+ - Top-5 predicted skills with confidence scores
487
+ - Full predictions table with all skills
488
+ - API connection status indicator
489
+ - Responsive design
490
+
491
+ ### Usage
492
+ 1. Ensure both services are running: `docker-compose up -d`
493
+ 2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
494
+ 3. Enter a GitHub issue description in the text area
495
+ 4. Click "Predict Skills" to get predictions
496
+ 5. View results in the predictions table
497
+
498
+ ### Architecture
499
+ - **Frontend**: Streamlit (Python web framework)
500
+ - **Communication**: HTTP requests to FastAPI backend via Docker network
501
+ - **Independence**: GUI and API run in separate containers
502
+ - **Auto-reload**: GUI code changes are reflected immediately (bind mount)
503
+ > Both must run **simultaneously** in different terminals/containers.
504
+
505
+ ### Quick Start
506
+
507
+ 1. **Start the FastAPI backend:**
508
+ ```bash
509
+ fastapi dev hopcroft_skill_classification_tool_competition/main.py
510
+ ```
511
+
512
+ 2. **In a new terminal, start Streamlit:**
513
+ ```bash
514
+ streamlit run streamlit_app.py
515
+ ```
516
+
517
+ 3. **Open your browser:**
518
+ - Streamlit UI: http://localhost:8501
519
+ - FastAPI Docs: http://localhost:8000/docs
520
+
521
+ ### Features
522
+
523
+ - Interactive web interface for skill prediction
524
+ - Real-time predictions with confidence scores
525
+ - Adjustable confidence threshold
526
+ - Multiple input modes (quick/detailed/examples)
527
+ - Visual result display
528
+ - API health monitoring
529
+
530
+ ### Demo Walkthrough
531
+
532
+ #### Main Dashboard
533
+
534
+ ![gui_main_dashboard](docs/img/gui_main_dashboard.png)
535
+
536
+ The main interface provides:
537
+ - **Sidebar**: API health status, confidence threshold slider, model info
538
+ - **Three input modes**: Quick Input, Detailed Input, Examples
539
+ #### Quick Input Mode
540
+
541
+ ![gui_quick_input](docs/img/gui_quick_input.png)
542
+ Simply paste your GitHub issue text and click "Predict Skills"!
543
+
544
+ #### Prediction Results
545
+ ![gui_detailed](docs/img/gui_detailed.png)
546
+ View:
547
+ - **Top predictions** with confidence scores
548
+ - **Full predictions table** with filtering
549
+ - **Processing metrics** (time, model version)
550
+ - **Raw JSON response** (expandable)
551
+
552
+ #### Detailed Input Mode
553
+
554
+ ![gui_detailed_input](docs/img/gui_detailed_input.png)
555
+ Add optional metadata:
556
+ - Repository name
557
+ - PR number
558
+ - Detailed description
559
+
560
+ #### Example Gallery
561
+ ![gui_ex](docs/img/gui_ex.png)
562
+
563
+ Test with pre-loaded examples:
564
+ - Authentication bugs
565
+ - ML features
566
+ - Database issues
567
+ - UI enhancements
568
+
569
+
570
+ ### Usage
571
+
572
+ 1. Enter GitHub issue/PR text in the input area
573
+ 2. (Optional) Add description, repo name, PR number
574
+ 3. Click "Predict Skills"
575
+ 4. View results with confidence scores
576
+ 5. Adjust threshold slider to filter predictions
data/.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ /raw
data/README.md ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - software-engineering
6
+ - multi-label-classification
7
+ - pull-requests
8
+ - skills
9
+ license: mit
10
+ ---
11
+
12
+ # Dataset Card for SkillScope Dataset
13
+
14
+ ## Dataset Details
15
+
16
+ - **Name:** SkillScope Dataset (NLBSE Tool Competition)
17
+ - **Repository:** [NLBSE/SkillCompetition](https://huggingface.co/datasets/NLBSE/SkillCompetition)
18
+ - **Version:** 1.0 (Processed for Hopcroft Project)
19
+ - **Type:** Tabular / Text / Code
20
+ - **Task:** Multi-label Classification
21
+ - **Maintainers:** se4ai2526-uniba (Hopcroft Project Team)
22
+
23
+ ## Intended Use
24
+
25
+ ### Primary Intended Uses
26
+ - Training and evaluating multi-label classification models to predict required skills for resolving GitHub issues/Pull Requests.
27
+ - Analyzing the relationship between issue characteristics (title, body, code changes) and developer skills.
28
+ - Benchmarking feature extraction techniques (TF-IDF vs. Embeddings) in Software Engineering contexts.
29
+
30
+ ### Out-of-Scope Use Cases
31
+ - Profiling individual developers (the dataset focuses on issues/PRs, not user profiling).
32
+ - General purpose code generation.
33
+
34
+ ## Dataset Contents
35
+
36
+ The dataset consists of merged Pull Requests from 11 Java repositories.
37
+
38
+ - **Total Samples (Raw):** 7,245 merged PRs
39
+ - **Source Files:** 57,206
40
+ - **Methods:** 59,644
41
+ - **Classes:** 13,097
42
+ - **Labels:** 217 distinct skill labels (domain/sub-domain pairs)
43
+
44
+ ### Schema
45
+ The data is stored in a SQLite database (`skillscope_data.db`) with the following main structures:
46
+ - `nlbse_tool_competition_data_by_issue`: Main table containing PR features (title, description, file paths) and skill labels.
47
+ - `vw_nlbse_tool_competition_data_by_file`: View providing file-level granularity.
48
+
49
+ ## Context and Motivation
50
+
51
+ ### Motivation
52
+ This dataset was created for the NLBSE (Natural Language-based Software Engineering) Tool Competition to foster research in automating skill identification in software maintenance. Accurately identifying required skills for an issue can help in automatic expert recommendation and task assignment.
53
+
54
+ ### Context
55
+ The data is derived from open-source Java projects on GitHub. It represents real-world development scenarios where developers describe issues and implement fixes.
56
+
57
+ ## Dataset Creation and Preprocessing
58
+
59
+ ### Source Data
60
+ The raw data is downloaded from the Hugging Face Hub (`NLBSE/SkillCompetition`).
61
+
62
+ ### Preprocessing Steps (Hopcroft Project)
63
+ To ensure data quality for modeling, the following preprocessing steps are applied (via `data_cleaning.py`):
64
+
65
+ 1. **Duplicate Removal:** ~6.5% of samples were identified as duplicates and removed.
66
+ 2. **Conflict Resolution:** ~8.9% of samples had conflicting labels for identical features; resolved using majority voting.
67
+ 3. **Rare Label Removal:** Labels with fewer than 5 occurrences were removed to ensure valid cross-validation.
68
+ 4. **Feature Extraction:**
69
+ - **Text Cleaning:** Removal of URLs, HTML, Markdown, and normalization.
70
+ - **TF-IDF:** Uni-grams and bi-grams (max 5000 features).
71
+ - **Embeddings:** Sentence embeddings using `all-MiniLM-L6-v2`.
72
+ 5. **Splitting:** 80/20 Train/Test split using `MultilabelStratifiedShuffleSplit` to maintain label distribution and prevent data leakage.
73
+
74
+ ## Considerations
75
+
76
+ ### Ethical Considerations
77
+ - **Privacy:** The data comes from public GitHub repositories. No private or sensitive personal information is explicitly included, though developer names/IDs might be present in metadata.
78
+ - **Bias:** The dataset is limited to Java repositories, so models may not generalize to other programming languages or ecosystems.
79
+
80
+ ### Caveats and Recommendations
81
+ - **Label Imbalance:** The dataset is highly imbalanced (long-tail distribution of skills). Techniques like MLSMOTE or ADASYN are recommended.
82
+ - **Multi-label Nature:** Most samples have multiple labels; evaluation metrics should account for this (e.g., Micro-F1).
83
+ - **Text Noise:** PR descriptions can be noisy or sparse; robust preprocessing is essential.
data/processed/.gitignore ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ /tfidf
2
+ /embedding
data/processed/embedding.dvc ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ outs:
2
+ - md5: d388f4e3ebe391bf4393cd327a16a1bb.dir
3
+ size: 64320416
4
+ nfiles: 12
5
+ hash: md5
6
+ path: embedding
data/processed/tfidf.dvc ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 038f64e03853a832a891a854146c429d.dir
3
+ nfiles: 11
4
+ hash: md5
5
+ path: tfidf
6
+ size: 199262804
data/raw.dvc ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 9a91536f747b03a232c8cfd354393541.dir
3
+ size: 440922112
4
+ nfiles: 1
5
+ hash: md5
6
+ path: raw
docker-compose.yml ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ services:
2
+ hopcroft-api:
3
+ build:
4
+ context: .
5
+ dockerfile: Dockerfile
6
+ container_name: hopcroft-api
7
+ ports:
8
+ - "8080:8080"
9
+ env_file:
10
+ - .env
11
+ environment:
12
+ - PROJECT_NAME=Hopcroft
13
+ volumes:
14
+ # Bind mount: enables live code reloading for development
15
+ - ./hopcroft_skill_classification_tool_competition:/app/hopcroft_skill_classification_tool_competition
16
+ # Named volume: persistent storage for application logs
17
+ - hopcroft-logs:/app/logs
18
+ networks:
19
+ - hopcroft-net
20
+ # Override CMD for development with auto-reload
21
+ command: >
22
+ uvicorn hopcroft_skill_classification_tool_competition.main:app --host 0.0.0.0 --port 8080 --reload
23
+ restart: unless-stopped
24
+ healthcheck:
25
+ test: [ "CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)" ]
26
+ interval: 30s
27
+ timeout: 10s
28
+ retries: 3
29
+ start_period: 60s
30
+
31
+ hopcroft-gui:
32
+ build:
33
+ context: .
34
+ dockerfile: Dockerfile.streamlit
35
+ container_name: hopcroft-gui
36
+ ports:
37
+ - "8501:8501"
38
+ environment:
39
+ - API_BASE_URL=http://hopcroft-api:8080
40
+ volumes:
41
+ # Bind mount for development hot-reload
42
+ - ./hopcroft_skill_classification_tool_competition/streamlit_app.py:/app/streamlit_app.py
43
+ networks:
44
+ - hopcroft-net
45
+ depends_on:
46
+ hopcroft-api:
47
+ condition: service_healthy
48
+ restart: unless-stopped
49
+
50
+ networks:
51
+ hopcroft-net:
52
+ driver: bridge
53
+
54
+ volumes:
55
+ hopcroft-logs:
56
+ driver: local
docs/.gitkeep ADDED
File without changes
docs/ML Canvas.md ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Machine Learning Canvas
2
+
3
+ | Designed for | Designed by | Date | Iteration |
4
+ |---|---|---|---|
5
+ | NLBSE 2026 | Team Hopcroft | 13/10/2025 | 1 |
6
+
7
+ ## PREDICTION TASK
8
+ The prediction task is a multi-label classification aimed at identifying the technical skills required to resolve a specific software issue. The input for the model is a dataset extracted from a GitHub pull request, which includes textual features (like the issue description), code-context information, and other metadata. The output is a set of one or more skill labels, chosen from a predefined set of 217 skills, representing the technical domains and sub-domains (e.g., "database," "security," "UI") needed for the resolution.
9
+
10
+ ## DECISIONS
11
+ The predictions are used to make crucial operational decisions in software project management. The value for the end-user, such as a project manager or team lead, lies in the ability to automatically assign new issues to the most suitable developers—those who possess the skills identified by the model. This optimizes resource allocation, accelerates resolution times, and improves the overall efficiency of the development team.
12
+
13
+ ## VALUE PROPOSITION
14
+ The Machine Learning system is designed for project managers and developers, aiming to optimize task assignment. By automatically predicting the technical skills (domains and sub-domains) required to resolve GitHub issues, the system ensures that each task is assigned to the most qualified developer.
15
+ The primary value lies in a significant increase in the efficiency of the development process, leading to reduced resolution times and improved software quality.
16
+
17
+ ## DATA COLLECTION
18
+ The core data was collected by the competition organizers through a mining process on historical GitHub pull requests. This process involved sourcing the issue text and associated source code from tasks that were already completed and merged. Each issue in the dataset then underwent a rigorous, automated labeling protocol, where skill labels (domains and sub-domains) were annotated based on the specific API calls detected within the source code. Due to the nature of software development tasks, the resulting dataset faces a significant class imbalance issue, with certain skill labels appearing far more frequently than others.
19
+
20
+ ## DATA SOURCES
21
+ The ML system will leverage the official NLBSE’26 Skill Classification dataset, a comprehensive corpus released by the competition organizers. This dataset is sourced from 11 popular Java repositories and comprises 7,245 merged pull requests annotated with 217 distinct skill labels. /
22
+ All foundational data is provided in a SQLite database (`skillscope_data.db`), with the `nlbse_tool_competition_data_by_issue` table serving as the primary source for model training. The competition framework also permits the use of external GitHub APIs for supplementary data.
23
+
24
+ ## IMPACT SIMULATION
25
+ The model's impact is validated by outperforming the specific "SkillScope Random Forest + TF-IDF" baseline on precision, recall, or micro-F1 scores. This evaluation is performed using the provided SQLite database of labeled pull requests as the ground truth to ensure measurable and superior performance.
26
+
27
+ ## MAKING PREDICTIONS
28
+ As soon as a new issue is created, the system analyzes it in real-time to understand which technical skills are needed. Instead of waiting for a manual assignment, the system sends the task directly to the most suitable developer. This automated process is so fast that it ensures the right expert can start working on the problem without any delay.
29
+
30
+ ## BUILDING MODELS
31
+ The ML system will start with the competition’s baseline multi-label classifier, which predicts the domains and sub-domains representing the skills needed for each issue. Model development will focus on iterative improvements to enhance the specified performance metrics.
32
+ A new model will be trained until it achieves a statistically significant improvement in precision, recall, or micro-F1 score over the initial baseline, without degradation in the other metrics.
33
+ Training will occur offline, with computational needs scaling by model complexity and data volume.
34
+
35
+ ## FEATURES
36
+ Only the most important, non-null, and directly functional features will be selected. Textual data, such as the issue title and description, will be represented using established NLP techniques. We will also utilize numerical features, including the pull request number and the calculated issue duration. Skills will be encoded as binary multi-label vectors, and all features will be normalized to optimize model performance throughout iterative development cycles.
37
+
38
+ ## MONITORING
39
+ System quality will be assessed by comparing the model's skill predictions with the actual skills used by developers to resolve issues. Performance will be continuously monitored using key metrics (precision, recall, micro-F1 score). To detect data drift, the model will be periodically evaluated on new, recent data; a significant drop in these metrics will indicate the need for retraining. The system's value is measured according to the competition's criteria: the primary value is the increase in the micro-F1 score (∆micro-F1) over the baseline, without worsening precision and recall. Computational efficiency (runtime) serves as a secondary value metric.
docs/README.md ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Generating the docs
2
+ ----------
3
+
4
+ Use [mkdocs](http://www.mkdocs.org/) structure to update the documentation.
5
+
6
+ Build locally with:
7
+
8
+ mkdocs build
9
+
10
+ Serve locally with:
11
+
12
+ mkdocs serve
docs/docs/getting-started.md ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ Getting started
2
+ ===============
3
+
4
+ This is where you describe how to get set up on a clean install, including the
5
+ commands necessary to get the raw data (using the `sync_data_from_s3` command,
6
+ for example), and then how to make the cleaned, final data sets.
docs/docs/index.md ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ # Hopcroft_Skill-Classification-Tool-Competition documentation!
2
+
3
+ ## Description
4
+
5
+ The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
6
+
7
+ ## Commands
8
+
9
+ The Makefile contains the central entry points for common tasks related to this project.
10
+
docs/mkdocs.yml ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ site_name: Hopcroft_Skill-Classification-Tool-Competition
2
+ #
3
+ site_author: Team Hopcroft
4
+ #
docs/testing_and_validation.md ADDED
@@ -0,0 +1,208 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Testing and Validation Documentation
2
+
3
+ This document provides a comprehensive and detailed overview of the testing and validation strategies employed in the Hopcroft project. It consolidates all technical details, execution commands, and analysis reports from Behavioral Testing, Deepchecks, Great Expectations, and Ruff.
4
+
5
+ ---
6
+
7
+ ## 1. Behavioral Testing
8
+
9
+ **Report Source:** `reports/behavioral/`
10
+ **Status:** All Tests Passed (36/36)
11
+ **Last Run:** November 15, 2025
12
+ **Model:** Random Forest + TF-IDF (SMOTE oversampling)
13
+ **Execution Time:** ~8 minutes
14
+
15
+ Behavioral testing evaluates the model's capabilities and robustness beyond simple accuracy metrics.
16
+
17
+ ### Test Categories & Results
18
+
19
+ | Category | Tests | Status | Description |
20
+ |----------|-------|--------|-------------|
21
+ | **Invariance Tests** | 9 | **Passed** | Ensure model predictions remain stable under perturbations that shouldn't affect the outcome (e.g., changing variable names, minor typos). |
22
+ | **Directional Tests** | 10 | **Passed** | Verify that specific changes to the input cause expected changes in the output (e.g., adding specific keywords should increase probability of related skills). |
23
+ | **Minimum Functionality Tests** | 17 | **Passed** | Check basic capabilities and sanity checks (e.g., simple inputs produce valid outputs). |
24
+
25
+ ### Technical Notes
26
+ - **Training Tests Excluded:** `test_model_training.py` was excluded from the run due to a missing PyTorch dependency in the environment, but the inference tests cover the model's behavior fully.
27
+ - **Robustness:** The model demonstrates excellent consistency across all 36 behavioral scenarios.
28
+
29
+ ### How to Regenerate
30
+ To run the behavioral tests and generate the JSON report:
31
+
32
+ ```bash
33
+ python -m pytest tests/behavioral/ \
34
+ --ignore=tests/behavioral/test_model_training.py \
35
+ --json-report \
36
+ --json-report-file=reports/behavioral/behavioral_tests_report.json \
37
+ -v
38
+ ```
39
+
40
+ ---
41
+
42
+ ## 2. Deepchecks Validation
43
+
44
+ **Report Source:** `reports/deepchecks/`
45
+ **Status:** Cleaned Data is Production-Ready (Score: 96%)
46
+ **Last Run:** November 16, 2025
47
+
48
+ Deepchecks was used to validate the integrity of the dataset before and after cleaning. The validation process confirmed that the `data_cleaning.py` pipeline successfully resolved critical data quality issues.
49
+
50
+ ### Dataset Statistics: Before vs. After Cleaning
51
+
52
+ | Metric | Before Cleaning | After Cleaning | Difference |
53
+ |--------|-----------------|----------------|------------|
54
+ | **Total Samples** | 7,154 | 6,673 | -481 duplicates (6.72%) |
55
+ | **Duplicates** | 481 | **0** | **RESOLVED** |
56
+ | **Data Leakage** | Present | **0 samples** | **RESOLVED** |
57
+ | **Label Conflicts** | Present | **0** | **RESOLVED** |
58
+ | **Train/Test Split** | N/A | 5,338 / 1,335 | 80/20 Stratified |
59
+
60
+ ### Validation Suites Detailed Results
61
+
62
+ #### A. Data Integrity Suite (12 checks)
63
+ **Score:** 92% (7 Passed, 2 Non-Critical Failures, 2 Null)
64
+
65
+ * **PASSED:** Data Duplicates (0), Conflicting Labels (0), Mixed Nulls, Mixed Data Types, String Mismatch, String Length, Feature Label Correlation.
66
+ * **FAILED (Non-Critical/Acceptable):**
67
+ 1. **Single Value in Column:** Some TF-IDF features are all zeros.
68
+ 2. **Feature-Feature Correlation:** High correlation between features.
69
+
70
+ #### B. Train-Test Validation Suite (12 checks)
71
+ **Score:** 100% (12 Passed)
72
+
73
+ * **PASSED (CRITICAL):** **Train Test Samples Mix (0 leakage)**.
74
+ * **PASSED:** Datasets Size Comparison (80/20), New Label in Test (0), Feature Drift (< 0.025), Label Drift (0.0), Multivariate Drift.
75
+
76
+ ### Interpretation of Results & Important Notes
77
+
78
+ The validation identified two "failures" that are actually **expected behavior** for this type of data:
79
+
80
+ 1. **Features with Only Zeros (Non-Critical):**
81
+ * *Reason:* TF-IDF creates sparse features. If a specific word (feature) never appears in the specific subset being tested, its column will be all zeros.
82
+ * *Impact:* None. The model simply ignores these features.
83
+
84
+ 2. **High Feature Correlation (Non-Critical):**
85
+ * *Reason:* Linguistic terms naturally co-occur (e.g., "machine" and "learning", "python" and "code").
86
+ * *Impact:* Slight multicollinearity, which Random Forest handles well.
87
+
88
+ ### Recommendations & Next Steps
89
+ 1. **Model Retraining:** Now that the data is cleaned and leakage-free, the models should be retrained to obtain reliable performance metrics.
90
+ 2. **Continuous Monitoring:** Use `run_all_deepchecks.py` in CI/CD pipelines to prevent regression.
91
+
92
+ ### How to Use the Tests
93
+
94
+ **Run Complete Validation (Recommended):**
95
+ ```bash
96
+ python tests/deepchecks/run_all_deepchecks.py
97
+ ```
98
+
99
+ **Run Specific Suites:**
100
+ ```bash
101
+ # Data Integrity Only
102
+ python tests/deepchecks/test_data_integrity.py
103
+
104
+ # Train-Test Validation Only
105
+ python tests/deepchecks/test_train_test_validation.py
106
+
107
+ # Compare Original vs Cleaned
108
+ python tests/deepchecks/run_all_tests_comparison.py
109
+ ```
110
+
111
+ ---
112
+
113
+ ## 3. Great Expectations Data Validation
114
+
115
+ **Report Source:** `tests/great expectations/`
116
+ **Status:** All 10 Tests Passed on Cleaned Data
117
+
118
+ Great Expectations provides a rigorous suite of 10 tests to validate the data pipeline at various stages.
119
+
120
+ ### Detailed Test Descriptions
121
+
122
+ #### TEST 1: Raw Database Validation
123
+ * **Purpose:** Validates integrity/schema of `nlbse_tool_competition_data_by_issue` table. Ensures data source integrity before expensive feature engineering.
124
+ * **Checks:** Row count (7000-10000), Column count (220-230), Required columns present.
125
+ * **Result:** **PASS**. Schema is valid.
126
+
127
+ #### TEST 2: TF-IDF Feature Matrix Validation
128
+ * **Purpose:** Validates statistical properties of TF-IDF features. Ensures feature matrix is suitable for ML algorithms.
129
+ * **Checks:** No NaN/Inf, values >= 0, at least 1 non-zero feature per sample.
130
+ * **Original Data:** **FAIL** (25 samples had 0 features due to empty text).
131
+ * **Cleaned Data:** **PASS** (Sparse samples removed).
132
+
133
+ #### TEST 3: Multi-Label Binary Format Validation
134
+ * **Purpose:** Ensures label matrix is binary {0,1} for MultiOutputClassifier. Missing labels would invalidate training.
135
+ * **Checks:** Values in {0,1}, correct dimensions.
136
+ * **Result:** **PASS**.
137
+
138
+ #### TEST 4: Feature-Label Consistency Validation
139
+ * **Purpose:** Validates alignment between X and Y matrices. Misalignment causes catastrophic training failures.
140
+ * **Checks:** Row counts match, no empty vectors.
141
+ * **Original Data:** **FAIL** (Empty feature vectors present).
142
+ * **Cleaned Data:** **PASS** (Perfect alignment).
143
+
144
+ #### TEST 5: Label Distribution & Stratification
145
+ * **Purpose:** Ensures labels have enough samples for stratified splitting. Labels with insufficient samples cause stratification failures.
146
+ * **Checks:** Min 5 occurrences per label.
147
+ * **Original Data:** **FAIL** (75 labels had 0 occurrences).
148
+ * **Cleaned Data:** **PASS** (Rare labels removed).
149
+
150
+ #### TEST 6: Feature Sparsity & SMOTE Compatibility
151
+ * **Purpose:** Ensures feature density is sufficient for nearest-neighbor algorithms (SMOTE/ADASYN).
152
+ * **Checks:** Min 10 non-zero features per sample.
153
+ * **Original Data:** **FAIL** (31.5% samples < 10 features).
154
+ * **Cleaned Data:** **PASS** (Incompatible samples removed).
155
+
156
+ #### TEST 7: Multi-Output Classifier Compatibility
157
+ * **Purpose:** Validates multi-label structure. Insufficient multi-label samples would indicate inappropriate architecture.
158
+ * **Checks:** >50% samples have multiple labels.
159
+ * **Result:** **PASS** (Strong multi-label characteristics).
160
+
161
+ #### TEST 8: Duplicate Samples Detection
162
+ * **Purpose:** Detects duplicate feature vectors to prevent leakage.
163
+ * **Original Data:** **FAIL** (481 duplicates found).
164
+ * **Cleaned Data:** **PASS** (0 duplicates).
165
+
166
+ #### TEST 9: Train-Test Separation Validation
167
+ * **Purpose:** **CRITICAL**. Validates no data leakage between train and test sets.
168
+ * **Checks:** Intersection of Train and Test sets must be empty.
169
+ * **Result:** **PASS** (Cleaned data only).
170
+
171
+ #### TEST 10: Label Consistency Validation
172
+ * **Purpose:** Ensures identical features have identical labels. Inconsistency indicates ground truth errors.
173
+ * **Original Data:** **FAIL** (640 samples with conflicting labels).
174
+ * **Cleaned Data:** **PASS** (Resolved via majority voting).
175
+
176
+ ### Running the Tests
177
+ ```bash
178
+ python "tests/great expectations/test_gx.py"
179
+ ```
180
+
181
+ ---
182
+
183
+ ## 4. Ruff Code Quality Analysis
184
+
185
+ **Report Source:** `reports/ruff/`
186
+ **Status:** All Issues Resolvable
187
+ **Last Analysis:** November 17, 2025
188
+ **Total Issues:** 28
189
+
190
+ Static code analysis was performed using Ruff to ensure code quality and adherence to PEP 8 standards.
191
+
192
+ ### Issue Breakdown by File
193
+
194
+ | File | Issues | Severity | Key Findings |
195
+ |------|--------|----------|--------------|
196
+ | `data_cleaning.py` | 16 | Low/Med | Unsorted imports (I001), Unused imports (F401), f-strings without placeholders (F541), Comparison to False (E712). |
197
+ | `modeling/train.py` | 7 | Low/Med | Unused `SMOTE` import, Unused variable `n_labels`, f-strings. |
198
+ | `features.py` | 2 | Low | Unused `nltk` import. |
199
+ | `dataset.py` | 2 | Low | Unused `DB_PATH` import. |
200
+ | `mlsmote.py` | 1 | Low | Unsorted imports. |
201
+
202
+ ### Configuration & Compliance
203
+ * **Command:** `ruff check . --output-format json --output-file reports/ruff/ruff_report.json`
204
+ * **Standards:** PEP 8 (Pass), Black compatible (line length 88), isort (Pass).
205
+ * **Fixability:** 100% of issues can be fixed (26 automatically, 2 manually).
206
+
207
+ ### Conclusion
208
+ The project code quality is high, with only minor style and import issues that do not affect functionality but should be cleaned up for maintainability.
hopcroft_skill_classification_tool_competition/__init__.py ADDED
File without changes
hopcroft_skill_classification_tool_competition/api_models.py ADDED
@@ -0,0 +1,221 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Pydantic models for API data validation.
3
+
4
+ Defines request and response schemas with validation rules.
5
+ """
6
+
7
+ from datetime import datetime
8
+ from typing import Optional
9
+
10
+ from pydantic import BaseModel, ConfigDict, Field, field_serializer, field_validator
11
+
12
+
13
+ class IssueInput(BaseModel):
14
+ """Input model for GitHub issue or pull request classification."""
15
+
16
+ issue_text: str = Field(
17
+ ...,
18
+ min_length=1,
19
+ description="Issue title text",
20
+ examples=["Fix bug in authentication module"],
21
+ )
22
+ issue_description: Optional[str] = Field(
23
+ default=None,
24
+ description="Issue body text",
25
+ examples=["The authentication module fails when handling expired tokens"],
26
+ )
27
+ repo_name: Optional[str] = Field(
28
+ default=None, description="Repository name", examples=["user/repo-name"]
29
+ )
30
+ pr_number: Optional[int] = Field(
31
+ default=None, ge=1, description="Pull request number", examples=[123]
32
+ )
33
+ created_at: Optional[datetime] = Field(
34
+ default=None, description="Issue creation timestamp", examples=["2024-01-15T10:30:00Z"]
35
+ )
36
+ author_name: Optional[str] = Field(
37
+ default=None, description="Issue author username", examples=["johndoe"]
38
+ )
39
+
40
+ @field_validator("issue_text", "issue_description")
41
+ @classmethod
42
+ def clean_text(cls, v: Optional[str]) -> Optional[str]:
43
+ """Validate and clean text fields."""
44
+ if v is None:
45
+ return v
46
+ v = v.strip()
47
+ if not v:
48
+ raise ValueError("Text cannot be empty or whitespace only")
49
+ return v
50
+
51
+ model_config = ConfigDict(
52
+ json_schema_extra={
53
+ "example": {
54
+ "issue_text": "Add support for OAuth authentication",
55
+ "issue_description": "Implement OAuth 2.0 flow for third-party providers",
56
+ "repo_name": "myorg/myproject",
57
+ "pr_number": 456,
58
+ "author_name": "developer123",
59
+ }
60
+ }
61
+ )
62
+
63
+
64
+ class SkillPrediction(BaseModel):
65
+ """Single skill prediction with confidence score."""
66
+
67
+ skill_name: str = Field(
68
+ ...,
69
+ description="Name of the predicted skill (domain/subdomain)",
70
+ examples=["Language/Java", "DevOps/CI-CD"],
71
+ )
72
+ confidence: float = Field(
73
+ ..., ge=0.0, le=1.0, description="Confidence score (0.0 to 1.0)", examples=[0.85]
74
+ )
75
+
76
+ model_config = ConfigDict(
77
+ json_schema_extra={"example": {"skill_name": "Language/Java", "confidence": 0.92}}
78
+ )
79
+
80
+
81
+ class PredictionResponse(BaseModel):
82
+ """Response model for skill classification predictions."""
83
+
84
+ predictions: list[SkillPrediction] = Field(
85
+ default_factory=list, description="List of predicted skills with confidence scores"
86
+ )
87
+ num_predictions: int = Field(
88
+ ..., ge=0, description="Total number of predicted skills", examples=[5]
89
+ )
90
+ model_version: str = Field(default="1.0.0", description="Model version", examples=["1.0.0"])
91
+ processing_time_ms: Optional[float] = Field(
92
+ default=None, ge=0.0, description="Processing time in milliseconds", examples=[125.5]
93
+ )
94
+
95
+ model_config = ConfigDict(
96
+ json_schema_extra={
97
+ "example": {
98
+ "predictions": [
99
+ {"skill_name": "Language/Java", "confidence": 0.92},
100
+ {"skill_name": "DevOps/CI-CD", "confidence": 0.78},
101
+ ],
102
+ "num_predictions": 2,
103
+ "model_version": "1.0.0",
104
+ "processing_time_ms": 125.5,
105
+ }
106
+ }
107
+ )
108
+
109
+
110
+ class BatchIssueInput(BaseModel):
111
+ """Input model for batch prediction."""
112
+
113
+ issues: list[IssueInput] = Field(
114
+ ...,
115
+ min_length=1,
116
+ max_length=100,
117
+ description="Issues to classify (max 100)",
118
+ )
119
+
120
+ model_config = ConfigDict(
121
+ json_schema_extra={
122
+ "example": {
123
+ "issues": [
124
+ {
125
+ "issue_text": "Fix authentication bug",
126
+ "issue_description": "Users cannot login with OAuth",
127
+ },
128
+ {
129
+ "issue_text": "Add database migration",
130
+ "issue_description": "Create migration for new user table",
131
+ },
132
+ ]
133
+ }
134
+ }
135
+ )
136
+
137
+
138
+ class BatchPredictionResponse(BaseModel):
139
+ """Response model for batch predictions."""
140
+
141
+ results: list[PredictionResponse] = Field(
142
+ default_factory=list, description="Prediction results, one per issue"
143
+ )
144
+ total_issues: int = Field(..., ge=0, description="Number of issues processed", examples=[2])
145
+ total_processing_time_ms: Optional[float] = Field(
146
+ default=None, ge=0.0, description="Processing time in milliseconds", examples=[250.0]
147
+ )
148
+
149
+ model_config = ConfigDict(
150
+ json_schema_extra={
151
+ "example": {
152
+ "results": [
153
+ {
154
+ "predictions": [{"skill_name": "Language/Java", "confidence": 0.92}],
155
+ "num_predictions": 1,
156
+ "model_version": "1.0.0",
157
+ }
158
+ ],
159
+ "total_issues": 2,
160
+ "total_processing_time_ms": 250.0,
161
+ }
162
+ }
163
+ )
164
+
165
+
166
+ class ErrorResponse(BaseModel):
167
+ """Error response model."""
168
+
169
+ error: str = Field(..., description="Error message", examples=["Invalid input"])
170
+ detail: Optional[str] = Field(
171
+ default=None, description="Detailed error", examples=["Field 'issue_text' is required"]
172
+ )
173
+ timestamp: datetime = Field(default_factory=datetime.now, description="Error timestamp")
174
+
175
+ @field_serializer("timestamp")
176
+ def serialize_timestamp(self, value: datetime) -> str:
177
+ return value.isoformat()
178
+
179
+ model_config = ConfigDict(
180
+ json_schema_extra={
181
+ "example": {
182
+ "error": "Validation Error",
183
+ "detail": "issue_text: field required",
184
+ "timestamp": "2024-01-15T10:30:00Z",
185
+ }
186
+ }
187
+ )
188
+
189
+
190
+ class HealthCheckResponse(BaseModel):
191
+ """Health check response model."""
192
+
193
+ status: str = Field(default="healthy", description="Service status", examples=["healthy"])
194
+ model_loaded: bool = Field(..., description="Model ready status", examples=[True])
195
+ version: str = Field(default="1.0.0", description="API version", examples=["1.0.0"])
196
+ timestamp: datetime = Field(default_factory=datetime.now, description="Timestamp")
197
+
198
+
199
+ class PredictionRecord(PredictionResponse):
200
+ """Extended prediction model with metadata from MLflow."""
201
+
202
+ run_id: str = Field(..., description="MLflow Run ID")
203
+ timestamp: datetime = Field(..., description="Prediction timestamp")
204
+ input_text: Optional[str] = Field(default="", description="Input text classified")
205
+
206
+ model_config = ConfigDict(
207
+ json_schema_extra={
208
+ "example": {
209
+ "predictions": [
210
+ {"skill_name": "Language/Java", "confidence": 0.92},
211
+ {"skill_name": "DevOps/CI-CD", "confidence": 0.78},
212
+ ],
213
+ "num_predictions": 2,
214
+ "model_version": "1.0.0",
215
+ "processing_time_ms": 125.5,
216
+ "run_id": "a1b2c3d4e5f6",
217
+ "timestamp": "2024-01-15T10:30:00Z",
218
+ "input_text": "Fix bug in authentication module",
219
+ }
220
+ }
221
+ )
hopcroft_skill_classification_tool_competition/config.py ADDED
@@ -0,0 +1,137 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Configuration and constants for the project"""
2
+
3
+ from pathlib import Path
4
+
5
+ # Project paths
6
+ PROJECT_DIR = Path(__file__).resolve().parents[1]
7
+ DATA_DIR = PROJECT_DIR / "data"
8
+ RAW_DATA_DIR = DATA_DIR / "raw"
9
+ PROCESSED_DATA_DIR = DATA_DIR / "processed"
10
+ MODELS_DIR = PROJECT_DIR / "models"
11
+ REPORTS_DIR = PROJECT_DIR / "reports"
12
+
13
+ # Dataset paths
14
+ DB_PATH = RAW_DATA_DIR / "skillscope_data.db"
15
+
16
+ # Data paths configuration for training
17
+ # Updated to use cleaned data (duplicates removed, no data leakage)
18
+ # Now pointing to TF-IDF features for API compatibility
19
+ DATA_PATHS = {
20
+ "features": str(PROCESSED_DATA_DIR / "tfidf" / "features_tfidf.npy"),
21
+ "labels": str(PROCESSED_DATA_DIR / "tfidf" / "labels_tfidf.npy"),
22
+ "features_original": str(PROCESSED_DATA_DIR / "tfidf" / "features_tfidf.npy"),
23
+ "labels_original": str(PROCESSED_DATA_DIR / "tfidf" / "labels_tfidf.npy"),
24
+ "models_dir": str(MODELS_DIR),
25
+ }
26
+
27
+ # Embedding configuration
28
+ EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
29
+
30
+ # API Configuration - which model to use for predictions
31
+ API_CONFIG = {
32
+ # Model file to load (without path, just filename)
33
+ "model_name": "random_forest_embedding_gridsearch.pkl",
34
+ # Feature type: "tfidf" or "embedding"
35
+ # This determines how text is transformed before prediction
36
+ "feature_type": "embedding",
37
+ }
38
+
39
+ # Training configuration
40
+ TRAINING_CONFIG = {
41
+ "random_state": 42,
42
+ "test_size": 0.2,
43
+ "val_size": 0.1,
44
+ "cv_folds": 5,
45
+ }
46
+
47
+ # Model configuration (Random Forest)
48
+ MODEL_CONFIG = {
49
+ "param_grid": {
50
+ "estimator__n_estimators": [50, 100, 200],
51
+ "estimator__max_depth": [10, 20, 30],
52
+ "estimator__min_samples_split": [2, 5],
53
+ }
54
+ }
55
+
56
+ # ADASYN configuration
57
+ ADASYN_CONFIG = {
58
+ "n_neighbors": 5,
59
+ "sampling_strategy": "auto",
60
+ }
61
+
62
+ # PCA configuration
63
+ PCA_CONFIG = {
64
+ "variance_retained": 0.95,
65
+ }
66
+
67
+ # MLflow configuration
68
+ MLFLOW_CONFIG = {
69
+ "uri": "https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow",
70
+ "experiments": {
71
+ "baseline": "hopcroft_random_forest_baseline",
72
+ "smote": "hopcroft_random_forest_smote",
73
+ "ros": "hopcroft_random_forest_ros",
74
+ "adasyn_pca": "hopcroft_random_forest_adasyn_pca",
75
+ "lightgbm": "hopcroft_lightgbm",
76
+ "lightgbm_smote": "hopcroft_lightgbm_smote",
77
+ },
78
+ }
79
+
80
+ # Model parameters (legacy - kept for compatibility)
81
+ RANDOM_STATE = 42
82
+ TEST_SIZE = 0.2
83
+ VAL_SIZE = 0.1
84
+
85
+ # Feature engineering
86
+ MAX_TFIDF_FEATURES = 5000
87
+ NGRAM_RANGE = (1, 2)
88
+
89
+ # Model training (legacy)
90
+ N_ESTIMATORS = 100
91
+ MAX_DEPTH = 20
92
+
93
+ # Hugging Face dataset
94
+ HF_REPO_ID = "NLBSE/SkillCompetition"
95
+ HF_FILENAME = "skillscope_data.zip"
96
+
97
+
98
+ def get_feature_paths(feature_type: str = "embedding", use_cleaned: bool = True) -> dict:
99
+ """
100
+ Get data paths for specified feature type.
101
+
102
+ This function allows easy switching between TF-IDF and Embedding features
103
+ for baseline reproduction (TF-IDF) vs improved model (Embeddings).
104
+
105
+ Args:
106
+ feature_type: Type of features - 'tfidf' or 'embedding'
107
+ use_cleaned: If True, use cleaned data (duplicates removed, no leakage).
108
+ If False, use original processed data.
109
+
110
+ Returns:
111
+ Dictionary with paths to features, labels, and models directory
112
+
113
+ Example:
114
+ # For baseline (paper reproduction)
115
+ paths = get_feature_paths(feature_type='tfidf', use_cleaned=True)
116
+
117
+ # For improved model
118
+ paths = get_feature_paths(feature_type='embedding', use_cleaned=True)
119
+ """
120
+ if feature_type not in ["tfidf", "embedding"]:
121
+ raise ValueError(f"Invalid feature_type: {feature_type}. Must be 'tfidf' or 'embedding'")
122
+
123
+ feature_dir = PROCESSED_DATA_DIR / feature_type
124
+
125
+ if use_cleaned:
126
+ suffix = "_clean"
127
+ else:
128
+ suffix = ""
129
+
130
+ return {
131
+ "features": str(feature_dir / f"features_{feature_type}{suffix}.npy"),
132
+ "labels": str(feature_dir / f"labels_{feature_type}{suffix}.npy"),
133
+ "features_test": str(feature_dir / f"X_test_{feature_type}{suffix}.npy"),
134
+ "labels_test": str(feature_dir / f"Y_test_{feature_type}{suffix}.npy"),
135
+ "models_dir": str(MODELS_DIR),
136
+ "feature_type": feature_type,
137
+ }
hopcroft_skill_classification_tool_competition/data_cleaning.py ADDED
@@ -0,0 +1,559 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Data Cleaning and Quality Assurance Module
3
+
4
+ This module addresses data quality issues identified by Deepchecks validation:
5
+ 1. Removes duplicate samples (6.5% duplicates detected)
6
+ 2. Resolves conflicting labels (8.9% samples with conflicts)
7
+ 3. Ensures proper train/test split without data leakage
8
+ 4. Removes highly correlated features
9
+
10
+ This script should be run BEFORE training to ensure data quality.
11
+ It regenerates the processed data files with cleaned data.
12
+
13
+ Usage:
14
+ python -m hopcroft_skill_classification_tool_competition.data_cleaning
15
+
16
+ Output:
17
+ - data/processed/tfidf/features_tfidf_clean.npy (cleaned training features)
18
+ - data/processed/tfidf/labels_tfidf_clean.npy (cleaned training labels)
19
+ - data/processed/tfidf/X_test_clean.npy (cleaned test features)
20
+ - data/processed/tfidf/Y_test_clean.npy (cleaned test labels)
21
+ """
22
+
23
+ from datetime import datetime
24
+ from pathlib import Path
25
+ from typing import Dict, Optional, Tuple
26
+
27
+ import numpy as np
28
+ import pandas as pd
29
+ from sklearn.model_selection import train_test_split
30
+
31
+ from hopcroft_skill_classification_tool_competition.config import PROCESSED_DATA_DIR
32
+
33
+
34
+ def remove_duplicates(X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray, Dict]:
35
+ """
36
+ Remove duplicate samples from the dataset.
37
+
38
+ Duplicates are identified by identical feature vectors.
39
+ When duplicates are found with different labels, we keep the first occurrence.
40
+
41
+ Args:
42
+ X: Feature matrix (samples x features)
43
+ y: Label matrix (samples x labels)
44
+
45
+ Returns:
46
+ Tuple of (cleaned_X, cleaned_y, stats_dict)
47
+ """
48
+ print("\n" + "=" * 80)
49
+ print("STEP 1: REMOVING DUPLICATES")
50
+ print("=" * 80)
51
+
52
+ initial_samples = X.shape[0]
53
+
54
+ # Convert to DataFrame for easier duplicate detection
55
+ # Use feature hash to identify duplicates (more memory efficient than full comparison)
56
+ df_features = pd.DataFrame(X)
57
+
58
+ # Find duplicates based on all features
59
+ duplicates_mask = df_features.duplicated(keep="first")
60
+ n_duplicates = duplicates_mask.sum()
61
+
62
+ print(f"Initial samples: {initial_samples:,}")
63
+ print(f"Duplicates found: {n_duplicates:,} ({n_duplicates / initial_samples * 100:.2f}%)")
64
+
65
+ if n_duplicates > 0:
66
+ # Keep only non-duplicate rows
67
+ X_clean = X[~duplicates_mask]
68
+ y_clean = y[~duplicates_mask]
69
+
70
+ print(f"Samples after removing duplicates: {X_clean.shape[0]:,}")
71
+ print(f"Removed: {n_duplicates:,} duplicate samples")
72
+ else:
73
+ X_clean = X
74
+ y_clean = y
75
+ print("No duplicates found")
76
+
77
+ stats = {
78
+ "initial_samples": int(initial_samples),
79
+ "duplicates_found": int(n_duplicates),
80
+ "duplicates_percentage": float(n_duplicates / initial_samples * 100),
81
+ "final_samples": int(X_clean.shape[0]),
82
+ }
83
+
84
+ return X_clean, y_clean, stats
85
+
86
+
87
+ def resolve_conflicting_labels(
88
+ X: np.ndarray, y: np.ndarray
89
+ ) -> Tuple[np.ndarray, np.ndarray, Dict]:
90
+ """
91
+ Resolve samples with conflicting labels.
92
+
93
+ Conflicting labels occur when identical feature vectors have different labels.
94
+ Resolution strategy: Use majority voting for each label across duplicates.
95
+
96
+ Args:
97
+ X: Feature matrix (samples x features)
98
+ y: Label matrix (samples x labels)
99
+
100
+ Returns:
101
+ Tuple of (cleaned_X, cleaned_y, stats_dict)
102
+ """
103
+ print("\n" + "=" * 80)
104
+ print("STEP 2: RESOLVING CONFLICTING LABELS")
105
+ print("=" * 80)
106
+
107
+ initial_samples = X.shape[0]
108
+
109
+ # Create a combined DataFrame
110
+ df_X = pd.DataFrame(X)
111
+ df_y = pd.DataFrame(y)
112
+
113
+ # Add a unique identifier based on features (use hash for efficiency)
114
+ # Create a string representation of each row
115
+ feature_hashes = pd.util.hash_pandas_object(df_X, index=False)
116
+
117
+ # Group by feature hash
118
+ groups = df_y.groupby(feature_hashes)
119
+
120
+ # Count conflicts: groups with size > 1
121
+ conflicts = groups.size()
122
+ n_conflict_groups = (conflicts > 1).sum()
123
+ n_conflict_samples = (conflicts[conflicts > 1]).sum()
124
+
125
+ print(f"Initial samples: {initial_samples:,}")
126
+ print(f"Duplicate feature groups: {n_conflict_groups:,}")
127
+ print(
128
+ f"Samples in conflict groups: {n_conflict_samples:,} ({n_conflict_samples / initial_samples * 100:.2f}%)"
129
+ )
130
+
131
+ if n_conflict_groups > 0:
132
+ # Resolve conflicts using majority voting
133
+ # For each group of duplicates, use the most common label value
134
+ resolved_labels = groups.apply(
135
+ lambda x: x.mode(axis=0).iloc[0] if len(x) > 1 else x.iloc[0]
136
+ )
137
+
138
+ # Keep only one sample per unique feature vector
139
+ unique_indices = ~df_X.duplicated(keep="first")
140
+ X_clean = X[unique_indices]
141
+
142
+ # Map resolved labels back to unique samples
143
+ unique_hashes = feature_hashes[unique_indices]
144
+ y_clean = np.array([resolved_labels.loc[h].values for h in unique_hashes])
145
+
146
+ print(f"Samples after conflict resolution: {X_clean.shape[0]:,}")
147
+ print("Conflicts resolved using majority voting")
148
+ else:
149
+ X_clean = X
150
+ y_clean = y
151
+ print("No conflicting labels found")
152
+
153
+ stats = {
154
+ "initial_samples": int(initial_samples),
155
+ "conflict_groups": int(n_conflict_groups),
156
+ "conflict_samples": int(n_conflict_samples),
157
+ "conflict_percentage": float(n_conflict_samples / initial_samples * 100),
158
+ "final_samples": int(X_clean.shape[0]),
159
+ }
160
+
161
+ return X_clean, y_clean, stats
162
+
163
+
164
+ def remove_sparse_samples(
165
+ X: np.ndarray, y: np.ndarray, min_nnz: int = 10
166
+ ) -> Tuple[np.ndarray, np.ndarray, Dict]:
167
+ """
168
+ Remove samples with too few non-zero features (incompatible with SMOTE).
169
+
170
+ Args:
171
+ X: Feature matrix
172
+ y: Label matrix
173
+ min_nnz: Minimum number of non-zero features required
174
+
175
+ Returns:
176
+ Tuple of (X_filtered, y_filtered, statistics_dict)
177
+ """
178
+ print("\n" + "=" * 80)
179
+ print(f"STEP 3: REMOVING SPARSE SAMPLES (min_nnz={min_nnz})")
180
+ print("=" * 80)
181
+
182
+ n_initial = X.shape[0]
183
+ print(f"Initial samples: {n_initial:,}")
184
+
185
+ nnz_counts = (X != 0).sum(axis=1)
186
+ valid_mask = nnz_counts >= min_nnz
187
+
188
+ X_filtered = X[valid_mask]
189
+ y_filtered = y[valid_mask]
190
+
191
+ n_removed = n_initial - X_filtered.shape[0]
192
+ removal_pct = (n_removed / n_initial * 100) if n_initial > 0 else 0
193
+
194
+ print(f"Sparse samples (< {min_nnz} features): {n_removed:,} ({removal_pct:.2f}%)")
195
+ print(f"Samples after filtering: {X_filtered.shape[0]:,}")
196
+
197
+ stats = {
198
+ "initial_samples": int(n_initial),
199
+ "min_nnz_threshold": min_nnz,
200
+ "sparse_samples_removed": int(n_removed),
201
+ "removal_percentage": float(removal_pct),
202
+ "final_samples": int(X_filtered.shape[0]),
203
+ }
204
+
205
+ return X_filtered, y_filtered, stats
206
+
207
+
208
+ def remove_empty_labels(
209
+ X: np.ndarray, y: np.ndarray, min_count: int = 5
210
+ ) -> Tuple[np.ndarray, np.ndarray, Dict]:
211
+ """
212
+ Remove labels with too few occurrences (cannot be stratified).
213
+
214
+ Args:
215
+ X: Feature matrix
216
+ y: Label matrix
217
+ min_count: Minimum number of occurrences required per label
218
+
219
+ Returns:
220
+ Tuple of (X_same, y_filtered, statistics_dict)
221
+ """
222
+ print("\n" + "=" * 80)
223
+ print(f"STEP 4: REMOVING RARE LABELS (min_count={min_count})")
224
+ print("=" * 80)
225
+
226
+ n_initial_labels = y.shape[1]
227
+ print(f"Initial labels: {n_initial_labels:,}")
228
+
229
+ label_counts = y.sum(axis=0)
230
+ valid_labels = label_counts >= min_count
231
+
232
+ y_filtered = y[:, valid_labels]
233
+
234
+ n_removed = n_initial_labels - y_filtered.shape[1]
235
+ removal_pct = (n_removed / n_initial_labels * 100) if n_initial_labels > 0 else 0
236
+
237
+ print(f"Rare labels (< {min_count} occurrences): {n_removed:,} ({removal_pct:.2f}%)")
238
+ print(f"Labels after filtering: {y_filtered.shape[1]:,}")
239
+
240
+ stats = {
241
+ "initial_labels": int(n_initial_labels),
242
+ "min_count_threshold": min_count,
243
+ "rare_labels_removed": int(n_removed),
244
+ "removal_percentage": float(removal_pct),
245
+ "final_labels": int(y_filtered.shape[1]),
246
+ }
247
+
248
+ return X, y_filtered, stats
249
+
250
+
251
+ def create_clean_train_test_split(
252
+ X: np.ndarray, y: np.ndarray, test_size: float = 0.2, random_state: int = 42
253
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, Dict]:
254
+ """
255
+ Create train/test split with verification of no data leakage.
256
+ Uses MultilabelStratifiedShuffleSplit if available.
257
+
258
+ Args:
259
+ X: Feature matrix
260
+ y: Label matrix
261
+ test_size: Proportion of test set (default: 0.2 = 20%)
262
+ random_state: Random seed for reproducibility
263
+
264
+ Returns:
265
+ Tuple of (X_train, X_test, y_train, y_test, stats_dict)
266
+ """
267
+ print("\n" + "=" * 80)
268
+ print("STEP 5: CREATING CLEAN TRAIN/TEST SPLIT")
269
+ print("=" * 80)
270
+
271
+ print(f"Total samples: {X.shape[0]:,}")
272
+ print(f"Test size: {test_size * 100:.1f}%")
273
+ print(f"Random state: {random_state}")
274
+
275
+ # Try to use iterative-stratification for better multi-label splits
276
+ try:
277
+ from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
278
+
279
+ has_iterstrat = True
280
+ print("Using MultilabelStratifiedShuffleSplit (iterative-stratification)")
281
+ except ImportError:
282
+ has_iterstrat = False
283
+ print(
284
+ "WARNING: iterative-stratification not installed. Using standard stratification (suboptimal for multi-label)."
285
+ )
286
+
287
+ if has_iterstrat:
288
+ msss = MultilabelStratifiedShuffleSplit(
289
+ n_splits=1, test_size=test_size, random_state=random_state
290
+ )
291
+ train_index, test_index = next(msss.split(X, y))
292
+ X_train, X_test = X[train_index], X[test_index]
293
+ y_train, y_test = y[train_index], y[test_index]
294
+ else:
295
+ # Fallback: Perform stratified split based on first label column (approximate stratification)
296
+ stratify_column = y[:, 0] if y.ndim > 1 else y
297
+ X_train, X_test, y_train, y_test = train_test_split(
298
+ X, y, test_size=test_size, random_state=random_state, stratify=stratify_column
299
+ )
300
+
301
+ # Verify no data leakage: check for overlapping samples
302
+ print("\nVerifying no data leakage...")
303
+
304
+ # Convert to sets of row hashes for efficient comparison
305
+ train_hashes = set(pd.util.hash_pandas_object(pd.DataFrame(X_train), index=False))
306
+ test_hashes = set(pd.util.hash_pandas_object(pd.DataFrame(X_test), index=False))
307
+
308
+ overlap = train_hashes & test_hashes
309
+
310
+ if len(overlap) > 0:
311
+ raise ValueError(
312
+ f"DATA LEAKAGE DETECTED: {len(overlap)} samples appear in both train and test!"
313
+ )
314
+
315
+ print("No data leakage detected")
316
+ print(f"Train samples: {X_train.shape[0]:,} ({X_train.shape[0] / X.shape[0] * 100:.1f}%)")
317
+ print(f"Test samples: {X_test.shape[0]:,} ({X_test.shape[0] / X.shape[0] * 100:.1f}%)")
318
+
319
+ # Verify feature dimensions match
320
+ if X_train.shape[1] != X_test.shape[1]:
321
+ raise ValueError(
322
+ f"Feature dimensions don't match: train={X_train.shape[1]}, test={X_test.shape[1]}"
323
+ )
324
+
325
+ print(f"Feature dimensions match: {X_train.shape[1]:,}")
326
+
327
+ stats = {
328
+ "total_samples": int(X.shape[0]),
329
+ "train_samples": int(X_train.shape[0]),
330
+ "test_samples": int(X_test.shape[0]),
331
+ "train_percentage": float(X_train.shape[0] / X.shape[0] * 100),
332
+ "test_percentage": float(X_test.shape[0] / X.shape[0] * 100),
333
+ "features": int(X_train.shape[1]),
334
+ "labels": int(y_train.shape[1]) if y_train.ndim > 1 else 1,
335
+ "data_leakage": False,
336
+ "overlap_samples": 0,
337
+ "stratification_method": "MultilabelStratifiedShuffleSplit"
338
+ if has_iterstrat
339
+ else "Standard StratifiedShuffleSplit",
340
+ }
341
+
342
+ return X_train, X_test, y_train, y_test, stats
343
+
344
+
345
def save_cleaned_data(
    X_train: np.ndarray,
    X_test: np.ndarray,
    y_train: np.ndarray,
    y_test: np.ndarray,
    stats: Dict,
    output_dir: Optional[Path] = None,
    feature_type: str = "tfidf",
) -> None:
    """
    Persist the cleaned train/test split to disk as .npy files.

    Args:
        X_train: Training feature matrix
        X_test: Test feature matrix
        y_train: Training label matrix
        y_test: Test label matrix
        stats: Dictionary with cleaning statistics.
            NOTE(review): currently accepted but not written to disk — confirm
            whether persisting it (e.g. as JSON) is intended.
        output_dir: Output directory (default: data/processed/{feature_type}/)
        feature_type: Type of features ('tfidf' or 'embedding')
    """
    print("\n" + "=" * 80)
    print("STEP 6: SAVING CLEANED DATA")
    print("=" * 80)

    # Fall back to the conventional processed-data location for this feature type.
    target_dir = PROCESSED_DATA_DIR / feature_type if output_dir is None else output_dir
    target_dir.mkdir(parents=True, exist_ok=True)

    # Cleaned artifacts mirror the raw feature files, with a "_clean" suffix.
    files = {
        "features_train": target_dir / f"features_{feature_type}_clean.npy",
        "labels_train": target_dir / f"labels_{feature_type}_clean.npy",
        "features_test": target_dir / f"X_test_{feature_type}_clean.npy",
        "labels_test": target_dir / f"Y_test_{feature_type}_clean.npy",
    }

    # Write each array next to its matching path.
    for key, array in (
        ("features_train", X_train),
        ("labels_train", y_train),
        ("features_test", X_test),
        ("labels_test", y_test),
    ):
        np.save(files[key], array)

    print(f"\nSaved cleaned data to: {target_dir}")
    for path in files.values():
        print(f" - {path.name}")
391
+
392
+
393
def clean_and_split_data(
    test_size: float = 0.2,
    random_state: int = 42,
    regenerate_features: bool = True,
    feature_type: str = "embedding",  # 'tfidf' or 'embedding'
    model_name: str = "all-MiniLM-L6-v2",
    max_features: int = 2000,  # Only for TF-IDF (must match features.py default)
) -> Dict:
    """
    Main function to clean data and create proper train/test split.

    This function:
    1. Loads or regenerates features (TF-IDF or Embeddings)
    2. Removes duplicate samples
    3. Resolves conflicting labels
    4. Creates clean train/test split
    5. Verifies no data leakage
    6. Saves cleaned data

    Args:
        test_size: Proportion of test set (default: 0.2)
        random_state: Random seed for reproducibility (default: 42)
        regenerate_features: If True, regenerate features from database (default: True)
        feature_type: Type of features to extract ('tfidf' or 'embedding')
        model_name: Model name for embeddings
        max_features: Maximum number of TF-IDF features (default: 2000).
            NOTE(review): this value is never forwarded to feature extraction
            below — it is only echoed in the printed configuration. Confirm it
            stays in sync with the features.py default.

    Returns:
        Dictionary with all cleaning statistics
    """
    print("=" * 80)
    print("DATA CLEANING AND QUALITY ASSURANCE PIPELINE")
    print("=" * 80)
    print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Test size: {test_size * 100:.1f}%")
    print(f"Random state: {random_state}")
    print(f"Regenerate features: {regenerate_features}")
    print(f"Feature type: {feature_type}")
    if feature_type == "embedding":
        print(f"Model name: {model_name}")
    else:
        print(f"Max features: {max_features}")

    # Step 0: Load or generate features
    if regenerate_features:
        print("\nRegenerating features from database...")
        # Load data and extract features.
        # Local import avoids a circular dependency at module import time.
        from hopcroft_skill_classification_tool_competition.features import create_feature_dataset

        # Use the unified create_feature_dataset function.
        # NOTE(review): max_features is NOT passed here; create_feature_dataset
        # uses its own default for TF-IDF.
        features, labels, _, _ = create_feature_dataset(
            save_processed=False,  # Don't save intermediate raw features, just return them
            feature_type=feature_type,
            model_name=model_name,
        )

        X = features
        y = labels.values
    else:
        # Reuse previously extracted features from data/processed/{feature_type}/.
        print(f"\nLoading existing features ({feature_type})...")
        data_dir = PROCESSED_DATA_DIR / feature_type
        X = np.load(data_dir / f"features_{feature_type}.npy")
        y = np.load(data_dir / f"labels_{feature_type}.npy")

    print("\nInitial data shape:")
    print(f" Features: {X.shape}")
    print(f" Labels: {y.shape}")

    # Step 1: Remove duplicates
    X_no_dup, y_no_dup, dup_stats = remove_duplicates(X, y)

    # Step 2: Resolve conflicting labels
    X_no_conf, y_no_conf, conflict_stats = resolve_conflicting_labels(X_no_dup, y_no_dup)

    # Step 3: Remove sparse samples
    # For embeddings, we don't have "sparse" features in the same way as TF-IDF (zeros).
    # But we can check for near-zero vectors if needed.
    # For now, we skip sparse check for embeddings or keep it if it checks for all-zeros.
    if feature_type == "tfidf":
        X_no_sparse, y_no_sparse, sparse_stats = remove_sparse_samples(
            X_no_conf, y_no_conf, min_nnz=10
        )
    else:
        # Skip sparse check for embeddings as they are dense
        X_no_sparse, y_no_sparse = X_no_conf, y_no_conf
        # Stub stats so the summary below can print uniformly for both paths.
        sparse_stats = {"sparse_samples_removed": 0, "removal_percentage": 0.0}
        print("\nSkipping sparse sample removal for dense embeddings.")

    # Step 4: Remove rare labels
    X_clean, y_clean, rare_stats = remove_empty_labels(X_no_sparse, y_no_sparse, min_count=5)

    # Step 5: Create clean train/test split
    X_train, X_test, y_train, y_test, split_stats = create_clean_train_test_split(
        X_clean, y_clean, test_size=test_size, random_state=random_state
    )

    # Step 6: Save cleaned data
    all_stats = {
        "duplicates": dup_stats,
        "conflicts": conflict_stats,
        "sparse_samples": sparse_stats,
        "rare_labels": rare_stats,
        "split": split_stats,
        "feature_type": feature_type,
    }

    # Save to specific directory based on feature type
    output_dir = PROCESSED_DATA_DIR / feature_type
    save_cleaned_data(
        X_train,
        X_test,
        y_train,
        y_test,
        all_stats,
        output_dir=output_dir,
        feature_type=feature_type,
    )

    # Print final summary
    print("\n" + "=" * 80)
    print("CLEANING PIPELINE COMPLETED SUCCESSFULLY")
    print("=" * 80)
    print("\nSummary:")
    print(f" Original samples: {X.shape[0]:,}")
    print(f" Original labels: {y.shape[1]:,}")
    print(
        f" Duplicates removed: {dup_stats['duplicates_found']:,} ({dup_stats['duplicates_percentage']:.2f}%)"
    )
    print(
        f" Conflicts resolved: {conflict_stats['conflict_samples']:,} ({conflict_stats['conflict_percentage']:.2f}%)"
    )
    print(
        f" Sparse samples removed: {sparse_stats['sparse_samples_removed']:,} ({sparse_stats['removal_percentage']:.2f}%)"
    )
    print(
        f" Rare labels removed: {rare_stats['rare_labels_removed']:,} ({rare_stats['removal_percentage']:.2f}%)"
    )
    print(f" Final clean samples: {split_stats['total_samples']:,}")
    print(f" Final clean labels: {y_clean.shape[1]:,}")
    print(
        f" Train samples: {split_stats['train_samples']:,} ({split_stats['train_percentage']:.1f}%)"
    )
    print(
        f" Test samples: {split_stats['test_samples']:,} ({split_stats['test_percentage']:.1f}%)"
    )
    print("\nData quality issues resolved:")
    print(" - Duplicates removed")
    print(" - Label conflicts resolved")
    if feature_type == "tfidf":
        print(" - Sparse samples removed")
    print(" - Rare labels removed")
    print(" - Clean train/test split created")
    print(" - No data leakage verified")
    print("=" * 80)

    return all_stats
549
+
550
+
551
if __name__ == "__main__":
    # Script entry point: run the full cleaning pipeline with the default
    # embedding-based configuration (80/20 split, fixed seed for reproducibility).
    pipeline_config = {
        "test_size": 0.2,  # 80/20 split
        "random_state": 42,
        "regenerate_features": True,
        "feature_type": "embedding",
        "model_name": "all-MiniLM-L6-v2",
    }
    stats = clean_and_split_data(**pipeline_config)
hopcroft_skill_classification_tool_competition/dataset.py ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Scripts to download or generate data"""
2
+
3
+ from pathlib import Path
4
+ import shutil
5
+ import zipfile
6
+
7
+ from huggingface_hub import hf_hub_download
8
+
9
+ from hopcroft_skill_classification_tool_competition.config import (
10
+ HF_FILENAME,
11
+ HF_REPO_ID,
12
+ RAW_DATA_DIR,
13
+ )
14
+
15
+
16
def download_skillscope_dataset(output_dir: Path = None) -> Path:
    """
    Download and extract SkillScope dataset from Hugging Face Hub.

    The dataset contains a SQLite database (skillscope_data.db) with:
    - nlbse_tool_competition_data_by_issue: Main table with PR features and skill labels
    - vw_nlbse_tool_competition_data_by_file: View with file-level labels

    Dataset details:
    - 7,245 merged pull requests from 11 Java repositories
    - 57,206 source files; 59,644 methods; 13,097 classes
    - 217 skill labels (domain/sub-domain pairs)

    Args:
        output_dir: Directory where to save the dataset (default: data/raw)

    Returns:
        Path to the extracted database file

    Raises:
        FileNotFoundError: If skillscope_data.db is not present after
            extracting the downloaded archive.
    """
    if output_dir is None:
        output_dir = RAW_DATA_DIR

    output_dir.mkdir(parents=True, exist_ok=True)
    db_path = output_dir / "skillscope_data.db"

    # Idempotent: skip the download entirely if the database already exists.
    if db_path.exists():
        print(f"Database already exists at: {db_path}")
        return db_path

    print("Downloading SkillScope dataset from Hugging Face...")

    # Download without using cache - use local_dir to avoid .cache folder
    # NOTE(review): local_dir_use_symlinks is deprecated in recent
    # huggingface_hub releases — confirm the pinned version still accepts it.
    zip_path = hf_hub_download(
        repo_id=HF_REPO_ID,
        filename=HF_FILENAME,
        repo_type="dataset",
        local_dir=output_dir,
        local_dir_use_symlinks=False,  # Don't create symlinks, copy directly
    )

    print(f"Downloaded to: {zip_path}")
    print("Extracting database...")

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(output_dir)

    # Fail loudly if the archive did not contain the expected file.
    if not db_path.exists():
        raise FileNotFoundError(f"Database file not found at: {db_path}")

    print(f"Database extracted to: {db_path}")

    # Clean up: remove zip file
    print("Cleaning up temporary files...")
    Path(zip_path).unlink()

    # Clean up: remove .cache folder if exists (left behind by hf_hub_download)
    cache_dir = output_dir / ".cache"
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
        print("Removed .cache folder")

    # Clean up: remove download folder if exists
    download_dir = output_dir / "download"
    if download_dir.exists():
        shutil.rmtree(download_dir)
        print("Removed download folder")

    print("Cleanup completed")

    print("\nDataset info:")
    print(" - Table: nlbse_tool_competition_data_by_issue")
    print(" - View: vw_nlbse_tool_competition_data_by_file")

    return db_path
90
+
91
+
92
if __name__ == "__main__":
    # Script entry point: fetch and extract the dataset, framed by banners.
    banner = "=" * 80
    print(banner)
    print("SKILLSCOPE DATASET DOWNLOAD")
    print(banner)
    download_skillscope_dataset()
    print(banner)
    print("DOWNLOAD COMPLETED")
    print(banner)
hopcroft_skill_classification_tool_competition/features.py ADDED
@@ -0,0 +1,492 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Feature extraction module for skill classification.
3
+
4
+ This module provides functions to extract features from the SkillScope dataset,
5
+ starting with TF-IDF vectorization of textual data from pull request issues.
6
+
7
+ Dataset Information (from nlbse_tool_competition_data_by_issue):
8
+ - 7,154 issues from 11 Java repositories
9
+ - 226 total columns:
10
+ - 2 text columns: 'issue text' (title) and 'issue description' (body)
11
+ - metadata and other columns containing PR/file/context information
12
+ - 217 label columns: domain/subdomain skill labels (142 active labels in this DB)
13
+
14
+ Label Characteristics:
15
+ - Multi-label classification problem
16
+ - Average 32.9 labels per issue (median: 31)
17
+ - Highly imbalanced: some labels appear in all issues, others in very few
18
+ - Top labels: Language, Data Structure, DevOps, Error Handling
19
+ """
20
+
21
+ from pathlib import Path
22
+ import re
23
+ import sqlite3
24
+ from typing import Optional, Tuple
25
+
26
+ import joblib
27
+
28
+ # Import per lo Stemming
29
+ from nltk.stem import PorterStemmer
30
+ import numpy as np
31
+ import pandas as pd
32
+ from sklearn.feature_extraction.text import TfidfVectorizer
33
+
34
+ from hopcroft_skill_classification_tool_competition.config import (
35
+ MODELS_DIR,
36
+ PROCESSED_DATA_DIR,
37
+ RAW_DATA_DIR,
38
+ )
39
+
40
+ # Inizializza lo stemmer una volta per efficienza
41
+ stemmer = PorterStemmer()
42
+
43
+
44
def clean_github_text(text: str, use_stemming: bool = True) -> str:
    """
    Clean GitHub issue text as per SkillScope paper (Aracena et al. process).

    Strips URLs, HTML tags, markdown code blocks, inline code, emojis and
    other non-ASCII characters, and collapses whitespace. Optionally applies
    Porter stemming afterwards.

    Args:
        text: Raw text from GitHub issue (may be None/NaN)
        use_stemming: If True, apply Porter stemming (recommended for TF-IDF).
            If False, keep original words (recommended for Embeddings/LLMs).

    Returns:
        Cleaned text string (stemmed if use_stemming=True); "" for missing input.
    """
    # Treat NaN/None as empty text.
    if pd.isna(text) or text is None:
        return ""

    cleaned = str(text)

    # Noise patterns removed in order: URLs, HTML tags, fenced code blocks,
    # inline code. Each match is replaced by the empty string.
    for pattern in (r"http\S+|www\.\S+", r"<[^>]+>", r"```[\s\S]*?```", r"`[^`]*`"):
        cleaned = re.sub(pattern, "", cleaned)

    # Drop emojis and any other non-ASCII characters.
    cleaned = cleaned.encode("ascii", "ignore").decode("ascii")

    # Collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()

    # Conditional stemming: only for TF-IDF, not for embeddings.
    if not use_stemming:
        return cleaned

    # Porter-stem each token; on failure fall back to the cleaned,
    # unstemmed text rather than raising.
    try:
        return " ".join(stemmer.stem(token) for token in cleaned.split())
    except Exception as e:
        print(f"Warning: Stemming failed for text snippet '{cleaned[:50]}...'. Error: {e}")
        return cleaned.strip()
95
+
96
+
97
def get_dataset_info(df: pd.DataFrame) -> dict:
    """
    Get summary information about the dataset.

    Args:
        df: Input dataframe

    Returns:
        Dictionary containing dataset statistics (issue counts, column lists,
        and per-issue / per-label label-distribution aggregates)
    """
    text_cols = get_text_columns(df)
    label_cols = get_label_columns(df)

    # Binarize label counts (>0 means the skill is present) before aggregating.
    presence = (df[label_cols] > 0).astype(int)
    per_issue = presence.sum(axis=1)  # labels attached to each issue
    per_label = presence.sum(axis=0)  # issues carrying each label

    return {
        "total_issues": len(df),
        "total_columns": len(df.columns),
        "text_columns": text_cols,
        "num_text_columns": len(text_cols),
        "label_columns": label_cols,
        "num_labels": len(label_cols),
        "avg_labels_per_issue": per_issue.mean(),
        "median_labels_per_issue": per_issue.median(),
        "max_labels_per_issue": per_issue.max(),
        "min_labels_per_issue": per_issue.min(),
        "avg_issues_per_label": per_label.mean(),
        "labels_with_no_issues": (per_label == 0).sum(),
    }
131
+
132
+
133
def load_data_from_db(db_path: Optional[Path] = None) -> pd.DataFrame:
    """
    Load data from the SQLite database.

    The connection is always closed, even if the query fails.

    Args:
        db_path: Path to the SQLite database file.
                 If None, uses default path in data/raw/skillscope_data.db

    Returns:
        DataFrame containing the nlbse_tool_competition_data_by_issue table

    Raises:
        Propagates sqlite3/pandas errors if the database cannot be opened
        or the table does not exist.
    """
    if db_path is None:
        db_path = RAW_DATA_DIR / "skillscope_data.db"

    conn = sqlite3.connect(db_path)
    try:
        # Load the main table
        query = "SELECT * FROM nlbse_tool_competition_data_by_issue"
        df = pd.read_sql_query(query, conn)
    finally:
        # Fix: previously the connection leaked if read_sql_query raised.
        conn.close()

    print(f"Loaded {len(df)} records from database")
    return df
157
+
158
+
159
def get_text_columns(df: pd.DataFrame) -> list:
    """
    Identify text columns in the dataframe (typically issue title, body, etc.).

    Args:
        df: Input dataframe

    Returns:
        List of column names containing textual data
    """
    # Known free-text columns in the SkillScope schema: issue title and body.
    # Only those actually present in this dataframe are returned, in this order.
    candidates = ("issue text", "issue description")
    return [name for name in candidates if name in df.columns]
174
+
175
+
176
def get_label_columns(df: pd.DataFrame) -> list:
    """
    Identify label columns (domains/subdomains with API counts).

    A label column is any numeric column that is not one of the known
    metadata columns.

    Args:
        df: Input dataframe

    Returns:
        List of column names containing labels
    """
    # Robust numeric-dtype check (handles all pandas dtype representations).
    from pandas.api.types import is_numeric_dtype

    # Metadata columns to exclude from labels (not skill labels).
    metadata = {
        "Repo Name",
        "PR #",
        "issue text",
        "issue description",
        "created_at",
        "author_name",
    }

    return [c for c in df.columns if c not in metadata and is_numeric_dtype(df[c])]
206
+
207
+
208
def combine_text_fields(
    df: pd.DataFrame, text_columns: list, use_stemming: bool = True
) -> pd.Series:
    """
    Combine multiple text fields into a single text representation per row.

    Each selected column is cleaned with clean_github_text (SkillScope paper
    process) and the cleaned pieces are joined with a single space.

    Args:
        df: Input dataframe
        text_columns: List of column names to combine
        use_stemming: If True, apply stemming (for TF-IDF). If False, keep
            original words (for Embeddings).

    Returns:
        Series containing cleaned and combined text for each row
    """
    def _clean_row(row: pd.Series) -> str:
        # Clean every field of the row, then join the cleaned parts with spaces.
        cleaned_parts = row.map(lambda t: clean_github_text(t, use_stemming=use_stemming))
        return " ".join(cleaned_parts)

    # Missing values become empty strings before cleaning.
    return df[text_columns].fillna("").astype(str).apply(_clean_row, axis=1)
236
+
237
+
238
def extract_tfidf_features(
    df: pd.DataFrame,
    text_columns: Optional[list] = None,
    max_features: Optional[int] = 2000,
    min_df: int = 2,
    max_df: float = 0.95,
    ngram_range: Tuple[int, int] = (1, 2),
) -> Tuple[np.ndarray, TfidfVectorizer]:
    """
    Extract TF-IDF features from textual data.

    Args:
        df: Input dataframe
        text_columns: List of text columns to use. If None, auto-detect.
        max_features: Maximum number of features to extract (default: 2000 for
            balanced sparsity); None keeps every term.
        min_df: Minimum document frequency for a term to be included
        max_df: Maximum document frequency (ignore terms appearing in >max_df of docs)
        ngram_range: Range of n-grams to consider (e.g., (1,2) for unigrams and bigrams)

    Returns:
        Tuple of (dense feature matrix, fitted vectorizer)

    Raises:
        ValueError: If no text columns are found in the dataframe.
    """
    if text_columns is None:
        text_columns = get_text_columns(df)

    if not text_columns:
        raise ValueError("No text columns found in dataframe")

    # Stemming is enabled here: TF-IDF benefits from collapsing word variants.
    print(f"Combining text from columns: {text_columns}")
    corpus = combine_text_fields(df, text_columns, use_stemming=True)

    vectorizer = TfidfVectorizer(
        max_features=max_features,
        min_df=min_df,
        max_df=max_df,
        ngram_range=ngram_range,
        stop_words="english",
        lowercase=True,
        strip_accents="unicode",
    )

    print(
        f"Extracting TF-IDF features with max_features={max_features if max_features else 'All'}, "
        f"ngram_range={ngram_range}"
    )
    matrix = vectorizer.fit_transform(corpus)

    print(
        f"Extracted {matrix.shape[1]} TF-IDF features from {matrix.shape[0]} samples"
    )

    # Densify for downstream numpy-based tooling.
    return matrix.toarray(), vectorizer
293
+
294
+
295
def extract_embedding_features(
    df: pd.DataFrame,
    text_columns: Optional[list] = None,
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 32,
) -> Tuple[np.ndarray, object]:
    """
    Extract LLM embeddings from textual data using Sentence Transformers.

    Args:
        df: Input dataframe
        text_columns: List of text columns to use. If None, auto-detect.
        model_name: Name of the pre-trained model to use
        batch_size: Batch size for encoding

    Returns:
        Tuple of (feature matrix, model object)

    Raises:
        ImportError: If sentence-transformers is not installed.
        ValueError: If no text columns are found in the dataframe.
    """
    # Optional dependency: only needed for the embedding feature path.
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as e:
        raise ImportError(
            f"sentence-transformers import failed: {e}. Try running: pip install sentence-transformers"
        ) from e

    if text_columns is None:
        text_columns = get_text_columns(df)

    if not text_columns:
        raise ValueError("No text columns found in dataframe")

    # No stemming here: transformer models expect natural, unstemmed words.
    print(f"Combining text from columns: {text_columns}")
    corpus = combine_text_fields(df, text_columns, use_stemming=False)

    print(f"Loading embedding model: {model_name}")
    encoder = SentenceTransformer(model_name)

    print(f"Extracting embeddings for {len(corpus)} samples...")
    embeddings = encoder.encode(
        corpus.tolist(),
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
    )

    print(f"Extracted embeddings shape: {embeddings.shape}")

    return embeddings, encoder
346
+
347
+
348
def prepare_labels(df: pd.DataFrame, label_columns: Optional[list] = None) -> pd.DataFrame:
    """
    Prepare multi-label binary matrix from label columns.

    Args:
        df: Input dataframe
        label_columns: List of label columns. If None, auto-detect.

    Returns:
        DataFrame with binary labels (1 if label present, 0 otherwise)
    """
    if label_columns is None:
        label_columns = get_label_columns(df)

    # Any positive API count means the skill label applies to the issue.
    labels = df[label_columns].gt(0).astype(int)

    print(f"Prepared {len(label_columns)} labels")
    print(f"Label distribution:\n{labels.sum().describe()}")

    return labels
369
+
370
+
371
def create_feature_dataset(
    db_path: Optional[Path] = None,
    save_processed: bool = True,
    feature_type: str = "tfidf",  # 'tfidf' or 'embedding'
    model_name: str = "all-MiniLM-L6-v2",
) -> Tuple[np.ndarray, pd.DataFrame, list, list]:
    """
    Main function to create the complete feature dataset.

    Pipeline: load the issue table from SQLite, print summary statistics,
    extract either TF-IDF or sentence-embedding features from the text
    columns, binarize the label columns, and (optionally) persist features,
    labels, the fitted TF-IDF vectorizer, and the label names to disk.

    Args:
        db_path: Path to SQLite database
        save_processed: Whether to save processed data to disk
        feature_type: Type of features to extract ('tfidf' or 'embedding')
        model_name: Model name for embeddings (ignored if feature_type='tfidf')

    Returns:
        Tuple of (features, labels, feature_names, label_names)

    Raises:
        ValueError: If feature_type is neither 'tfidf' nor 'embedding'.
    """
    # Load data
    df = load_data_from_db(db_path)

    # Get dataset info
    info = get_dataset_info(df)
    print("\n=== Dataset Information ===")
    print(f"Total issues: {info['total_issues']:,}")
    print(f"Text columns: {info['text_columns']}")
    print(f"Number of labels: {info['num_labels']}")
    print(f"Avg labels per issue: {info['avg_labels_per_issue']:.2f}")
    print(f"Labels with no issues: {info['labels_with_no_issues']}")

    # Extract features
    text_columns = get_text_columns(df)
    label_columns = get_label_columns(df)

    feature_names = []

    # Only set for the TF-IDF path; embeddings have no fitted vectorizer.
    vectorizer = None

    if feature_type == "tfidf":
        features, vectorizer = extract_tfidf_features(df, text_columns=text_columns)
        feature_names = vectorizer.get_feature_names_out()
    elif feature_type == "embedding":
        features, _ = extract_embedding_features(
            df, text_columns=text_columns, model_name=model_name
        )
        # Embeddings have no interpretable vocabulary; use positional names.
        feature_names = [f"emb_{i}" for i in range(features.shape[1])]
    else:
        raise ValueError(f"Unknown feature_type: {feature_type}")

    # Prepare labels
    labels = prepare_labels(df, label_columns)

    # Save processed data
    if save_processed:
        # Path: processed/{feature_type}/
        output_dir = PROCESSED_DATA_DIR / feature_type
        output_dir.mkdir(parents=True, exist_ok=True)

        features_path = output_dir / f"features_{feature_type}.npy"
        labels_path = output_dir / f"labels_{feature_type}.npy"

        np.save(features_path, features)
        np.save(labels_path, labels.values)

        print(f"\nSaved processed data to {output_dir}")
        print(f" - {features_path.name}: {features.shape}")
        print(f" - {labels_path.name}: {labels.shape}")

        # Save vectorizer and label names to models/ directory for inference
        MODELS_DIR.mkdir(parents=True, exist_ok=True)

        if feature_type == "tfidf" and vectorizer is not None:
            vectorizer_path = MODELS_DIR / "tfidf_vectorizer.pkl"
            joblib.dump(vectorizer, vectorizer_path)
            print(f" - Saved TF-IDF vectorizer to: {vectorizer_path}")

        # Always save label names (needed for both tfidf and embedding inference)
        label_names_path = MODELS_DIR / "label_names.pkl"
        joblib.dump(label_columns, label_names_path)
        print(f" - Saved {len(label_columns)} label names to: {label_names_path}")

    return features, labels, feature_names, label_columns
453
+
454
+
455
def load_processed_data(
    feature_name: str = "tfidf", data_dir: Optional[Path] = None
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Load processed features and labels from disk.

    Args:
        feature_name: Name prefix of the features to load (e.g., 'tfidf', 'bow', 'embeddings')
        data_dir: Path to processed data directory. If None, uses default.

    Returns:
        Tuple of (features, labels)
    """
    base = PROCESSED_DATA_DIR if data_dir is None else data_dir

    # Files follow the naming scheme written by create_feature_dataset.
    features = np.load(base / f"features_{feature_name}.npy")
    labels = np.load(base / f"labels_{feature_name}.npy")

    print(f"Loaded processed data from {base}")
    print(f" - Feature type: {feature_name}")
    print(f" - Features shape: {features.shape}")
    print(f" - Labels shape: {labels.shape}")

    return features, labels
483
+
484
+
485
if __name__ == "__main__":
    # Script entry point: build the embedding feature set and report its shape.
    features, labels, feature_names, label_names = create_feature_dataset(feature_type="embedding")

    summary_lines = [
        "\n=== Feature Extraction Summary ===",
        f"Features shape: {features.shape}",
        f"Labels shape: {labels.shape}",
        f"Number of feature names: {len(feature_names)}",
        f"Number of labels: {len(label_names)}",
    ]
    for line in summary_lines:
        print(line)
hopcroft_skill_classification_tool_competition/main.py ADDED
@@ -0,0 +1,434 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ FastAPI application for skill classification service.
3
+
4
+ Provides REST API endpoints for classifying GitHub issues and pull requests
5
+ into skill categories using machine learning models.
6
+
7
+ Usage:
8
+ Development: fastapi dev hopcroft_skill_classification_tool_competition/main.py
9
+ Production: fastapi run hopcroft_skill_classification_tool_competition/main.py
10
+
11
+ Endpoints:
12
+ GET / - API information
13
+ GET /health - Health check
14
+ POST /predict - Single issue classification
15
+ POST /predict/batch - Batch classification
16
+ """
17
+
18
+ from contextlib import asynccontextmanager
19
+ from datetime import datetime
20
+ import json
21
+ import os
22
+ import time
23
+ from typing import List
24
+
25
+ from fastapi import FastAPI, HTTPException, status
26
+ from fastapi.responses import JSONResponse, RedirectResponse
27
+ import mlflow
28
+ from pydantic import ValidationError
29
+
30
+ from hopcroft_skill_classification_tool_competition.api_models import (
31
+ BatchIssueInput,
32
+ BatchPredictionResponse,
33
+ ErrorResponse,
34
+ HealthCheckResponse,
35
+ IssueInput,
36
+ PredictionRecord,
37
+ PredictionResponse,
38
+ SkillPrediction,
39
+ )
40
+ from hopcroft_skill_classification_tool_competition.config import MLFLOW_CONFIG
41
+ from hopcroft_skill_classification_tool_competition.modeling.predict import SkillPredictor
42
+
43
# Global predictor instance; set by lifespan() at startup. Remains None if
# model loading fails (degraded mode — prediction endpoints will error).
predictor = None
# Reported model version string (echoed during startup logging).
model_version = "1.0.0"
45
+
46
+
47
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Manage application startup and shutdown.

    On startup: configures the MLflow tracking URI, then loads the model named
    by the MODEL_NAME environment variable into the module-level `predictor`.
    If loading fails, the API still starts in a degraded mode where prediction
    endpoints will fail but /health and /docs remain available.
    On shutdown: logs a message (no resources need explicit release here).
    """
    global predictor, model_version
    # NOTE(review): model_version is declared global but never reassigned in
    # this function — confirm whether it was meant to be derived from the model.

    print("=" * 80)
    print("Starting Skill Classification API")
    print("=" * 80)

    # Configure MLflow
    mlflow.set_tracking_uri(MLFLOW_CONFIG["uri"])
    print(f"MLflow tracking URI set to: {MLFLOW_CONFIG['uri']}")

    try:
        # Model file is configurable via env var; defaults to the TF-IDF RF model.
        model_name = os.getenv("MODEL_NAME", "random_forest_tfidf_gridsearch.pkl")
        print(f"Loading model: {model_name}")
        predictor = SkillPredictor(model_name=model_name)
        print("Model and artifacts loaded successfully")
    except Exception as e:
        # Deliberate best-effort: keep the API up even without model artifacts.
        print(f"Failed to load model: {e}")
        print("WARNING: API starting in degraded mode (prediction will fail)")

    print(f"Model version {model_version} initialized")
    print("API ready")
    print("=" * 80)

    # Hand control to the running application; code below runs at shutdown.
    yield

    print("Shutting down API")
76
+
77
+
78
# FastAPI application instance; `lifespan` wires model loading into startup.
app = FastAPI(
    title="Skill Classification API",
    description="API for classifying GitHub issues and pull requests into skill categories",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc",
    lifespan=lifespan,
)
86
+
87
+
88
@app.get("/", tags=["Root"])
async def root():
    """Return basic API information and links to the docs, demo, and health endpoints."""
    return {
        "message": "Skill Classification API",
        "version": "1.0.0",
        "documentation": "/docs",
        "demo": "/demo",
        "health": "/health",
    }
98
+
99
+
100
@app.get("/health", response_model=HealthCheckResponse, tags=["Health"])
async def health_check():
    """Check API and model status."""
    model_ready = predictor is not None
    return HealthCheckResponse(
        status="healthy",
        model_loaded=model_ready,
        version="1.0.0",
        timestamp=datetime.now(),
    )
109
+
110
+
111
@app.get("/demo")
async def redirect_to_demo():
    """Redirect to the Streamlit demo UI.

    The target URL is read from the STREAMLIT_URL environment variable so the
    redirect also works in deployed environments (Docker networks, reverse
    proxies) instead of only on a developer machine. Defaults to the previous
    hard-coded local address, so existing setups are unaffected.
    """
    demo_url = os.getenv("STREAMLIT_URL", "http://localhost:8501")
    return RedirectResponse(url=demo_url)
115
+
116
+
117
@app.post(
    "/predict",
    response_model=PredictionRecord,
    status_code=status.HTTP_201_CREATED,
    tags=["Prediction"],
    summary="Classify a single issue",
    response_description="Skill predictions with confidence scores",
)
async def predict_skills(issue: IssueInput) -> PredictionRecord:
    """
    Classify a single GitHub issue or pull request into skill categories.

    Args:
        issue: IssueInput containing issue text and optional metadata

    Returns:
        PredictionRecord with list of predicted skills, confidence scores, and run_id

    Raises:
        HTTPException: 503 if the model is not loaded, 500 if prediction fails
    """
    start_time = time.time()

    # Checked before the generic handler below so a missing model surfaces as
    # 503 (service unavailable). Previously this raise sat inside the try and
    # was swallowed by `except Exception`, rewrapping it as a 500.
    if predictor is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    try:
        # The predictor expects a single string; fold optional metadata in.
        full_text = f"{issue.issue_text} {issue.issue_description or ''} {issue.repo_name or ''}"

        predictions_data = predictor.predict(full_text)

        # Convert raw dicts to Pydantic models
        predictions = [
            SkillPrediction(skill_name=p["skill_name"], confidence=p["confidence"])
            for p in predictions_data
        ]

        processing_time = (time.time() - start_time) * 1000

        # Log to MLflow (best-effort: a tracking failure must not fail the request)
        run_id = "local"
        timestamp = datetime.now()

        try:
            experiment_name = MLFLOW_CONFIG["experiments"]["baseline"]
            mlflow.set_experiment(experiment_name)

            with mlflow.start_run() as run:
                run_id = run.info.run_id
                # Log inputs
                mlflow.log_param("issue_text", issue.issue_text)
                if issue.repo_name:
                    mlflow.log_param("repo_name", issue.repo_name)

                # Log the top prediction as param/metric for easy querying
                if predictions:
                    mlflow.log_param("top_skill", predictions[0].skill_name)
                    mlflow.log_metric("top_confidence", predictions[0].confidence)

                # Store the full prediction list as a JSON tag so the
                # /predictions/{run_id} endpoint can reconstruct it later.
                predictions_json = json.dumps([p.model_dump() for p in predictions])
                mlflow.set_tag("predictions_json", predictions_json)
                mlflow.set_tag("model_version", model_version)
        except Exception as e:
            print(f"MLflow logging failed: {e}")

        return PredictionRecord(
            predictions=predictions,
            num_predictions=len(predictions),
            model_version=model_version,
            processing_time_ms=round(processing_time, 2),
            run_id=run_id,
            timestamp=timestamp,
            input_text=issue.issue_text,
        )

    except HTTPException:
        # Preserve deliberate HTTP errors instead of rewrapping them as 500.
        raise
    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Prediction failed: {str(e)}",
        )
201
+
202
+
203
@app.post(
    "/predict/batch",
    response_model=BatchPredictionResponse,
    status_code=status.HTTP_200_OK,
    tags=["Prediction"],
    summary="Classify multiple issues",
    response_description="Batch skill predictions",
)
async def predict_skills_batch(batch: BatchIssueInput) -> BatchPredictionResponse:
    """
    Classify multiple GitHub issues or pull requests in batch.

    Args:
        batch: BatchIssueInput containing list of issues (max 100)

    Returns:
        BatchPredictionResponse with prediction results for each issue

    Raises:
        HTTPException: 503 if the model is not loaded, 500 if batch prediction fails
    """
    start_time = time.time()

    # Checked before the generic handler below so a missing model surfaces as
    # 503. Previously this raise sat inside the try and was swallowed by
    # `except Exception`, rewrapping it as a 500.
    if predictor is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    try:
        results = []

        for issue in batch.issues:
            # The predictor expects a single string; fold optional metadata in.
            full_text = (
                f"{issue.issue_text} {issue.issue_description or ''} {issue.repo_name or ''}"
            )
            predictions_data = predictor.predict(full_text)

            predictions = [
                SkillPrediction(skill_name=p["skill_name"], confidence=p["confidence"])
                for p in predictions_data
            ]

            results.append(
                PredictionResponse(
                    predictions=predictions,
                    num_predictions=len(predictions),
                    model_version=model_version,
                )
            )

        total_processing_time = (time.time() - start_time) * 1000

        return BatchPredictionResponse(
            results=results,
            total_issues=len(batch.issues),
            total_processing_time_ms=round(total_processing_time, 2),
        )

    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Batch prediction failed: {str(e)}",
        )
264
+
265
+
266
@app.get(
    "/predictions/{run_id}",
    response_model=PredictionRecord,
    status_code=status.HTTP_200_OK,
    tags=["Prediction"],
    summary="Get a prediction by ID",
    response_description="Prediction details",
)
async def get_prediction(run_id: str) -> PredictionRecord:
    """
    Retrieve a specific prediction by its MLflow Run ID.

    Args:
        run_id: The MLflow Run ID

    Returns:
        PredictionRecord containing the prediction details

    Raises:
        HTTPException: 404 if the run is not found, 500 on any other error
    """
    try:
        run = mlflow.get_run(run_id)
        run_data = run.data

        # Rebuild the prediction list stored as a JSON tag at prediction time.
        stored_json = run_data.tags.get("predictions_json", "[]")
        predictions = [SkillPrediction(**item) for item in json.loads(stored_json)]

        # MLflow reports start_time in epoch milliseconds.
        timestamp = datetime.fromtimestamp(run.info.start_time / 1000.0)

        return PredictionRecord(
            predictions=predictions,
            num_predictions=len(predictions),
            model_version=run_data.tags.get("model_version", "unknown"),
            processing_time_ms=None,  # not persisted with the run
            run_id=run.info.run_id,
            timestamp=timestamp,
            input_text=run_data.params.get("issue_text", ""),
        )

    except mlflow.exceptions.MlflowException:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND, detail=f"Prediction with ID {run_id} not found"
        )
    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Failed to retrieve prediction: {str(e)}",
        )
318
+
319
+
320
@app.get(
    "/predictions",
    response_model=List[PredictionRecord],
    status_code=status.HTTP_200_OK,
    tags=["Prediction"],
    summary="List predictions",
    response_description="List of recent predictions",
)
async def list_predictions(skip: int = 0, limit: int = 10) -> List[PredictionRecord]:
    """
    Retrieve a list of recent predictions.

    Args:
        skip: Number of records to skip (not supported server-side by MLflow
            search; applied client-side after fetching)
        limit: Maximum number of records to return

    Returns:
        List of PredictionRecord

    Raises:
        HTTPException: 500 if the MLflow query fails
    """

    # Hoisted out of the row loop (was redefined on every iteration).
    def get_val(row, prefix, key, default=None):
        """Safely read a flattened 'tags.*' / 'params.*' column from a search row."""
        col = f"{prefix}.{key}"
        return row[col] if col in row else default

    try:
        experiment_name = MLFLOW_CONFIG["experiments"]["baseline"]
        experiment = mlflow.get_experiment_by_name(experiment_name)

        if not experiment:
            return []

        # mlflow.search_runs returns a pandas DataFrame with flattened columns
        # (tags prefixed 'tags.', params prefixed 'params.').
        runs = mlflow.search_runs(
            experiment_ids=[experiment.experiment_id],
            max_results=limit + skip,
            order_by=["start_time DESC"],
        )

        if runs.empty:
            return []

        # Apply skip client-side
        runs = runs.iloc[skip:]

        results = []
        for _, row in runs.iterrows():
            run_id = row.run_id

            # Missing tags surface as NaN (a float) in the DataFrame, not None,
            # so all fallbacks below check for "is this actually a string".
            predictions_json = get_val(row, "tags", "predictions_json", "[]")
            if not isinstance(predictions_json, str):
                predictions_json = "[]"
            try:
                predictions_data = json.loads(predictions_json)
                predictions = [SkillPrediction(**p) for p in predictions_data]
            except Exception:
                predictions = []

            timestamp = row.start_time  # already a datetime in the DataFrame

            # Named `record_version` to avoid shadowing the module-level
            # `model_version` global.
            record_version = get_val(row, "tags", "model_version")
            if not isinstance(record_version, str) or record_version == "":
                record_version = "unknown"

            input_text = get_val(row, "params", "issue_text")
            if not isinstance(input_text, str):
                input_text = ""

            results.append(
                PredictionRecord(
                    predictions=predictions,
                    num_predictions=len(predictions),
                    model_version=record_version,
                    processing_time_ms=None,
                    run_id=run_id,
                    timestamp=timestamp,
                    input_text=input_text,
                )
            )

        return results

    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Failed to list predictions: {str(e)}",
        )
413
+
414
+
415
@app.exception_handler(ValidationError)
async def validation_exception_handler(request, exc: ValidationError):
    """Handle Pydantic validation errors."""
    payload = ErrorResponse(
        error="Validation Error", detail=str(exc), timestamp=datetime.now()
    ).model_dump()
    return JSONResponse(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, content=payload)
424
+
425
+
426
@app.exception_handler(HTTPException)
async def http_exception_handler(request, exc: HTTPException):
    """Handle HTTP exceptions."""
    payload = ErrorResponse(
        error=exc.detail, detail=None, timestamp=datetime.now()
    ).model_dump()
    return JSONResponse(status_code=exc.status_code, content=payload)
hopcroft_skill_classification_tool_competition/mlsmote.py ADDED
@@ -0,0 +1,157 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # github: https://github.com/niteshsukhwani/MLSMOTE.git
2
+ # -*- coding: utf-8 -*-
3
+ # Importing required Library
4
+ import random
5
+
6
+ import numpy as np
7
+ import pandas as pd
8
+ from sklearn.datasets import make_classification
9
+ from sklearn.neighbors import NearestNeighbors
10
+
11
+
12
def create_dataset(n_sample=1000):
    """
    Create an unevenly distributed multilabel classification sample
    dataset using sklearn's make_classification.

    Args:
        n_sample: int, number of samples to create

    Returns:
        X: pandas.DataFrame, feature vector dataframe with 10 features
        y: pandas.DataFrame, one-hot target dataframe with 5 labels
    """
    X, y = make_classification(
        n_classes=5,
        class_sep=2,
        weights=[0.1, 0.025, 0.205, 0.008, 0.9],
        n_informative=3,
        n_redundant=1,
        flip_y=0,
        n_features=10,
        n_clusters_per_class=1,
        # Bug fix: was hard-coded to 1000, silently ignoring the n_sample
        # argument. Default is unchanged, so existing callers see no change.
        n_samples=n_sample,
        random_state=10,
    )
    # One-hot encode the integer class labels into a multilabel-style frame.
    y = pd.get_dummies(y, prefix="class")
    return pd.DataFrame(X), y
38
+
39
+
40
def get_tail_label(df):
    """
    Identify the tail (minority) label columns of a multi-label target frame.

    A label is a "tail" label when its imbalance ratio (IRLbl = count of the
    most frequent label / count of this label) exceeds the mean imbalance
    ratio (MeanIR) across all labels.

    args
        df: pandas.DataFrame, target label df whose tail labels are identified

    return
        tail_label: list, column names of all tail labels
    """
    labels = list(df.columns)
    # Positive-sample count per label column.
    pos_counts = np.array([df[label].value_counts()[1] for label in labels], dtype=float)
    # Per-label imbalance ratio relative to the most frequent label.
    irpl = pos_counts.max() / pos_counts
    # Mean imbalance ratio: the tail threshold.
    mir = np.average(irpl)
    return [label for label, ratio in zip(labels, irpl) if ratio > mir]
62
+
63
+
64
def get_index(df):
    """
    Collect the row indices of all instances that carry at least one tail label.

    args
        df: pandas.DataFrame, target label df from which tail-label row
            indices are identified

    return
        index: list, index values of all rows carrying a tail label
    """
    rows = set()
    for label in get_tail_label(df):
        # Union of all rows where this tail label is positive.
        rows |= set(df[df[label] == 1].index)
    return list(rows)
79
+
80
+
81
def get_minority_instace(X, y):
    """
    Extract the minority (tail-label) subset of the dataset.

    Note: the function name keeps the original spelling ("instace") for
    backward compatibility with existing imports.

    args
        X: pandas.DataFrame, the feature vector dataframe
        y: pandas.DataFrame, the target vector dataframe

    return
        X_sub: pandas.DataFrame, feature rows carrying a tail label
        y_sub: pandas.DataFrame, matching target rows
    """
    minority_idx = get_index(y)
    X_sub = X[X.index.isin(minority_idx)].reset_index(drop=True)
    y_sub = y[y.index.isin(minority_idx)].reset_index(drop=True)
    return X_sub, y_sub
97
+
98
+
99
def nearest_neighbour(X):
    """
    Return, for every instance, the indices of its 5 nearest neighbours
    (euclidean distance, kd-tree search). Each row's first neighbour is the
    instance itself.

    args
        X: np.array, array whose nearest neighbours are computed

    return
        indices: (n, 5) array of neighbour indices per instance
    """
    knn = NearestNeighbors(n_neighbors=5, metric="euclidean", algorithm="kd_tree")
    knn.fit(X)
    _, indices = knn.kneighbors(X)
    return indices
112
+
113
+
114
def MLSMOTE(X, y, n_sample):
    """
    Give the augmented data using MLSMOTE algorithm.

    Uses module-level `random` (unseeded), so results are not reproducible
    across calls unless the caller seeds `random` beforehand.

    args
        X: pandas.DataFrame, input (feature) vector DataFrame
        y: pandas.DataFrame, target vector dataframe
        n_sample: int, number of newly generated samples

    return
        new_X: pandas.DataFrame, original plus synthetic feature rows
        target: pandas.DataFrame, original plus synthetic target rows
    """
    indices2 = nearest_neighbour(X)
    n = len(indices2)
    new_X = np.zeros((n_sample, X.shape[1]))
    target = np.zeros((n_sample, y.shape[1]))
    for i in range(n_sample):
        # Pick a random reference row and one of its 4 non-self neighbours.
        reference = random.randint(0, n - 1)
        neighbour = random.choice(indices2[reference, 1:])
        all_point = indices2[reference]
        # Majority vote over the 5-neighbourhood: label is set when more than
        # 2 of the 5 neighbours (including the reference itself) carry it.
        nn_df = y[y.index.isin(all_point)]
        ser = nn_df.sum(axis=0, skipna=True)
        target[i] = np.array([1 if val > 2 else 0 for val in ser])
        ratio = random.random()
        # NOTE(review): gap is reference - neighbour, so the synthetic point is
        # extrapolated beyond the reference, away from the neighbour. Canonical
        # SMOTE interpolates between the two (neighbour - reference). Kept
        # as-is to match the upstream reference implementation — confirm intent.
        gap = X.loc[reference, :] - X.loc[neighbour, :]
        new_X[i] = np.array(X.loc[reference, :] + ratio * gap)
    new_X = pd.DataFrame(new_X, columns=X.columns)
    target = pd.DataFrame(target, columns=y.columns)
    # Append synthetic rows to the originals (index values are duplicated).
    new_X = pd.concat([X, new_X], axis=0)
    target = pd.concat([y, target], axis=0)
    return new_X, target
146
+
147
+
148
+ # Keep original MLSMOTE function name for direct use
149
+
150
+
151
if __name__ == "__main__":
    """
    main function to use the MLSMOTE
    """
    X, y = create_dataset()  # Create a synthetic imbalanced dataframe
    X_sub, y_sub = get_minority_instace(X, y)  # Extract the minority (tail-label) rows
    X_res, y_res = MLSMOTE(X_sub, y_sub, 100)  # Augment with 100 synthetic samples
hopcroft_skill_classification_tool_competition/modeling/predict.py ADDED
@@ -0,0 +1,198 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from pathlib import Path
2
+ from typing import Any, Dict, List, Optional
3
+
4
+ import joblib
5
+ import numpy as np
6
+
7
+ from hopcroft_skill_classification_tool_competition.config import (
8
+ API_CONFIG,
9
+ DATA_PATHS,
10
+ EMBEDDING_MODEL_NAME,
11
+ MODELS_DIR,
12
+ )
13
+ from hopcroft_skill_classification_tool_competition.features import clean_github_text
14
+
15
+
16
+ class SkillPredictor:
17
+ """
18
+ Skill prediction class that supports both TF-IDF and Embedding-based models.
19
+
20
+ The feature_type determines how text is transformed:
21
+ - "tfidf": Uses saved TfidfVectorizer
22
+ - "embedding": Uses SentenceTransformer to generate embeddings
23
+ """
24
+
25
+ def __init__(self, model_name: Optional[str] = None, feature_type: Optional[str] = None):
26
+ """
27
+ Initialize the SkillPredictor.
28
+
29
+ Args:
30
+ model_name: Name of the model file. If None, uses API_CONFIG["model_name"]
31
+ feature_type: "tfidf" or "embedding". If None, uses API_CONFIG["feature_type"]
32
+ """
33
+ # Use config defaults if not specified
34
+ self.model_name = model_name or API_CONFIG["model_name"]
35
+ self.feature_type = feature_type or API_CONFIG["feature_type"]
36
+
37
+ self.model_path = MODELS_DIR / self.model_name
38
+ self.labels_path = MODELS_DIR / "label_names.pkl"
39
+
40
+ # Paths for kept indices (may be in different locations)
41
+ self.kept_indices_path_models = MODELS_DIR / "kept_label_indices.npy"
42
+ self.kept_indices_path_tfidf = (
43
+ Path(DATA_PATHS["features"]).parent.parent / "tfidf" / "kept_label_indices.npy"
44
+ )
45
+ self.kept_indices_path_emb = (
46
+ Path(DATA_PATHS["features"]).parent.parent / "embedding" / "kept_label_indices.npy"
47
+ )
48
+
49
+ self.model = None
50
+ self.vectorizer = None # TF-IDF vectorizer or SentenceTransformer
51
+ self.label_names = None
52
+ self.kept_indices = None
53
+
54
+ self._load_artifacts()
55
+
56
+ def _load_artifacts(self):
57
+ """Load model and required artifacts based on feature_type."""
58
+ print(f"Loading model from {self.model_path}...")
59
+ if not self.model_path.exists():
60
+ raise FileNotFoundError(f"Model not found at {self.model_path}")
61
+ self.model = joblib.load(self.model_path)
62
+
63
+ # Load vectorizer/encoder based on feature type
64
+ if self.feature_type == "tfidf":
65
+ self._load_tfidf_vectorizer()
66
+ elif self.feature_type == "embedding":
67
+ self._load_embedding_model()
68
+ else:
69
+ raise ValueError(
70
+ f"Unknown feature_type: {self.feature_type}. Must be 'tfidf' or 'embedding'"
71
+ )
72
+
73
+ # Load label names
74
+ print(f"Loading label names from {self.labels_path}...")
75
+ if not self.labels_path.exists():
76
+ raise FileNotFoundError(f"Label names not found at {self.labels_path}")
77
+ self.label_names = joblib.load(self.labels_path)
78
+
79
+ # Load kept indices if available
80
+ if self.kept_indices_path_models.exists():
81
+ print(f"Loading kept indices from {self.kept_indices_path_models}")
82
+ self.kept_indices = np.load(self.kept_indices_path_models)
83
+ elif self.kept_indices_path_emb.exists():
84
+ print(f"Loading kept indices from {self.kept_indices_path_emb}")
85
+ self.kept_indices = np.load(self.kept_indices_path_emb)
86
+ elif self.kept_indices_path_tfidf.exists():
87
+ print(f"Loading kept indices from {self.kept_indices_path_tfidf}")
88
+ self.kept_indices = np.load(self.kept_indices_path_tfidf)
89
+ else:
90
+ print("No kept_label_indices.npy found. Assuming all labels are used.")
91
+ self.kept_indices = None
92
+
93
+ def _load_tfidf_vectorizer(self):
94
+ """Load the TF-IDF vectorizer."""
95
+ vectorizer_path = MODELS_DIR / "tfidf_vectorizer.pkl"
96
+ print(f"Loading TF-IDF vectorizer from {vectorizer_path}...")
97
+ if not vectorizer_path.exists():
98
+ raise FileNotFoundError(
99
+ f"TF-IDF vectorizer not found at {vectorizer_path}. "
100
+ "Run feature extraction first: python -m hopcroft_skill_classification_tool_competition.features"
101
+ )
102
+ self.vectorizer = joblib.load(vectorizer_path)
103
+
104
+ def _load_embedding_model(self):
105
+ """Load the SentenceTransformer model for embeddings."""
106
+ try:
107
+ from sentence_transformers import SentenceTransformer
108
+ except ImportError as e:
109
+ raise ImportError(
110
+ f"sentence-transformers is required for embedding-based models. "
111
+ f"Install with: pip install sentence-transformers. Error: {e}"
112
+ ) from e
113
+
114
+ print(f"Loading SentenceTransformer model: {EMBEDDING_MODEL_NAME}...")
115
+ self.vectorizer = SentenceTransformer(EMBEDDING_MODEL_NAME)
116
+
117
+ def _transform_text(self, text: str) -> np.ndarray:
118
+ """
119
+ Transform text to features based on feature_type.
120
+
121
+ Args:
122
+ text: Cleaned input text
123
+
124
+ Returns:
125
+ Feature array ready for model prediction
126
+ """
127
+ if self.feature_type == "tfidf":
128
+ # TF-IDF: use stemming, return sparse matrix converted to array
129
+ cleaned = clean_github_text(text, use_stemming=True)
130
+ features = self.vectorizer.transform([cleaned])
131
+ return features
132
+ else:
133
+ # Embedding: no stemming (LLMs need full words)
134
+ cleaned = clean_github_text(text, use_stemming=False)
135
+ features = self.vectorizer.encode([cleaned], convert_to_numpy=True)
136
+ return features
137
+
138
+ def predict(self, text: str, threshold: float = 0.5) -> List[Dict[str, Any]]:
139
+ """
140
+ Predict skills for a given text.
141
+
142
+ Args:
143
+ text: Input text (issue title + body)
144
+ threshold: Confidence threshold for binary classification
145
+
146
+ Returns:
147
+ List of dicts with 'skill_name' and 'confidence'
148
+ """
149
+ # Transform text to features
150
+ features = self._transform_text(text)
151
+
152
+ # Predict
153
+ # MultiOutputClassifier predict_proba returns a list of arrays (one per class)
154
+ # Each array is (n_samples, 2) -> [prob_0, prob_1]
155
+ probas_list = self.model.predict_proba(features)
156
+
157
+ # Extract positive class probabilities
158
+ confidence_scores = []
159
+ for i, prob in enumerate(probas_list):
160
+ if prob.shape[1] >= 2:
161
+ confidence_scores.append(prob[0][1])
162
+ else:
163
+ # Only one class present
164
+ try:
165
+ estimator = self.model.estimators_[i]
166
+ classes = estimator.classes_
167
+ if len(classes) == 1 and classes[0] == 1:
168
+ confidence_scores.append(1.0)
169
+ else:
170
+ confidence_scores.append(0.0)
171
+ except Exception:
172
+ confidence_scores.append(0.0)
173
+
174
+ confidence_scores = np.array(confidence_scores)
175
+
176
+ # Filter by threshold and map to label names
177
+ predictions = []
178
+
179
+ for i, score in enumerate(confidence_scores):
180
+ if score >= threshold:
181
+ if self.kept_indices is not None:
182
+ if i < len(self.kept_indices):
183
+ original_idx = self.kept_indices[i]
184
+ skill_name = self.label_names[original_idx]
185
+ else:
186
+ continue
187
+ else:
188
+ if i < len(self.label_names):
189
+ skill_name = self.label_names[i]
190
+ else:
191
+ skill_name = f"Unknown_Skill_{i}"
192
+
193
+ predictions.append({"skill_name": skill_name, "confidence": float(score)})
194
+
195
+ # Sort by confidence descending
196
+ predictions.sort(key=lambda x: x["confidence"], reverse=True)
197
+
198
+ return predictions
hopcroft_skill_classification_tool_competition/modeling/train.py ADDED
@@ -0,0 +1,858 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import os
3
+ from pathlib import Path
4
+
5
+ from imblearn.over_sampling import ADASYN, RandomOverSampler
6
+ import joblib
7
+ import lightgbm as lgb
8
+ import mlflow
9
+ import mlflow.sklearn
10
+ import numpy as np
11
+ from sklearn.decomposition import PCA
12
+ from sklearn.ensemble import RandomForestClassifier
13
+ from sklearn.metrics import f1_score, precision_score, recall_score
14
+ from sklearn.model_selection import GridSearchCV, KFold, train_test_split
15
+ from sklearn.multioutput import MultiOutputClassifier
16
+
17
+ from hopcroft_skill_classification_tool_competition.config import (
18
+ ADASYN_CONFIG,
19
+ DATA_PATHS,
20
+ MLFLOW_CONFIG,
21
+ MODEL_CONFIG,
22
+ PCA_CONFIG,
23
+ TRAINING_CONFIG,
24
+ get_feature_paths,
25
+ )
26
+
27
# Local MLSMOTE implementation (lightweight multi-label oversampling).
# Imported defensively: training can still run without it as long as the
# MLSMOTE sampler is not requested; callers feature-detect via the flag.
try:
    import pandas as pd

    from hopcroft_skill_classification_tool_competition.mlsmote import MLSMOTE as mlsmote_function
    from hopcroft_skill_classification_tool_competition.mlsmote import get_minority_instace

    _HAS_LOCAL_MLSMOTE = True
except Exception:
    # None sentinels keep attribute access errors explicit at the call site.
    mlsmote_function = None
    get_minority_instace = None
    _HAS_LOCAL_MLSMOTE = False
    print("[warning] Local MLSMOTE not available. Check mlsmote.py exists.")
40
+
41
+
42
# Prefer multilabel stratified splits for imbalanced multi-label data.
# Use `iterative-stratification` package when available.
try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

    _HAS_MLSTRAT = True
except Exception:
    # Optional dependency: the split helpers below fall back to sklearn's
    # train_test_split when this flag is False.
    MultilabelStratifiedShuffleSplit = None
    _HAS_MLSTRAT = False
51
+
52
+
53
# -------------------------------
# MLflow authentication and setup
# Load environment variables from .env file (for local dev)
# In Docker, env vars are set via docker-compose env_file
# -------------------------------
from dotenv import load_dotenv

load_dotenv()

# Environment variable takes precedence over the project config default.
_mlflow_env_uri = os.getenv("MLFLOW_TRACKING_URI")
_configured_uri = MLFLOW_CONFIG.get("uri", "https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow")

if _mlflow_env_uri:
    mlflow_uri = _mlflow_env_uri
else:
    mlflow_uri = _configured_uri

# If targeting DagsHub, require username/password; otherwise proceed.
# Failing fast here avoids opaque auth errors mid-training.
if "dagshub.com" in mlflow_uri:
    _username = os.getenv("MLFLOW_TRACKING_USERNAME")
    _password = os.getenv("MLFLOW_TRACKING_PASSWORD")
    if not _username or not _password:
        raise ValueError(
            "Set the environment variables MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD for remote tracking"
        )

mlflow.set_tracking_uri(mlflow_uri)
80
+
81
+
82
+ # =====================================================
83
+ # Common utilities (merged from train_experiments.py)
84
+ # =====================================================
85
def load_data(feature_type="tfidf", use_cleaned=True):
    """Load features and labels using get_feature_paths.

    Args:
        feature_type: 'tfidf' or 'embedding'
        use_cleaned: whether to use the cleaned variant of the data

    Returns:
        X, Y: feature matrix and multi-label target matrix
    """
    paths = get_feature_paths(feature_type=feature_type, use_cleaned=use_cleaned)
    X = np.load(paths["features"])
    Y = np.load(paths["labels"])

    suffix = "_clean" if use_cleaned else ""
    print(f"Dataset loaded successfully: {X.shape} samples, {Y.shape} labels")
    print(f"Using feature type: {feature_type}{suffix}")
    return X, Y
102
+
103
+
104
def stratified_train_test_split(X, Y, test_size=None, random_state=None, fallback=True):
    """Split X, Y using multilabel stratified shuffle split when possible.

    Args:
        X: np.ndarray features
        Y: np.ndarray multi-label binary matrix (n_samples, n_labels)
        test_size: float or int, forwarded to splitter; None uses
            TRAINING_CONFIG["test_size"] (default 0.2)
        random_state: int
        fallback: if True and multilabel splitter unavailable, use sklearn.train_test_split

    Returns:
        X_train, X_test, Y_train, Y_test

    Raises:
        RuntimeError: if the multilabel splitter is unavailable and fallback is False
    """
    if _HAS_MLSTRAT:
        # Bug fix: previously only float test_size was forwarded; an int value
        # (a valid absolute sample count for the splitter) was silently
        # replaced by the config default. Now any explicit value is honored.
        tst = test_size if test_size is not None else TRAINING_CONFIG.get("test_size", 0.2)

        msss = MultilabelStratifiedShuffleSplit(
            n_splits=1, test_size=tst, random_state=random_state
        )
        train_idx, test_idx = next(msss.split(X, Y))
        return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

    if fallback:
        print(
            "[warning] iterative-stratification not available; using standard train_test_split (no multilabel stratification). To enable stratified multilabel splitting install 'iterative-stratification'."
        )
        return train_test_split(X, Y, test_size=test_size, random_state=random_state, shuffle=True)

    raise RuntimeError(
        "iterative-stratification is required for multilabel stratified splitting but not installed."
    )
139
+
140
+
141
def stratified_train_val_test_split(
    X, Y, test_size=0.2, val_size=0.1, random_state=None, fallback=True
):
    """Split X, Y into train, val, test with multilabel stratification when possible.

    Args:
        X, Y: arrays
        test_size: proportion for final test set
        val_size: proportion for validation set (relative to whole dataset)
        random_state: seed
        fallback: if True, falls back to sklearn splits

    Returns:
        X_train, X_val, X_test, Y_train, Y_val, Y_test

    Raises:
        ValueError: if the requested fractions are invalid or sum to >= 1
    """
    sizes_ok = 0.0 < test_size < 1.0 and 0.0 <= val_size < 1.0 and (val_size + test_size) < 1.0
    if not sizes_ok:
        raise ValueError("test_size and val_size must be fractions in (0,1) and sum < 1")

    # Carve off the held-out test set first.
    X_rem, X_test, Y_rem, Y_test = stratified_train_test_split(
        X, Y, test_size=test_size, random_state=random_state, fallback=fallback
    )

    # Validation fraction expressed relative to the remaining (non-test) data.
    remaining = 1.0 - test_size
    rel_val = val_size / remaining if remaining > 0 else 0.0

    if rel_val <= 0:
        # No validation requested: return empty val arrays with matching widths.
        empty_X = np.empty((0, X.shape[1]))
        empty_Y = np.empty((0, Y.shape[1]))
        return X_rem, empty_X, X_test, Y_rem, empty_Y, Y_test

    X_train, X_val, Y_train, Y_val = stratified_train_test_split(
        X_rem, Y_rem, test_size=rel_val, random_state=random_state, fallback=fallback
    )

    return X_train, X_val, X_test, Y_train, Y_val, Y_test
180
+
181
+
182
+ def _check_label_coverage(Y_train: np.ndarray, Y_val: np.ndarray, min_train: int = 1):
183
+ """Check that each label appears at least `min_train` times in train and
184
+ at least once in train+val. Prints a warning if some labels are scarce in
185
+ train, and raises an error if some labels are missing entirely from
186
+ train+val (which would make learning impossible for those labels).
187
+
188
+ Args:
189
+ Y_train: (n_train, n_labels) binary matrix
190
+ Y_val: (n_val, n_labels) binary matrix (may be empty)
191
+ min_train: minimum occurrences in train to be considered "covered"
192
+ """
193
+ # Defensive: handle empty val
194
+ if Y_val is None:
195
+ Y_val = np.empty((0, Y_train.shape[1]))
196
+
197
+ counts_train = np.sum(Y_train, axis=0)
198
+ counts_train_val = counts_train + np.sum(Y_val, axis=0)
199
+
200
+ missing_in_train = np.where(counts_train < min_train)[0]
201
+ missing_in_train_val = np.where(counts_train_val == 0)[0]
202
+
203
+ if missing_in_train.size > 0:
204
+ # Small, actionable warning for debugging
205
+ preview = missing_in_train[:10].tolist()
206
+ print(
207
+ f"[warning] {missing_in_train.size} label(s) have <{min_train} occurrences in TRAIN. Example label indices: {preview}."
208
+ )
209
+
210
+ if missing_in_train_val.size > 0:
211
+ preview = missing_in_train_val[:10].tolist()
212
+ raise ValueError(
213
+ f"{missing_in_train_val.size} label(s) have 0 occurrences in TRAIN+VAL (indices example: {preview}). "
214
+ "Reduce test/val size, aggregate labels, or ensure these labels exist in the source DB."
215
+ )
216
+
217
+
218
+ def evaluate_and_log(model, X_test, Y_test, best_params, cv_score, exp_name, extra_params=None):
219
+ Y_pred = model.predict(X_test)
220
+ precision = precision_score(Y_test, Y_pred, average="micro", zero_division=0)
221
+ recall = recall_score(Y_test, Y_pred, average="micro", zero_division=0)
222
+ f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
223
+
224
+ mlflow.log_metrics(
225
+ {
226
+ "cv_best_f1_micro": cv_score,
227
+ "test_precision_micro": precision,
228
+ "test_recall_micro": recall,
229
+ "test_f1_micro": f1,
230
+ }
231
+ )
232
+
233
+ for k, v in best_params.items():
234
+ mlflow.log_param(k, v)
235
+ if extra_params:
236
+ for k, v in extra_params.items():
237
+ mlflow.log_param(k, v)
238
+
239
+ os.makedirs(DATA_PATHS["models_dir"], exist_ok=True)
240
+ model_path = Path(DATA_PATHS["models_dir"]) / f"{exp_name}.pkl"
241
+ joblib.dump(model, model_path)
242
+ mlflow.log_artifact(str(model_path), artifact_path=f"model_{exp_name}")
243
+ print(f"Model saved to {model_path}")
244
+ print(f"{exp_name} completed and logged successfully.\n")
245
+
246
+
247
+ def run_grid_search(X, Y):
248
+ base_rf = RandomForestClassifier(random_state=TRAINING_CONFIG["random_state"], n_jobs=-1)
249
+ multi = MultiOutputClassifier(base_rf)
250
+ cv = KFold(
251
+ n_splits=TRAINING_CONFIG["cv_folds"],
252
+ shuffle=True,
253
+ random_state=TRAINING_CONFIG["random_state"],
254
+ )
255
+ grid = GridSearchCV(
256
+ estimator=multi,
257
+ param_grid=MODEL_CONFIG["param_grid"],
258
+ scoring="f1_micro",
259
+ cv=cv,
260
+ n_jobs=-1,
261
+ verbose=2,
262
+ refit=True,
263
+ )
264
+ return grid
265
+
266
+
267
+ def run_grid_search_lgb(X, Y):
268
+ base_lgb = lgb.LGBMClassifier(
269
+ random_state=TRAINING_CONFIG["random_state"], n_jobs=1, force_row_wise=True, verbose=-1
270
+ )
271
+ multi = MultiOutputClassifier(base_lgb, n_jobs=-1)
272
+ cv = KFold(
273
+ n_splits=TRAINING_CONFIG["cv_folds"],
274
+ shuffle=True,
275
+ random_state=TRAINING_CONFIG["random_state"],
276
+ )
277
+ lgb_param_grid = {
278
+ "estimator__n_estimators": [50, 100, 200],
279
+ "estimator__max_depth": [3, 5, 7],
280
+ "estimator__learning_rate": [0.1],
281
+ "estimator__num_leaves": [15],
282
+ }
283
+ grid = GridSearchCV(
284
+ estimator=multi,
285
+ param_grid=lgb_param_grid,
286
+ scoring="f1_micro",
287
+ cv=cv,
288
+ n_jobs=-1,
289
+ verbose=2,
290
+ refit=True,
291
+ )
292
+ return grid
293
+
294
+
295
+ # =====================================================
296
+ # Experiments (merged)
297
+ # =====================================================
298
+ def run_smote_experiment(X, Y, feature_type="tfidf"):
299
+ mlflow.set_experiment(MLFLOW_CONFIG["experiments"]["smote"])
300
+
301
+ # Split into train / val / test
302
+ X_train, X_val, X_test, Y_train, Y_val, Y_test = stratified_train_val_test_split(
303
+ X,
304
+ Y,
305
+ test_size=TRAINING_CONFIG.get("test_size", 0.2),
306
+ val_size=TRAINING_CONFIG.get("val_size", 0.1),
307
+ random_state=TRAINING_CONFIG["random_state"],
308
+ )
309
+ # Check label coverage and fail early if labels are missing from train+val
310
+ _check_label_coverage(Y_train, Y_val)
311
+
312
+ # Apply MLSMOTE (Multi-Label SMOTE) as per paper
313
+ # MLSMOTE handles multi-label classification natively by considering label correlations
314
+ print("Applying MLSMOTE (Multi-Label SMOTE) as per SkillScope paper...")
315
+ print(f" Original training set: {X_train.shape[0]} samples, {Y_train.shape[1]} labels")
316
+
317
+ # Use local MLSMOTE implementation directly (function-based)
318
+ if _HAS_LOCAL_MLSMOTE:
319
+ try:
320
+ # Set random seed
321
+ if TRAINING_CONFIG["random_state"] is not None:
322
+ np.random.seed(TRAINING_CONFIG["random_state"])
323
+ import random
324
+
325
+ random.seed(TRAINING_CONFIG["random_state"])
326
+
327
+ # Convert to DataFrame (MLSMOTE function expects DataFrames)
328
+ X_train_df = pd.DataFrame(X_train)
329
+ Y_train_df = pd.DataFrame(Y_train)
330
+
331
+ # Get minority instances
332
+ X_min, Y_min = get_minority_instace(X_train_df, Y_train_df)
333
+
334
+ if len(X_min) == 0:
335
+ print("No minority instances found, using original dataset")
336
+ X_res, Y_res = X_train, Y_train
337
+ oversampling_method = "None (no minority instances)"
338
+ n_new = 0
339
+ else:
340
+ # Calculate number of synthetic samples
341
+ label_counts = Y_train_df.sum(axis=0)
342
+ mean_count = int(label_counts.mean())
343
+ min_count = int(label_counts.min())
344
+ n_synthetic = max(100, int(mean_count - min_count))
345
+ n_synthetic = min(n_synthetic, len(X_min) * 3)
346
+
347
+ print(
348
+ f"Generating {n_synthetic} synthetic samples from {len(X_min)} minority instances"
349
+ )
350
+
351
+ # Apply MLSMOTE function directly
352
+ X_res_df, Y_res_df = mlsmote_function(X_min, Y_min, n_synthetic)
353
+
354
+ # Convert back to numpy
355
+ X_res = X_res_df.values
356
+ Y_res = Y_res_df.values.astype(int)
357
+
358
+ oversampling_method = "MLSMOTE (local implementation)"
359
+ n_new = len(X_res) - len(X_train)
360
+ print(
361
+ f"MLSMOTE completed: {n_new} synthetic samples generated. Total: {len(X_res)} samples"
362
+ )
363
+ except Exception as e:
364
+ print(f"MLSMOTE failed ({e}); falling back to RandomOverSampler")
365
+ Y_train_str = ["".join(map(str, y)) for y in Y_train]
366
+ ros = RandomOverSampler(random_state=TRAINING_CONFIG["random_state"])
367
+ X_res, Y_res_str = ros.fit_resample(X_train, Y_train_str)
368
+ Y_res = np.array([[int(c) for c in s] for s in Y_res_str])
369
+ oversampling_method = "RandomOverSampler (MLSMOTE fallback)"
370
+ n_new = len(X_res) - len(X_train)
371
+ else:
372
+ print("Local MLSMOTE not available; falling back to RandomOverSampler")
373
+ Y_train_str = ["".join(map(str, y)) for y in Y_train]
374
+ ros = RandomOverSampler(random_state=TRAINING_CONFIG["random_state"])
375
+ X_res, Y_res_str = ros.fit_resample(X_train, Y_train_str)
376
+ Y_res = np.array([[int(c) for c in s] for s in Y_res_str])
377
+ oversampling_method = "RandomOverSampler (no MLSMOTE)"
378
+ n_new = len(X_res) - len(X_train)
379
+
380
+ grid = run_grid_search(X_res, Y_res)
381
+ with mlflow.start_run(run_name="random_forest_with_smote"):
382
+ grid.fit(X_res, Y_res)
383
+
384
+ # Refit final model on train + val (use original non-oversampled data for final fit)
385
+ best_params = grid.best_params_
386
+ best_cv = grid.best_score_
387
+ final_model = grid.best_estimator_
388
+ X_comb = np.vstack([X_train, X_val]) if X_val.size else X_train
389
+ Y_comb = np.vstack([Y_train, Y_val]) if Y_val.size else Y_train
390
+ final_model.fit(X_comb, Y_comb)
391
+
392
+ evaluate_and_log(
393
+ final_model,
394
+ X_test,
395
+ Y_test,
396
+ best_params,
397
+ best_cv,
398
+ f"random_forest_{feature_type}_gridsearch_smote",
399
+ {
400
+ "oversampling": oversampling_method,
401
+ "synthetic_samples": n_new,
402
+ "n_labels": Y_train.shape[1],
403
+ },
404
+ )
405
+
406
+
407
+ def run_ros_experiment(X, Y):
408
+ mlflow.set_experiment(MLFLOW_CONFIG["experiments"]["ros"])
409
+
410
+ # Split into train / val / test
411
+ X_train, X_val, X_test, Y_train, Y_val, Y_test = stratified_train_val_test_split(
412
+ X,
413
+ Y,
414
+ test_size=TRAINING_CONFIG.get("test_size", 0.2),
415
+ val_size=TRAINING_CONFIG.get("val_size", 0.1),
416
+ random_state=TRAINING_CONFIG["random_state"],
417
+ )
418
+
419
+ Y_train_str = ["".join(map(str, y)) for y in Y_train]
420
+ ros = RandomOverSampler(random_state=TRAINING_CONFIG["random_state"])
421
+ X_res, Y_res_str = ros.fit_resample(X_train, Y_train_str)
422
+
423
+ Y.shape[1]
424
+ Y_res = np.array([[int(c) for c in s] for s in Y_res_str])
425
+
426
+ grid = run_grid_search(X_res, Y_res)
427
+ with mlflow.start_run(run_name="random_forest_with_ros"):
428
+ grid.fit(X_res, Y_res)
429
+
430
+ best_params = grid.best_params_
431
+ best_cv = grid.best_score_
432
+ final_model = grid.best_estimator_
433
+ X_comb = np.vstack([X_train, X_val]) if X_val.size else X_train
434
+ Y_comb = np.vstack([Y_train, Y_val]) if Y_val.size else Y_train
435
+ final_model.fit(X_comb, Y_comb)
436
+
437
+ evaluate_and_log(
438
+ final_model,
439
+ X_test,
440
+ Y_test,
441
+ best_params,
442
+ best_cv,
443
+ "random_forest_tfidf_gridsearch_ros",
444
+ {"oversampling": "RandomOverSampler"},
445
+ )
446
+
447
+
448
+ def run_adasyn_pca_experiment(X, Y):
449
+ mlflow.set_experiment(MLFLOW_CONFIG["experiments"]["adasyn_pca"])
450
+
451
+ # Split into train / val / test
452
+ X_train, X_val, X_test, Y_train, Y_val, Y_test = stratified_train_val_test_split(
453
+ X,
454
+ Y,
455
+ test_size=TRAINING_CONFIG.get("test_size", 0.2),
456
+ val_size=TRAINING_CONFIG.get("val_size", 0.1),
457
+ random_state=TRAINING_CONFIG["random_state"],
458
+ )
459
+
460
+ print("Applying PCA before ADASYN...")
461
+ pca = PCA(
462
+ n_components=PCA_CONFIG["variance_retained"], random_state=TRAINING_CONFIG["random_state"]
463
+ )
464
+ X_train_pca = pca.fit_transform(X_train)
465
+
466
+ adasyn = ADASYN(
467
+ random_state=TRAINING_CONFIG["random_state"],
468
+ n_neighbors=ADASYN_CONFIG["n_neighbors"],
469
+ sampling_strategy=ADASYN_CONFIG["sampling_strategy"],
470
+ )
471
+
472
+ valid_label_idx = next(
473
+ (i for i in range(Y_train.shape[1]) if len(np.unique(Y_train[:, i])) > 1), None
474
+ )
475
+
476
+ if valid_label_idx is None:
477
+ X_res, Y_res = X_train, Y_train
478
+ n_new = 0
479
+ else:
480
+ X_res_pca, _ = adasyn.fit_resample(X_train_pca, Y_train[:, valid_label_idx])
481
+ X_res = pca.inverse_transform(X_res_pca)
482
+ n_new = len(X_res) - len(X_train)
483
+ Y_res = np.vstack([Y_train, Y_train[np.random.randint(0, len(Y_train), n_new)]])
484
+
485
+ grid = run_grid_search(X_res, Y_res)
486
+ with mlflow.start_run(run_name="random_forest_with_adasyn_pca"):
487
+ grid.fit(X_res, Y_res)
488
+
489
+ best_params = grid.best_params_
490
+ best_cv = grid.best_score_
491
+ final_model = grid.best_estimator_
492
+ X_comb = np.vstack([X_train, X_val]) if X_val.size else X_train
493
+ Y_comb = np.vstack([Y_train, Y_val]) if Y_val.size else Y_train
494
+ final_model.fit(X_comb, Y_comb)
495
+
496
+ evaluate_and_log(
497
+ final_model,
498
+ X_test,
499
+ Y_test,
500
+ best_params,
501
+ best_cv,
502
+ "random_forest_tfidf_gridsearch_adasyn_pca",
503
+ {
504
+ "oversampling": "ADASYN + PCA",
505
+ "pca_variance": PCA_CONFIG["variance_retained"],
506
+ "synthetic_samples": n_new,
507
+ },
508
+ )
509
+ pca_path = Path(DATA_PATHS["models_dir"]) / "pca_tfidf_adasyn.pkl"
510
+ joblib.dump(pca, pca_path)
511
+ mlflow.log_artifact(str(pca_path), artifact_path="model_adasyn_pca")
512
+
513
+
514
+ def run_lightgbm(X, Y):
515
+ mlflow.set_experiment(MLFLOW_CONFIG["experiments"].get("lightgbm", "LightGBM"))
516
+
517
+ # Split into train / val / test
518
+ X_train, X_val, X_test, Y_train, Y_val, Y_test = stratified_train_val_test_split(
519
+ X,
520
+ Y,
521
+ test_size=TRAINING_CONFIG.get("test_size", 0.2),
522
+ val_size=TRAINING_CONFIG.get("val_size", 0.1),
523
+ random_state=TRAINING_CONFIG["random_state"],
524
+ )
525
+
526
+ print("\nTraining LightGBM with GridSearchCV...")
527
+ grid = run_grid_search_lgb(X_train, Y_train)
528
+
529
+ with mlflow.start_run(run_name="lightgbm"):
530
+ grid.fit(X_train, Y_train)
531
+
532
+ best_params = grid.best_params_
533
+ best_cv = grid.best_score_
534
+ final_model = grid.best_estimator_
535
+ X_comb = np.vstack([X_train, X_val]) if X_val.size else X_train
536
+ Y_comb = np.vstack([Y_train, Y_val]) if Y_val.size else Y_train
537
+ final_model.fit(X_comb, Y_comb)
538
+
539
+ evaluate_and_log(
540
+ final_model,
541
+ X_test,
542
+ Y_test,
543
+ best_params,
544
+ best_cv,
545
+ "lightgbm_tfidf_gridsearch",
546
+ {"oversampling": "None", "model": "LightGBM"},
547
+ )
548
+
549
+
550
+ def run_lightgbm_smote_experiment(X, Y):
551
+ mlflow.set_experiment(MLFLOW_CONFIG["experiments"].get("lightgbm_smote", "LightGBM_SMOTE"))
552
+
553
+ # Split into train / val / test
554
+ X_train, X_val, X_test, Y_train, Y_val, Y_test = stratified_train_val_test_split(
555
+ X,
556
+ Y,
557
+ test_size=TRAINING_CONFIG.get("test_size", 0.2),
558
+ val_size=TRAINING_CONFIG.get("val_size", 0.1),
559
+ random_state=TRAINING_CONFIG["random_state"],
560
+ )
561
+
562
+ # Apply MLSMOTE (Multi-Label SMOTE) as per paper
563
+ print(" Applying MLSMOTE for LightGBM...")
564
+ print(f" Original training set: {X_train.shape[0]} samples, {Y_train.shape[1]} labels")
565
+
566
+ # Use local MLSMOTE implementation directly (function-based)
567
+ if _HAS_LOCAL_MLSMOTE:
568
+ try:
569
+ # Set random seed
570
+ if TRAINING_CONFIG["random_state"] is not None:
571
+ np.random.seed(TRAINING_CONFIG["random_state"])
572
+ import random
573
+
574
+ random.seed(TRAINING_CONFIG["random_state"])
575
+
576
+ # Convert to DataFrame (MLSMOTE function expects DataFrames)
577
+ X_train_df = pd.DataFrame(X_train)
578
+ Y_train_df = pd.DataFrame(Y_train)
579
+
580
+ # Get minority instances
581
+ X_min, Y_min = get_minority_instace(X_train_df, Y_train_df)
582
+
583
+ if len(X_min) == 0:
584
+ print("No minority instances found, using original dataset")
585
+ X_res, Y_res = X_train, Y_train
586
+ oversampling_method = "None (no minority instances)"
587
+ n_new = 0
588
+ else:
589
+ # Calculate number of synthetic samples
590
+ label_counts = Y_train_df.sum(axis=0)
591
+ mean_count = int(label_counts.mean())
592
+ min_count = int(label_counts.min())
593
+ n_synthetic = max(100, int(mean_count - min_count))
594
+ n_synthetic = min(n_synthetic, len(X_min) * 3)
595
+
596
+ print(
597
+ f"Generating {n_synthetic} synthetic samples from {len(X_min)} minority instances"
598
+ )
599
+
600
+ # Apply MLSMOTE function directly
601
+ X_res_df, Y_res_df = mlsmote_function(X_min, Y_min, n_synthetic)
602
+
603
+ # Convert back to numpy
604
+ X_res = X_res_df.values
605
+ Y_res = Y_res_df.values.astype(int)
606
+
607
+ oversampling_method = "MLSMOTE (local implementation)"
608
+ n_new = len(X_res) - len(X_train)
609
+ print(
610
+ f"MLSMOTE completed: {n_new} synthetic samples generated. Total: {len(X_res)} samples"
611
+ )
612
+ except Exception as e:
613
+ print(f"MLSMOTE failed ({e}); falling back to RandomOverSampler")
614
+ Y_train_str = ["".join(map(str, y)) for y in Y_train]
615
+ ros = RandomOverSampler(random_state=TRAINING_CONFIG["random_state"])
616
+ X_res, Y_res_str = ros.fit_resample(X_train, Y_train_str)
617
+ Y_res = np.array([[int(c) for c in s] for s in Y_res_str])
618
+ oversampling_method = "RandomOverSampler (MLSMOTE fallback)"
619
+ n_new = len(X_res) - len(X_train)
620
+ else:
621
+ print(" Local MLSMOTE not available; falling back to RandomOverSampler")
622
+ Y_train_str = ["".join(map(str, y)) for y in Y_train]
623
+ ros = RandomOverSampler(random_state=TRAINING_CONFIG["random_state"])
624
+ X_res, Y_res_str = ros.fit_resample(X_train, Y_train_str)
625
+ Y_res = np.array([[int(c) for c in s] for s in Y_res_str])
626
+ oversampling_method = "RandomOverSampler (no MLSMOTE)"
627
+ n_new = len(X_res) - len(X_train)
628
+
629
+ print(f"\n Training LightGBM with {oversampling_method} ({n_new} synthetic samples)...")
630
+ grid = run_grid_search_lgb(X_res, Y_res)
631
+
632
+ with mlflow.start_run(run_name="lightgbm_with_smote"):
633
+ grid.fit(X_res, Y_res)
634
+
635
+ best_params = grid.best_params_
636
+ best_cv = grid.best_score_
637
+ final_model = grid.best_estimator_
638
+ X_comb = np.vstack([X_train, X_val]) if X_val.size else X_train
639
+ Y_comb = np.vstack([Y_train, Y_val]) if Y_val.size else Y_train
640
+ final_model.fit(X_comb, Y_comb)
641
+
642
+ evaluate_and_log(
643
+ final_model,
644
+ X_test,
645
+ Y_test,
646
+ best_params,
647
+ best_cv,
648
+ "lightgbm_tfidf_gridsearch_smote",
649
+ {
650
+ "oversampling": oversampling_method,
651
+ "synthetic_samples": n_new,
652
+ "n_labels": Y_train.shape[1],
653
+ "model": "LightGBM",
654
+ },
655
+ )
656
+
657
+
658
+ # =====================================================
659
+ # Baseline training (original train.py behavior)
660
+ # =====================================================
661
+ def run_baseline_train(feature_type="tfidf", use_cleaned=True):
662
+ """Run baseline training with configurable feature type.
663
+
664
+ Args:
665
+ feature_type: 'tfidf' or 'embedding'
666
+ use_cleaned: whether to use cleaned data
667
+ """
668
+ mlflow.set_experiment(
669
+ MLFLOW_CONFIG.get("experiments", {}).get("baseline", "hopcroft_random_forest_baseline")
670
+ )
671
+
672
+ X, Y = load_data(feature_type=feature_type, use_cleaned=use_cleaned)
673
+
674
+ # Use 80/20 split as per SkillScope paper (no validation set for baseline)
675
+ print(" Using 80/20 train/test split as per paper...")
676
+ X_train, X_test, Y_train, Y_test = stratified_train_test_split(
677
+ X,
678
+ Y,
679
+ test_size=TRAINING_CONFIG.get("test_size", 0.2),
680
+ random_state=TRAINING_CONFIG.get("random_state", 42),
681
+ )
682
+
683
+ # Remove labels that have 0 occurrences in training set (after split)
684
+ train_counts = np.sum(Y_train, axis=0).astype(int)
685
+ zero_in_train = np.where(train_counts == 0)[0]
686
+
687
+ if zero_in_train.size > 0:
688
+ kept_idx = np.where(train_counts > 0)[0]
689
+ print(
690
+ f"[warning] Removing {zero_in_train.size} label(s) with 0 occurrences in TRAIN set. Example removed indices: {zero_in_train[:10].tolist()}"
691
+ )
692
+ Y_train = Y_train[:, kept_idx]
693
+ Y_test = Y_test[:, kept_idx]
694
+
695
+ # Save kept indices for inference
696
+ paths = get_feature_paths(feature_type=feature_type, use_cleaned=use_cleaned)
697
+ kept_indices_path = Path(paths["features"]).parent / "kept_label_indices.npy"
698
+ np.save(kept_indices_path, kept_idx)
699
+ print(f"Saved kept label indices to {kept_indices_path}")
700
+
701
+ # Now check label coverage (should pass since we removed zero-occurrence labels)
702
+ _check_label_coverage(Y_train, np.empty((0, Y_train.shape[1])))
703
+
704
+ base_rf = RandomForestClassifier(
705
+ random_state=TRAINING_CONFIG.get("random_state", 42), n_jobs=-1
706
+ )
707
+ multi = MultiOutputClassifier(base_rf)
708
+
709
+ # Use full param_grid from MODEL_CONFIG for optimal results as per paper
710
+ param_grid = MODEL_CONFIG.get(
711
+ "param_grid",
712
+ {
713
+ "estimator__n_estimators": [50, 100, 200],
714
+ "estimator__max_depth": [10, 20, 30],
715
+ "estimator__min_samples_split": [2, 5],
716
+ },
717
+ )
718
+
719
+ cv = KFold(
720
+ n_splits=TRAINING_CONFIG.get("cv_folds", 5),
721
+ shuffle=True,
722
+ random_state=TRAINING_CONFIG.get("random_state", 42),
723
+ )
724
+
725
+ print(
726
+ f" GridSearch with {cv.n_splits} folds and {len(param_grid['estimator__n_estimators']) * len(param_grid['estimator__max_depth']) * len(param_grid['estimator__min_samples_split'])} combinations..."
727
+ )
728
+
729
+ grid = GridSearchCV(
730
+ estimator=multi,
731
+ param_grid=param_grid,
732
+ scoring="f1_micro",
733
+ cv=cv,
734
+ n_jobs=-1,
735
+ verbose=2,
736
+ refit=True,
737
+ )
738
+
739
+ with mlflow.start_run(run_name="random_forest_tfidf_gridsearch"):
740
+ grid.fit(X_train, Y_train)
741
+
742
+ best = grid.best_estimator_
743
+ best_params = grid.best_params_
744
+ best_cv_score = grid.best_score_
745
+
746
+ # No need to refit on combined train+val since we don't have a val set
747
+ # Model is already fitted on full training data
748
+
749
+ Y_pred_test = best.predict(X_test)
750
+
751
+ precision = precision_score(Y_test, Y_pred_test, average="micro", zero_division=0)
752
+ recall = recall_score(Y_test, Y_pred_test, average="micro", zero_division=0)
753
+ f1 = f1_score(Y_test, Y_pred_test, average="micro", zero_division=0)
754
+
755
+ mlflow.log_param("model_type", "RandomForest + MultiOutput")
756
+ for k, v in best_params.items():
757
+ mlflow.log_param(k, v)
758
+ mlflow.log_metric("cv_best_f1_micro", best_cv_score)
759
+
760
+ mlflow.log_metric("test_precision_micro", precision)
761
+ mlflow.log_metric("test_recall_micro", recall)
762
+ mlflow.log_metric("test_f1_micro", f1)
763
+ mlflow.log_param("feature_type", feature_type)
764
+ mlflow.log_param("use_cleaned", use_cleaned)
765
+
766
+ print("\n=== Training Results ===")
767
+ print(f"Test Precision (Micro): {precision:.4f}")
768
+ print(f"Test Recall (Micro): {recall:.4f}")
769
+ print(f"Test F1 Score (Micro): {f1:.4f}")
770
+ print("========================\n")
771
+
772
+ paths = get_feature_paths(feature_type=feature_type, use_cleaned=use_cleaned)
773
+ os.makedirs(paths["models_dir"], exist_ok=True)
774
+
775
+ model_path = Path(paths["models_dir"]) / f"random_forest_{feature_type}_gridsearch.pkl"
776
+ joblib.dump(best, model_path)
777
+
778
+ np.save(Path(paths["features"]).parent / "X_test.npy", X_test)
779
+ np.save(Path(paths["labels"]).parent / "Y_test.npy", Y_test)
780
+
781
+ mlflow.sklearn.log_model(best, "model")
782
+
783
+ print("Grid search training completed and logged successfully.")
784
+
785
+
786
+ # =====================================================
787
+ # Inference utility (merged from predict.py)
788
+ # =====================================================
789
+ def run_inference(model_path: str = None):
790
+ mlflow.set_experiment(
791
+ MLFLOW_CONFIG.get("experiments", {}).get("inference", "hopcroft_random_forest_inference")
792
+ )
793
+
794
+ if model_path is None:
795
+ model_path = Path(DATA_PATHS["models_dir"]) / "random_forest_tfidf_gridsearch.pkl"
796
+ else:
797
+ model_path = Path(model_path)
798
+
799
+ model = joblib.load(str(model_path))
800
+
801
+ X_test = np.load(Path(DATA_PATHS["features"]).parent / "X_test.npy")
802
+ Y_test = np.load(Path(DATA_PATHS["labels"]).parent / "Y_test.npy")
803
+
804
+ with mlflow.start_run(run_name="random_forest_tfidf_inference"):
805
+ Y_pred = model.predict(X_test)
806
+
807
+ precision = precision_score(Y_test, Y_pred, average="micro", zero_division=0)
808
+ recall = recall_score(Y_test, Y_pred, average="micro", zero_division=0)
809
+ f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
810
+
811
+ mlflow.log_metric("test_precision_micro", precision)
812
+ mlflow.log_metric("test_recall_micro", recall)
813
+ mlflow.log_metric("test_f1_micro", f1)
814
+
815
+ print(f"Inference completed — Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
816
+
817
+
818
+ def _parse_args():
819
+ p = argparse.ArgumentParser(description="Unified training & experiments script")
820
+ p.add_argument(
821
+ "action",
822
+ choices=[
823
+ "baseline",
824
+ "smote",
825
+ "ros",
826
+ "adasyn_pca",
827
+ "lightgbm",
828
+ "lightgbm_smote",
829
+ "predict",
830
+ ],
831
+ help="Action to run",
832
+ )
833
+ p.add_argument("--model-path", help="Custom model path for inference")
834
+ return p.parse_args()
835
+
836
+
837
+ if __name__ == "__main__":
838
+ args = _parse_args()
839
+
840
+ # Baseline has its own load_data logic (removes rare labels after split)
841
+ if args.action == "baseline":
842
+ run_baseline_train(feature_type="tfidf", use_cleaned=True)
843
+ else:
844
+ # Other experiments use the original load_data() logic
845
+ X, Y = load_data(feature_type="tfidf", use_cleaned=True)
846
+
847
+ if args.action == "smote":
848
+ run_smote_experiment(X, Y)
849
+ elif args.action == "ros":
850
+ run_ros_experiment(X, Y)
851
+ elif args.action == "adasyn_pca":
852
+ run_adasyn_pca_experiment(X, Y)
853
+ elif args.action == "lightgbm":
854
+ run_lightgbm(X, Y)
855
+ elif args.action == "lightgbm_smote":
856
+ run_lightgbm_smote_experiment(X, Y)
857
+ elif args.action == "predict":
858
+ run_inference(args.model_path)
hopcroft_skill_classification_tool_competition/streamlit_app.py ADDED
@@ -0,0 +1,322 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from typing import Dict, List
3
+
4
+ import pandas as pd
5
+ import requests
6
+ import streamlit as st
7
+
8
+ API_BASE_URL = os.getenv("API_BASE_URL", "http://localhost:8000")
9
+
10
+ # Page config
11
+ st.set_page_config(
12
+ page_title="GitHub Skill Classifier", layout="wide", initial_sidebar_state="expanded"
13
+ )
14
+
15
+ st.markdown(
16
+ """
17
+ <style>
18
+ .main-header {
19
+ font-size: 2.5rem;
20
+ color: #1f77b4;
21
+ text-align: center;
22
+ margin-bottom: 2rem;
23
+ }
24
+ .skill-card {
25
+ padding: 1rem;
26
+ border-radius: 0.5rem;
27
+ border-left: 4px solid #1f77b4;
28
+ background-color: #f0f2f6;
29
+ margin-bottom: 0.5rem;
30
+ }
31
+ .confidence-high {
32
+ color: #28a745;
33
+ font-weight: bold;
34
+ }
35
+ .confidence-medium {
36
+ color: #ffc107;
37
+ font-weight: bold;
38
+ }
39
+ .confidence-low {
40
+ color: #dc3545;
41
+ font-weight: bold;
42
+ }
43
+ </style>
44
+ """,
45
+ unsafe_allow_html=True,
46
+ )
47
+
48
+
49
+ def check_api_health() -> bool:
50
+ """Check if the API is running and healthy."""
51
+ try:
52
+ response = requests.get(f"{API_BASE_URL}/health", timeout=2)
53
+ return response.status_code == 200
54
+ except Exception:
55
+ return False
56
+
57
+
58
+ def predict_skills(
59
+ issue_text: str, issue_description: str = None, repo_name: str = None, pr_number: int = None
60
+ ) -> Dict:
61
+ """Call the prediction API."""
62
+ payload = {"issue_text": issue_text}
63
+
64
+ if issue_description:
65
+ payload["issue_description"] = issue_description
66
+ if repo_name:
67
+ payload["repo_name"] = repo_name
68
+ if pr_number:
69
+ payload["pr_number"] = pr_number
70
+
71
+ try:
72
+ response = requests.post(f"{API_BASE_URL}/predict", json=payload, timeout=30)
73
+ response.raise_for_status()
74
+ return response.json()
75
+ except requests.exceptions.RequestException as e:
76
+ st.error(f"API Error: {str(e)}")
77
+ return None
78
+
79
+
80
+ def display_predictions(predictions: List[Dict], threshold: float = 0.5):
81
+ """Display predictions with visual formatting."""
82
+
83
+ # Filter by threshold
84
+ filtered = [p for p in predictions if p["confidence"] >= threshold]
85
+
86
+ if not filtered:
87
+ st.warning(f"No predictions above confidence threshold {threshold:.2f}")
88
+ return
89
+
90
+ st.success(f"Found {len(filtered)} skills above threshold {threshold:.2f}")
91
+
92
+ # Create DataFrame for table view
93
+ df = pd.DataFrame(filtered)
94
+ df["confidence"] = df["confidence"].apply(lambda x: f"{x:.2%}")
95
+
96
+ col1, col2 = st.columns([2, 1])
97
+
98
+ with col1:
99
+ st.subheader("Predictions Table")
100
+ st.dataframe(
101
+ df,
102
+ use_container_width=True,
103
+ hide_index=True,
104
+ column_config={
105
+ "skill_name": st.column_config.TextColumn("Skill", width="large"),
106
+ "confidence": st.column_config.TextColumn("Confidence", width="medium"),
107
+ },
108
+ )
109
+
110
+ with col2:
111
+ st.subheader("Top 5 Skills")
112
+ for i, pred in enumerate(filtered[:5], 1):
113
+ confidence = pred["confidence"]
114
+
115
+ if confidence >= 0.8:
116
+ conf_class = "confidence-high"
117
+ elif confidence >= 0.5:
118
+ conf_class = "confidence-medium"
119
+ else:
120
+ conf_class = "confidence-low"
121
+
122
+ st.markdown(
123
+ f"""
124
+ <div class="skill-card">
125
+ <strong>#{i} {pred["skill_name"]}</strong><br>
126
+ <span class="{conf_class}">{confidence:.2%}</span>
127
+ </div>
128
+ """,
129
+ unsafe_allow_html=True,
130
+ )
131
+
132
+
133
+ def main():
134
+ """Main Streamlit app."""
135
+
136
+ if "example_text" not in st.session_state:
137
+ st.session_state.example_text = ""
138
+
139
+ # Header
140
+ st.markdown('<h1 class="main-header"> GitHub Skill Classifier</h1>', unsafe_allow_html=True)
141
+
142
+ st.markdown("""
143
+ This tool uses machine learning to predict the skills required for GitHub issues and pull requests.
144
+ Enter the issue text below to get started!
145
+ """)
146
+
147
+ # Sidebar
148
+ with st.sidebar:
149
+ st.header("Settings")
150
+
151
+ # API Status
152
+ st.subheader("API Status")
153
+ if check_api_health():
154
+ st.success(" API is running")
155
+ else:
156
+ st.error(" API is not available")
157
+ st.info(f"Make sure FastAPI is running at {API_BASE_URL}")
158
+ st.code("fastapi dev hopcroft_skill_classification_tool_competition/main.py")
159
+
160
+ # Confidence threshold
161
+ threshold = st.slider(
162
+ "Confidence Threshold",
163
+ min_value=0.0,
164
+ max_value=1.0,
165
+ value=0.5,
166
+ step=0.05,
167
+ help="Only show predictions above this confidence level",
168
+ )
169
+
170
+ # Model info
171
+ st.subheader("Model Info")
172
+ try:
173
+ health = requests.get(f"{API_BASE_URL}/health", timeout=2).json()
174
+ st.metric("Version", health.get("version", "N/A"))
175
+ st.metric("Model Loaded", "" if health.get("model_loaded") else "")
176
+ except Exception:
177
+ st.info("API not available")
178
+
179
+ # Main
180
+ st.header("Input")
181
+
182
+ # Tabs for different input modes
183
+ tab1, tab2, tab3 = st.tabs(["Quick Input", "Detailed Input", "Examples"])
184
+
185
+ with tab1:
186
+ issue_text = st.text_area(
187
+ "Issue/PR Text",
188
+ height=150,
189
+ placeholder="Enter the issue or pull request text here...",
190
+ help="Required: The main text of the GitHub issue or PR",
191
+ value=st.session_state.example_text,
192
+ )
193
+
194
+ if st.button("Predict Skills", type="primary", use_container_width=True):
195
+ if not issue_text.strip():
196
+ st.error("Please enter some text!")
197
+ else:
198
+ st.session_state.example_text = ""
199
+ with st.spinner("Analyzing issue..."):
200
+ result = predict_skills(issue_text)
201
+
202
+ if result:
203
+ st.header("Results")
204
+
205
+ # Metadata
206
+ col1, col2, col3 = st.columns(3)
207
+ with col1:
208
+ st.metric("Total Predictions", result.get("num_predictions", 0))
209
+ with col2:
210
+ st.metric(
211
+ "Processing Time", f"{result.get('processing_time_ms', 0):.2f} ms"
212
+ )
213
+ with col3:
214
+ st.metric("Model Version", result.get("model_version", "N/A"))
215
+
216
+ # Predictions
217
+ st.divider()
218
+ display_predictions(result.get("predictions", []), threshold)
219
+
220
+ # Raw JSON
221
+ with st.expander("🔍 View Raw Response"):
222
+ st.json(result)
223
+
224
+ with tab2:
225
+ col1, col2 = st.columns(2)
226
+
227
+ with col1:
228
+ issue_text_detailed = st.text_area(
229
+ "Issue Title/Text*",
230
+ height=100,
231
+ placeholder="e.g., Fix authentication bug in login module",
232
+ key="issue_text_detailed",
233
+ )
234
+
235
+ issue_description = st.text_area(
236
+ "Issue Description",
237
+ height=100,
238
+ placeholder="Optional: Detailed description of the issue",
239
+ key="issue_description",
240
+ )
241
+
242
+ with col2:
243
+ repo_name = st.text_input(
244
+ "Repository Name",
245
+ placeholder="e.g., owner/repository",
246
+ help="Optional: GitHub repository name",
247
+ )
248
+
249
+ pr_number = st.number_input(
250
+ "PR Number",
251
+ min_value=0,
252
+ value=0,
253
+ help="Optional: Pull request number (0 = not a PR)",
254
+ )
255
+
256
+ if st.button("Predict Skills (Detailed)", type="primary", use_container_width=True):
257
+ if not issue_text_detailed.strip():
258
+ st.error("Issue text is required!")
259
+ else:
260
+ with st.spinner("Analyzing issue..."):
261
+ result = predict_skills(
262
+ issue_text_detailed,
263
+ issue_description if issue_description else None,
264
+ repo_name if repo_name else None,
265
+ pr_number if pr_number > 0 else None,
266
+ )
267
+
268
+ if result:
269
+ st.header("Results")
270
+
271
+ # Metadata
272
+ col1, col2, col3 = st.columns(3)
273
+ with col1:
274
+ st.metric("Total Predictions", result.get("num_predictions", 0))
275
+ with col2:
276
+ st.metric(
277
+ "Processing Time", f"{result.get('processing_time_ms', 0):.2f} ms"
278
+ )
279
+ with col3:
280
+ st.metric("Model Version", result.get("model_version", "N/A"))
281
+
282
+ st.divider()
283
+ display_predictions(result.get("predictions", []), threshold)
284
+
285
+ with st.expander("🔍 View Raw Response"):
286
+ st.json(result)
287
+
288
+ with tab3:
289
+ st.markdown("### Example Issues")
290
+
291
+ examples = [
292
+ {
293
+ "title": "Authentication Bug",
294
+ "text": "Fix authentication bug in login module. Users cannot login with OAuth providers.",
295
+ },
296
+ {
297
+ "title": "Machine Learning Feature",
298
+ "text": "Implement transfer learning with transformers for text classification using PyTorch and TensorFlow.",
299
+ },
300
+ {
301
+ "title": "Database Issue",
302
+ "text": "Fix database connection pooling issue causing memory leaks in production environment.",
303
+ },
304
+ {
305
+ "title": "UI Enhancement",
306
+ "text": "Add responsive design support for mobile devices with CSS media queries and flexbox layout.",
307
+ },
308
+ ]
309
+
310
+ for i, example in enumerate(examples):
311
+ if st.button(example["title"], use_container_width=True, key=f"example_btn_{i}"):
312
+ st.session_state.example_text = example["text"]
313
+ st.rerun()
314
+
315
+ if st.session_state.example_text:
316
+ st.success(" Example loaded! Switch to 'Quick Input' tab to use it.")
317
+ with st.expander("Preview"):
318
+ st.code(st.session_state.example_text)
319
+
320
+
321
+ if __name__ == "__main__":
322
+ main()
hopcroft_skill_classification_tool_competition/threshold_optimization.py ADDED
@@ -0,0 +1,295 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Threshold Optimization for Multi-Label Classification
3
+
4
+ This module provides functions to optimize decision thresholds for multi-label
5
+ classification tasks to maximize F1-score (or other metrics).
6
+
7
+ In multi-label classification, the default threshold of 0.5 for converting
8
+ probabilities to binary predictions is often suboptimal, especially for
9
+ imbalanced classes. This module finds optimal thresholds per-class or globally.
10
+
11
+ Designed to work with Random Forest (baseline and improved models).
12
+
13
+ Usage:
14
+ from threshold_optimization import optimize_thresholds, apply_thresholds
15
+ from sklearn.ensemble import RandomForestClassifier
16
+
17
+ # Train Random Forest
18
+ model = RandomForestClassifier(n_estimators=100)
19
+ model.fit(X_train, y_train)
20
+
21
+ # Get probability predictions
22
+ y_proba = model.predict_proba(X_val)
23
+
24
+ # Find optimal thresholds on validation set
25
+ thresholds = optimize_thresholds(y_val, y_proba, method='per_class')
26
+
27
+ # Apply thresholds to test set
28
+ y_pred = apply_thresholds(model.predict_proba(X_test), thresholds)
29
+ """
30
+
31
+ from typing import Dict, Tuple, Union
32
+ import warnings
33
+
34
+ import numpy as np
35
+ from sklearn.metrics import f1_score
36
+
37
+
38
def optimize_thresholds(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    method: str = "per_class",
    metric: str = "f1_weighted",
    search_range: Tuple[float, float] = (0.1, 0.9),
    n_steps: int = 50,
) -> Union[float, np.ndarray]:
    """
    Search for the decision threshold(s) that maximize the requested metric.

    Converts probability outputs to binary multi-label predictions by
    scanning `n_steps` evenly spaced candidate thresholds inside
    `search_range` and keeping the best-scoring one(s).

    Args:
        y_true: True binary labels, shape (n_samples, n_labels).
        y_proba: Predicted probabilities, shape (n_samples, n_labels).
        method: 'global' for a single shared threshold, or 'per_class'
            (default) for one threshold per label.
        metric: Metric optimized by the global search
            ('f1_weighted', 'f1_macro', 'f1_micro'). Note the per-class
            search scores each label with binary F1 regardless of this value.
        search_range: (min, max) bounds of the threshold scan.
        n_steps: Number of candidate thresholds to evaluate.

    Returns:
        A single float when method='global'; otherwise an array with one
        threshold per class.

    Raises:
        ValueError: If y_true and y_proba shapes differ, or method is unknown.

    Example:
        >>> y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
        >>> y_proba = np.array([[0.9, 0.3, 0.7], [0.2, 0.8, 0.4], [0.85, 0.6, 0.3]])
        >>> thresholds = optimize_thresholds(y_true, y_proba, method='per_class')
    """
    # Both matrices must align sample-by-sample and label-by-label.
    if y_true.shape != y_proba.shape:
        raise ValueError(f"Shape mismatch: y_true {y_true.shape} vs y_proba {y_proba.shape}")

    # Validate the strategy up front, then dispatch to the matching search.
    if method not in ("global", "per_class"):
        raise ValueError(f"Invalid method: {method}. Must be 'global' or 'per_class'")

    search = _optimize_global_threshold if method == "global" else _optimize_per_class_thresholds
    return search(y_true, y_proba, metric, search_range, n_steps)
82
+
83
+
84
def _optimize_global_threshold(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    metric: str,
    search_range: Tuple[float, float],
    n_steps: int,
) -> float:
    """
    Find one optimal threshold shared by every class.

    Faster but less flexible than per-class optimization; most useful when
    the classes have similar probability distributions.
    """
    # Keep the first candidate that achieves the best score (strict '>').
    best_threshold, best_score = 0.5, -np.inf
    for candidate in np.linspace(search_range[0], search_range[1], n_steps):
        trial_score = _compute_score(y_true, (y_proba >= candidate).astype(int), metric)
        if trial_score > best_score:
            best_threshold, best_score = candidate, trial_score

    print(f"Optimal global threshold: {best_threshold:.3f} (score: {best_score:.4f})")
    return best_threshold
111
+
112
+
113
def _optimize_per_class_thresholds(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    metric: str,
    search_range: Tuple[float, float],
    n_steps: int,
) -> np.ndarray:
    """
    Find an optimal threshold for each class independently.

    More flexible than a single global threshold and typically better on
    imbalanced multi-label problems, at the cost of extra computation.
    Note: each class is scored with *binary* F1 here, regardless of `metric`.
    """
    candidates = np.linspace(search_range[0], search_range[1], n_steps)
    n_classes = y_true.shape[1]
    optimal_thresholds = np.zeros(n_classes)

    print(f"Optimizing thresholds for {n_classes} classes...")

    def _best_for_class(truth: np.ndarray, proba: np.ndarray) -> float:
        # Scan candidates, keeping the first one with the highest binary F1.
        winner, winner_score = 0.5, -np.inf
        for candidate in candidates:
            try:
                trial = f1_score(
                    truth, (proba >= candidate).astype(int), average="binary", zero_division=0
                )
            except Exception:
                continue
            if trial > winner_score:
                winner, winner_score = candidate, trial
        return winner

    for class_idx in range(n_classes):
        truth = y_true[:, class_idx]

        # A class with no positive samples cannot be optimized; keep 0.5.
        if truth.sum() == 0:
            optimal_thresholds[class_idx] = 0.5
            warnings.warn(
                f"Class {class_idx} has no positive samples, using default threshold 0.5"
            )
            continue

        optimal_thresholds[class_idx] = _best_for_class(truth, y_proba[:, class_idx])

    print(
        f"Threshold statistics: min={optimal_thresholds.min():.3f}, "
        f"max={optimal_thresholds.max():.3f}, mean={optimal_thresholds.mean():.3f}"
    )

    return optimal_thresholds
168
+
169
+
170
def _compute_score(y_true: np.ndarray, y_pred: np.ndarray, metric: str) -> float:
    """Return the requested F1 variant for binary predictions, or raise ValueError."""
    # Map metric names to sklearn averaging modes.
    averaging = {
        "f1_weighted": "weighted",
        "f1_macro": "macro",
        "f1_micro": "micro",
    }
    if metric not in averaging:
        raise ValueError(f"Unsupported metric: {metric}")
    return f1_score(y_true, y_pred, average=averaging[metric], zero_division=0)
180
+
181
+
182
def apply_thresholds(y_proba: np.ndarray, thresholds: Union[float, np.ndarray]) -> np.ndarray:
    """
    Apply thresholds to probability predictions to get binary predictions.

    Args:
        y_proba: Predicted probabilities, shape (n_samples, n_labels)
        thresholds: Threshold(s) to apply:
            - Single numeric scalar (int, float, or NumPy scalar): same
              threshold for all classes
            - Array-like (ndarray, list, tuple): one threshold per class

    Returns:
        Binary predictions (0/1 ints), shape (n_samples, n_labels)

    Raises:
        ValueError: If the number of per-class thresholds does not match
            the number of classes in y_proba.

    Example:
        >>> y_proba = np.array([[0.9, 0.3, 0.7], [0.2, 0.8, 0.4]])
        >>> thresholds = np.array([0.5, 0.4, 0.6])
        >>> y_pred = apply_thresholds(y_proba, thresholds)
        >>> print(y_pred)
        [[1 0 1]
         [0 1 0]]
    """
    # Accept any numeric scalar (the original `isinstance(..., float)` check
    # rejected ints and 0-d arrays, which then crashed on len()/indexing).
    if np.isscalar(thresholds) or np.ndim(thresholds) == 0:
        return (y_proba >= thresholds).astype(int)

    # Per-class thresholds: coerce lists/tuples to an array so the
    # broadcasting index below works for any array-like input.
    thresholds = np.asarray(thresholds)
    if len(thresholds) != y_proba.shape[1]:
        raise ValueError(
            f"Number of thresholds ({len(thresholds)}) must match "
            f"number of classes ({y_proba.shape[1]})"
        )

    # Broadcasting: compare each column with its threshold
    return (y_proba >= thresholds[np.newaxis, :]).astype(int)
216
+
217
+
218
def evaluate_with_thresholds(
    model,
    X_val: np.ndarray,
    y_val: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    method: str = "per_class",
) -> Dict:
    """
    Complete workflow: optimize thresholds on validation set and evaluate on test set.

    Pipeline:
    1. Get probability predictions on the validation set
    2. Optimize thresholds using validation data
    3. Apply optimized thresholds to the test set
    4. Compare with the default threshold (0.5)

    Args:
        model: Trained model with a predict_proba method
        X_val: Validation features
        y_val: Validation labels (binary)
        X_test: Test features
        y_test: Test labels (binary)
        method: 'global' or 'per_class'

    Returns:
        Dictionary with results:
            - 'thresholds': Optimized thresholds
            - 'f1_default': Weighted F1 with the default threshold (0.5)
            - 'f1_optimized': Weighted F1 with optimized thresholds
            - 'improvement': Absolute improvement in F1-score
            - 'y_pred_optimized': Binary test predictions under the
              optimized thresholds

    Example:
        >>> results = evaluate_with_thresholds(model, X_val, y_val, X_test, y_test)
        >>> print(f"F1 improvement: {results['improvement']:.4f}")
    """

    def _to_matrix(proba):
        # MultiOutputClassifier.predict_proba returns a list of per-label
        # (n_samples, 2) arrays; keep the positive-class column and stack.
        if isinstance(proba, list):
            return np.column_stack([p[:, 1] for p in proba])
        return proba

    print("Getting probability predictions on validation set...")
    y_val_proba = _to_matrix(model.predict_proba(X_val))

    # Thresholds are fit on validation data only, to avoid test leakage.
    print(f"Optimizing thresholds ({method})...")
    thresholds = optimize_thresholds(y_val, y_val_proba, method=method)

    print("Evaluating on test set...")
    y_test_proba = _to_matrix(model.predict_proba(X_test))

    # Baseline: fixed 0.5 threshold for every class.
    y_test_pred_default = (y_test_proba >= 0.5).astype(int)
    f1_default = f1_score(y_test, y_test_pred_default, average="weighted", zero_division=0)

    # Candidate: the thresholds optimized on the validation set.
    y_test_pred_optimized = apply_thresholds(y_test_proba, thresholds)
    f1_optimized = f1_score(y_test, y_test_pred_optimized, average="weighted", zero_division=0)

    improvement = f1_optimized - f1_default

    print("\nResults:")
    print(f" F1-score (default threshold=0.5): {f1_default:.4f}")
    print(f" F1-score (optimized thresholds): {f1_optimized:.4f}")
    if f1_default > 0:
        print(f" Improvement: {improvement:+.4f} ({improvement / f1_default * 100:+.2f}%)")
    else:
        # Guard: baseline F1 of exactly 0 would make the relative change
        # divide by zero (inf/nan with a RuntimeWarning).
        print(f" Improvement: {improvement:+.4f} (relative change undefined: baseline F1 is 0)")

    return {
        "thresholds": thresholds,
        "f1_default": f1_default,
        "f1_optimized": f1_optimized,
        "improvement": improvement,
        "y_pred_optimized": y_test_pred_optimized,
    }
models/.gitignore ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ /random_forest_tfidf_gridsearch.pkl
2
+ /random_forest_tfidf_gridsearch_adasyn_pca.pkl
3
+ /random_forest_tfidf_gridsearch_ros.pkl
4
+ /random_forest_tfidf_gridsearch_smote.pkl
5
+ /lightgbm_tfidf_gridsearch.pkl
6
+ /lightgbm_tfidf_gridsearch_smote.pkl
7
+ /pca_tfidf_adasyn.pkl
8
+ /label_names.pkl
9
+ /tfidf_vectorizer.pkl
10
+ /random_forest_embedding_gridsearch.pkl
11
+ /random_forest_embedding_gridsearch_smote.pkl
models/.gitkeep ADDED
File without changes
models/README.md ADDED
@@ -0,0 +1,206 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language: en
3
+ license: mit
4
+ tags:
5
+ - multi-label-classification
6
+ - tfidf
7
+ - embeddings
8
+ - random-forest
9
+ - oversampling
10
+ - mlsmote
11
+ - software-engineering
12
+ datasets:
13
+ - NLBSE/SkillCompetition
14
+ model-index:
15
+ - name: random_forest_tfidf_gridsearch
16
+ results:
17
+ - status: success
18
+ metrics:
19
+ cv_best_f1_micro: 0.595038375202279
20
+ test_precision_micro: 0.690371373744215
21
+ test_recall_micro: 0.5287455692919513
22
+ test_f1_micro: 0.5988446098110252
23
+ params:
24
+ estimator__max_depth: '10'
25
+ estimator__min_samples_split: '2'
26
+ estimator__n_estimators: '200'
27
+ feature_type: tfidf
28
+ model_type: RandomForest + MultiOutput
29
+ use_cleaned: 'True'
30
+ oversampling: 'False'
31
+ dvc:
32
+ path: random_forest_tfidf_gridsearch.pkl
33
+ - name: random_forest_tfidf_gridsearch_smote
34
+ results:
35
+ - status: success
36
+ metrics:
37
+ cv_best_f1_micro: 0.59092598557871
38
+ test_precision_micro: 0.6923300238053766
39
+ test_recall_micro: 0.5154318319356791
40
+ test_f1_micro: 0.59092598557871
41
+ params:
42
+ feature_type: tfidf
43
+ oversampling: 'MLSMOTE (RandomOverSampler fallback)'
44
+ dvc:
45
+ path: random_forest_tfidf_gridsearch_smote.pkl
46
+ - name: random_forest_embedding_gridsearch
47
+ results:
48
+ - status: success
49
+ metrics:
50
+ cv_best_f1_micro: 0.6012826418169578
51
+ test_precision_micro: 0.703060266254212
52
+ test_recall_micro: 0.5252460640075934
53
+ test_f1_micro: 0.6012826418169578
54
+ params:
55
+ feature_type: embedding
56
+ oversampling: 'False'
57
+ dvc:
58
+ path: random_forest_embedding_gridsearch.pkl
59
+ - name: random_forest_embedding_gridsearch_smote
60
+ results:
61
+ - status: success
62
+ metrics:
63
+ cv_best_f1_micro: 0.5962084744755453
64
+ test_precision_micro: 0.7031004709576139
65
+ test_recall_micro: 0.5175288364319172
66
+ test_f1_micro: 0.5962084744755453
67
+ params:
68
+ feature_type: embedding
69
+ oversampling: 'MLSMOTE (RandomOverSampler fallback)'
70
+ dvc:
71
+ path: random_forest_embedding_gridsearch_smote.pkl
72
+ ---
73
+
74
+
75
+ Model cards for committed models
76
+
77
+ Overview
78
+ - This file documents four trained model artifacts available in the repository: two TF‑IDF based Random Forest models (baseline and with oversampling) and two embedding‑based Random Forest models (baseline and with oversampling).
79
+ - For dataset provenance and preprocessing details see `data/README.md`.
80
+
81
+ 1) random_forest_tfidf_gridsearch
82
+
83
+ Model details
84
+ - Name: `random_forest_tfidf_gridsearch`
85
+ - Organization: Hopcroft (se4ai2526-uniba)
86
+ - Model type: `RandomForestClassifier` wrapped in `MultiOutputClassifier` for multi-label outputs
87
+ - Branch: `Milestone-4`
88
+
89
+ Intended use
90
+ - Suitable for research and benchmarking on multi-label skill prediction for GitHub PRs/issues. Not intended for automated high‑stakes decisions or profiling individuals without further validation.
91
+
92
+ Training data and preprocessing
93
+ - Dataset: Processed SkillScope Dataset (NLBSE/SkillCompetition) as prepared for this project.
94
+ - Features: TF‑IDF (unigrams and bigrams), up to `MAX_TFIDF_FEATURES=5000`.
95
+ - Feature and label files are referenced via `get_feature_paths(feature_type='tfidf', use_cleaned=True)` in `config.py`.
96
+
97
+ Evaluation
98
+ - Reported metrics include micro‑precision, micro‑recall and micro‑F1 on a held‑out test split.
99
+ - Protocol: 80/20 multilabel‑stratified split; hyperparameters selected via 5‑fold cross‑validation optimizing `f1_micro`.
100
+ - MLflow run: `random_forest_tfidf_gridsearch` (see `hopcroft_skill_classification_tool_competition/config.py`).
101
+
102
+ Limitations and recommendations
103
+ - Trained on Java repositories; generalization to other languages is not ensured.
104
+ - Label imbalance affects rare labels; apply per‑label thresholds or further sampling strategies if required.
105
+
106
+ Usage
107
+ - Artifact path: `models/random_forest_tfidf_gridsearch.pkl`.
108
+ - Example:
109
+ ```python
110
+ import joblib
111
+ model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')
112
+ y = model.predict(X_tfidf)
113
+ ```
114
+
115
+ 2) random_forest_tfidf_gridsearch_smote
116
+
117
+ Model details
118
+ - Name: `random_forest_tfidf_gridsearch_smote`
119
+ - Model type: `RandomForestClassifier` inside `MultiOutputClassifier` trained with multi‑label oversampling
120
+
121
+ Intended use
122
+ - Intended to improve recall for under‑represented labels by applying MLSMOTE (or RandomOverSampler fallback) during training.
123
+
124
+ Training and preprocessing
125
+ - Features: TF‑IDF (same configuration as the baseline).
126
+ - Oversampling: local MLSMOTE implementation when available; otherwise `RandomOverSampler`. Oversampling metadata (method and synthetic sample counts) are logged to MLflow.
127
+ - Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py` (action `smote`).
128
+
129
+ Evaluation
130
+ - MLflow run: `random_forest_tfidf_gridsearch_smote`.
131
+
132
+ Limitations and recommendations
133
+ - Synthetic samples may introduce distributional artifacts; validate synthetic examples and per‑label metrics before deployment.
134
+
135
+ Usage
136
+ - Artifact path: `models/random_forest_tfidf_gridsearch_smote.pkl`.
137
+
138
+ 3) random_forest_embedding_gridsearch
139
+
140
+ Model details
141
+ - Name: `random_forest_embedding_gridsearch`
142
+ - Features: sentence embeddings produced by `all-MiniLM-L6-v2` (see `config.EMBEDDING_MODEL_NAME`).
143
+
144
+ Intended use
145
+ - Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.
146
+
147
+ Training and preprocessing
148
+ - Embeddings generated and stored via `get_feature_paths(feature_type='embedding', use_cleaned=True)`.
149
+ - Training script: see `hopcroft_skill_classification_tool_competition/modeling/train.py`.
150
+
151
+ Evaluation
152
+ - MLflow run: `random_forest_embedding_gridsearch`.
153
+
154
+ Limitations and recommendations
155
+ - Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.
156
+
157
+ Usage
158
+ - Artifact path: `models/random_forest_embedding_gridsearch.pkl`.
159
+ - Example:
160
+ ```python
161
+ model.predict(X_embeddings)
162
+ ```
163
+
164
+ 4) random_forest_embedding_gridsearch_smote
165
+
166
+ Model details
167
+ - Name: `random_forest_embedding_gridsearch_smote`
168
+ - Combines embedding features with multi‑label oversampling to address rare labels.
169
+
170
+ Training and evaluation
171
+ - Oversampling: MLSMOTE preferred; `RandomOverSampler` fallback if MLSMOTE is unavailable.
172
+ - MLflow run: `random_forest_embedding_gridsearch_smote`.
173
+
174
+ Limitations and recommendations
175
+ - Review synthetic examples and re‑evaluate on target data prior to deployment.
176
+
177
+ Usage
178
+ - Artifact path: `models/random_forest_embedding_gridsearch_smote.pkl`.
179
+
180
+ Publishing guidance for Hugging Face Hub
181
+ - The YAML front‑matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:
182
+ - `README.md` (this file)
183
+ - model artifact(s) (`*.pkl`)
184
+ - vectorizer(s) and label map (e.g. `tfidf_vectorizer.pkl`, `label_names.pkl`)
185
+ - a minimal inference example or notebook
186
+
187
+ Evaluation Data and Protocol
188
+ - Evaluation split: an 80/20 multilabel‑stratified train/test split was used for final evaluation.
189
+ - Cross-validation: hyperparameters were selected via 5‑fold cross‑validation optimizing `f1_micro`.
190
+ - Test metrics reported: micro precision, micro recall, micro F1 (reported in the YAML `model-index` for each model).
191
+
192
+ Quantitative Analyses
193
+ - Reported unitary results: micro‑precision, micro‑recall and micro‑F1 on the held‑out test split for each model.
194
+ - Where available, `cv_best_f1_micro` is the best cross‑validation f1_micro recorded during training; when a CV value was not present in tracking, the test F1 is used as a proxy and noted in the README.
195
+ - Notes on comparability: TF‑IDF and embedding models are evaluated on the same held‑out splits (features differ); reported metrics are comparable for broad benchmarking but not for per‑label fairness analyses.
196
+
197
+ How Metrics Were Computed
198
+ - Metrics were computed using scikit‑learn's `precision_score`, `recall_score`, and `f1_score` with `average='micro'` and `zero_division=0` on the held‑out test labels and model predictions.
199
+ - Test feature and label files used are available under `data/processed/tfidf/` and `data/processed/embedding/` (paths referenced from `hopcroft_skill_classification_tool_competition.config.get_feature_paths`).
200
+
201
+ Ethical Considerations and Caveats
202
+ - The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
203
+ - Label imbalance is present; oversampling (MLSMOTE or RandomOverSampler fallback) was used in two variants to improve recall for rare labels — inspect per‑label metrics before deploying.
204
+ - The models and README are intended for research and benchmarking. They are not validated for safety‑critical or high‑stakes automated decisioning.
205
+
206
+
models/kept_label_indices.npy ADDED
Binary file (1.26 kB). View file
 
models/label_names.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: bd94d38e415f8dc2aaee3f60b6776483
3
+ size: 6708
4
+ hash: md5
5
+ path: label_names.pkl
models/random_forest_embedding_gridsearch.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: e1c1c0290e0c6036ee798275fdbad61c
3
+ size: 346568353
4
+ hash: md5
5
+ path: random_forest_embedding_gridsearch.pkl
models/random_forest_embedding_gridsearch_smote.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 4d5379a3847341f8de423778b94537b0
3
+ size: 1035016993
4
+ hash: md5
5
+ path: random_forest_embedding_gridsearch_smote.pkl
models/random_forest_tfidf_gridsearch.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 39165e064d60e4bd0688e6c7aa94258c
3
+ size: 137359137
4
+ hash: md5
5
+ path: random_forest_tfidf_gridsearch.pkl
models/random_forest_tfidf_gridsearch_adasyn_pca.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 5e0024de62e0c693cb41677084bb0fd5
3
+ size: 382639449
4
+ hash: md5
5
+ path: random_forest_tfidf_gridsearch_adasyn_pca.pkl
models/random_forest_tfidf_gridsearch_ros.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 3fda04bfcc26d1fb4350b8a31b80ebaf
3
+ size: 3011992009
4
+ hash: md5
5
+ path: random_forest_tfidf_gridsearch_ros.pkl
models/random_forest_tfidf_gridsearch_smote.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 7e39607fd740c69373376133b8c4f87b
3
+ size: 371297857
4
+ hash: md5
5
+ path: random_forest_tfidf_gridsearch_smote.pkl
models/tfidf_vectorizer.pkl.dvc ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ outs:
2
+ - md5: 6f8eab4e3e9dbb44a65d6387e061d7ac
3
+ size: 76439
4
+ hash: md5
5
+ path: tfidf_vectorizer.pkl
notebooks/.gitkeep ADDED
File without changes