---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---

# Hopcroft Skill Classification

[![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft)
[![MLflow](https://img.shields.io/badge/MLflow-Tracking-blue)](https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow)

**Multi-label skill classification for GitHub issues and pull requests.** Automatically identify the technical skills required to resolve software issues using machine learning.

---

## Overview

Hopcroft is an ML-enabled system that classifies GitHub issues into 217 technical skill categories, enabling automated developer assignment and optimized resource allocation. It is built following professional MLOps and software engineering standards.

### Key Features

- 🎯 **Multi-label Classification**: Predict multiple skills per issue
- 🚀 **REST API**: FastAPI with Swagger documentation
- 🖥️ **Web Interface**: Streamlit GUI for interactive predictions
- 📊 **Monitoring**: Prometheus/Grafana dashboards with drift detection
- 🔄 **CI/CD**: GitHub Actions with Docker deployment
- 📈 **Experiment Tracking**: MLflow on DagsHub
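
To illustrate what "multi-label" means in practice, here is a minimal sketch of how several skills could be selected for one issue from per-label scores. The label names, the scores, the `select_skills` helper, and the 0.5 threshold are all hypothetical and are not taken from the Hopcroft codebase.

```python
from typing import Dict, List


def select_skills(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every label whose score meets the threshold, highest score first.

    Unlike single-label classification, which keeps only the top label,
    this keeps all labels that pass the threshold (possibly none).
    """
    chosen = [label for label, score in scores.items() if score >= threshold]
    return sorted(chosen, key=lambda label: scores[label], reverse=True)


# Hypothetical per-label scores for one issue; not real model output.
example_scores = {"Authentication": 0.91, "Networking": 0.62, "Databases": 0.08}
print(select_skills(example_scores))
```

With these made-up scores, two of the three labels pass the threshold, so the issue is tagged with both.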

---

## Architecture

```mermaid
graph TB
    subgraph "Data Layer"
        A[(SkillScope DB)] --> B[Feature Engineering]
        B --> C[TF-IDF / Embeddings]
    end
    
    subgraph "ML Pipeline"
        C --> D[Model Training]
        D --> E[(MLflow Tracking)]
        D --> F[Random Forest Model]
    end
    
    subgraph "Serving Layer"
        F --> G[FastAPI Service]
        G --> H[predict endpoint]
        G --> I[predictions endpoint]
        G --> J[health endpoint]
    end
    
    subgraph "Frontend"
        G --> K[Streamlit GUI]
    end
    
    subgraph "Monitoring"
        G --> L[Prometheus]
        L --> M[Grafana]
        N[Drift Detection] --> L
    end
    
    subgraph "Deployment"
        O[GitHub Actions] --> P[Docker Build]
        P --> Q[HF Spaces]
    end
```

---

## Documentation

| Document | Description |
|----------|-------------|
| 📋 [Milestone Summaries](docs/milestone_summaries.md) | All 6 project phases documented |
| 📖 [User Guide](docs/user_guide.md) | Setup, API, GUI, testing, monitoring |
| 🏗️ [Design Choices](docs/design_choices.md) | Technical decisions & rationale |
| 🎯 [ML Canvas](docs/ML%20Canvas.md) | Requirements engineering framework |
| ✅ [Testing & Validation](docs/testing_and_validation.md) | QA strategy & results |
| 📊 [Model Card](models/README.md) | Model details & performance |
| 📊 [Dataset Card](data/README.md) | Dataset details & preprocessing |

---

## Quick Start

### Docker (Recommended)

```bash
# Clone and configure
git clone https://github.com/se4ai2526-uniba/Hopcroft.git
cd Hopcroft
cp .env.example .env
# Edit .env with your DagsHub credentials

# Start services
docker compose -f docker/docker-compose.yml up -d --build
```

**Access (Local):**
- 🌐 **API Docs**: http://localhost:8080/docs
- 🖥️ **GUI**: http://localhost:8501
- ❤️ **Health**: http://localhost:8080/health
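
The containers may take a moment to become ready after `docker compose up`. The following is a small sketch of a readiness check that polls the health URL listed above. It assumes that `/health` answers with HTTP 200 once the API is ready; the helper names, retry count, and delay are illustrative choices, not part of the project.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_healthy(
    url: str,
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 (assumed to mean healthy) or attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example usage against the local deployment:
#   wait_until_healthy("http://localhost:8080/health")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.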

### Local Development

```bash
# Setup environment
python -m venv venv && source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt && pip install -e .

# Start API
make api-dev

# Start GUI (new terminal)
streamlit run hopcroft_skill_classification_tool_competition/streamlit_app.py
```

---

## Project Structure

```
├── hopcroft_skill_classification_tool_competition/
│   ├── main.py              # FastAPI application
│   ├── streamlit_app.py     # Streamlit GUI
│   ├── features.py          # Feature engineering
│   ├── modeling/            # Training & prediction
│   └── config.py            # Configuration
├── data/                    # DVC-tracked datasets
├── models/                  # DVC-tracked models
├── tests/                   # Pytest test suites
├── monitoring/              # Prometheus, Grafana, Locust
├── docker/                  # Docker configurations
├── docs/                    # Documentation
└── .github/workflows/       # CI/CD pipelines
```

---

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/predict` | Classify a single issue |
| `POST` | `/predict/batch` | Classify a batch of issues |
| `GET` | `/predictions` | List recent predictions |
| `GET` | `/predictions/{id}` | Get a prediction by MLflow run ID |
| `GET` | `/health` | Health check |
| `GET` | `/metrics` | Prometheus metrics |

**Example:**
```bash
curl -X POST "http://localhost:8080/predict" \
  -H "Content-Type: application/json" \
  -d '{"issue_text": "Fix OAuth2 authentication bug"}'
```
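
The same request can be sent from Python. This is a minimal sketch that uses only the standard library and mirrors the `curl` call above (same URL, header, and `issue_text` field). The function names are illustrative, and because the response schema is not described here, the sketch returns the parsed JSON without assuming any particular fields; check `/docs` for the actual schema.

```python
import json
import urllib.request


def build_predict_request(
    issue_text: str, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build a POST request for /predict with the same body as the curl example."""
    payload = json.dumps({"issue_text": issue_text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(issue_text: str, base_url: str = "http://localhost:8080", timeout: float = 10.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_predict_request(issue_text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the API to be running locally):
#   result = predict("Fix OAuth2 authentication bug")
#   print(result)
```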

---

## Live Deployment

- **API**: https://dacrow13-hopcroft-skill-classification.hf.space/docs
- **GUI**: https://dacrow13-hopcroft-skill-classification.hf.space
- **MLflow**: https://dagshub.com/se4ai2526-uniba/Hopcroft/experiments
- **Prometheus**: https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/
- **Grafana**: https://dacrow13-hopcroft-skill-classification.hf.space/grafana/
- **Better Stack**: Alerting is configured; see the [alert system evidence](monitoring/screenshots).

---

## Development

```bash
# Run tests
make test-all              # All tests
make test-behavioral       # ML behavioral tests
make validate-deepchecks   # Data validation

# Lint & format
make lint                  # Check code style
make format                # Auto-fix issues

# Training
make train-baseline-tfidf  # Train baseline model
```

---

## License

This project was developed as part of the SE4AI 2025-26 course at the University of Bari.