---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---

# Hopcroft_Skill-Classification-Tool-Competition

The task involves analyzing the relationship between issue characteristics and required skills, developing effective feature extraction methods that combine textual and code-context information, and implementing sophisticated multi-label classification approaches. Students may incorporate additional GitHub metadata to enhance model inputs, but must avoid using third-party classification engines or direct outputs from the provided database. The work requires careful attention to the multi-label nature of the problem, where each issue may require multiple different skills for resolution.
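For orientation, a minimal multi-label setup might look like the following sketch (scikit-learn assumed; the three-skill label set and the toy issues are placeholders, not the SkillScope schema):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier

# Toy issues with multi-hot skill labels: [python, database, devops].
issues = [
    "Fix SQL injection in the login query",
    "Add a Dockerfile and CI pipeline",
    "Refactor the pandas data loader",
]
labels = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])

X = TfidfVectorizer().fit_transform(issues)
clf = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, labels)
predictions = clf.predict(X)  # one binary skill vector per issue
```

Each row of `predictions` is a binary vector, so a single issue can be flagged with several skills at once.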

## Project Organization

```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         hopcroft_skill_classification_tool_competition and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── hopcroft_skill_classification_tool_competition   <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes hopcroft_skill_classification_tool_competition a Python module
    │
    ├── config.py               <- Store useful variables and configuration
    │
    ├── dataset.py              <- Scripts to download or generate data
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py          <- Code to run model inference with trained models
    │   └── train.py            <- Code to train models
    │
    └── plots.py                <- Code to create visualizations
```

--------

## Setup

### MLflow Credentials Configuration

Set up DagsHub credentials for MLflow tracking.

**Get your token:** [DagsHub](https://dagshub.com) → Profile → Settings → Tokens

#### Option 1: Using `.env` file (Recommended for local development)

```bash
# Copy the template
cp .env.example .env

# Edit .env with your credentials
```

Your `.env` file should contain:
```
MLFLOW_TRACKING_URI=https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_token
```

> [!NOTE]
> The `.env` file is git-ignored for security. Never commit credentials to version control.
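Once the `.env` file is loaded (e.g. by `python-dotenv` or by Docker Compose), application code can read the credentials from the process environment. A sketch, assuming the variable names above:

```python
import os

# MLflow picks up MLFLOW_TRACKING_URI / _USERNAME / _PASSWORD from the
# environment; os.environ.get avoids a KeyError when a variable is unset.
tracking_uri = os.environ.get(
    "MLFLOW_TRACKING_URI",
    "https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow",  # documented default
)
username = os.environ.get("MLFLOW_TRACKING_USERNAME")
```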

#### Option 2: Using Docker Compose

When using Docker Compose, the `.env` file is automatically loaded via `env_file` directive in `docker-compose.yml`.

```bash
# Start the service (credentials loaded from .env)
docker compose up --build
```

--------

## CI Configuration

[![CI Pipeline](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml/badge.svg)](https://github.com/se4ai2526-uniba/Hopcroft/actions/workflows/ci.yml)

This project uses automatically triggered GitHub Actions workflows for Continuous Integration.

### Secrets

To enable DVC model pulling, configure these Repository Secrets:

- `DAGSHUB_USERNAME`: DagsHub username.
- `DAGSHUB_TOKEN`: DagsHub access token.

--------

## Milestone Summary

### Milestone 1
We compiled the ML Canvas and defined:
- Problem: multi-label classification of skills for PR/issues.
- Stakeholders and business/research goals.
- Data sources (SkillScope DB) and constraints (no external classifiers).
- Success metrics (micro-F1, imbalance handling, experiment tracking).
- Risks (label imbalance, text noise, multi-label complexity) and mitigations.
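Micro-F1, the headline metric above, pools true/false positives across all labels before computing precision and recall, which is why label imbalance is tracked as a separate risk. With scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Three samples, three skills, multi-hot encoded.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# TP=3, FP=0, FN=2 pooled over all cells -> micro-F1 = 2*3 / (2*3 + 0 + 2) = 0.75
score = f1_score(y_true, y_pred, average="micro")
```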

### Milestone 2
We implemented the essential end-to-end infrastructure to go from data to tracked modeling experiments:

1. Data Management
    - DVC setup (raw dataset and TF-IDF features tracked) with DagsHub remote; dedicated gitignores for data/models.

2. Data Ingestion & EDA
    - `dataset.py` to download/extract SkillScope from Hugging Face (zip → SQLite) with cleanup.
    - Initial exploration notebook `notebooks/1.0-initial-data-exploration.ipynb` (schema, text stats, label distribution).

3. Feature Engineering
    - `features.py`: GitHub text cleaning (URL/HTML/markdown removal, normalization, Porter stemming) and TF-IDF (uni+bi-grams) saved as NumPy (`features_tfidf.npy`, `labels_tfidf.npy`).

4. Central Config
    - `config.py` with project paths, training settings, RF param grid, MLflow URI/experiments, PCA/ADASYN, feature constants.

5. Modeling & Experiments
    - Unified `modeling/train.py` with actions: baseline RF, MLSMOTE, ROS, ADASYN+PCA, LightGBM, LightGBM+MLSMOTE, and inference.
    - GridSearchCV (micro-F1), MLflow logging, removal of all-zero labels, multilabel-stratified splits (with fallback).

6. Imbalance Handling
    - Local `mlsmote.py` (multi-label oversampling) with fallback to `RandomOverSampler`; dedicated ADASYN+PCA pipeline.

7. Tracking & Reproducibility
    - Remote MLflow (DagsHub) with README credential setup; DVC-tracked models and auxiliary artifacts (e.g., PCA, kept label indices).

8. Tooling
    - Updated `requirements.txt` (lightgbm, imbalanced-learn, iterative-stratification, huggingface-hub, dvc, mlflow, nltk, seaborn, etc.) and extended Makefile targets (`data`, `features`).
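The text-cleaning plus TF-IDF step described above can be sketched as follows (scikit-learn assumed; `clean_github_text` is illustrative, not the actual `features.py` API):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_github_text(text: str) -> str:
    # Strip URLs, inline code, and markdown artifacts, then normalize.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"`[^`]*`", " ", text)
    text = re.sub(r"[#*>\[\]()]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

docs = [
    "Fix NullPointerException in auth module https://github.com/x/y/issues/1",
    "Add **unit tests** for the `parse_config` helper",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)  # uni+bi-grams
X = vectorizer.fit_transform([clean_github_text(d) for d in docs])
features = X.toarray()  # dense array, as saved to features_tfidf.npy
```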

### Milestone 3 (QA)
We implemented a comprehensive testing and validation framework to ensure data quality and model robustness:

1. **Data Cleaning Pipeline**
    - `data_cleaning.py`: Removes duplicates (481 samples), resolves label conflicts via majority voting (640 samples), filters sparse samples incompatible with SMOTE, and ensures train-test separation without leakage.
    - Final cleaned dataset: 6,673 samples (from 7,154 original), 80/20 stratified split.

2. **Great Expectations Validation** (10 tests)
    - Database integrity, feature matrix validation (no NaN/Inf, sparsity checks), label format validation (binary {0,1}), feature-label consistency.
    - Label distribution for stratification (min 5 occurrences), SMOTE compatibility (min 10 non-zero features), duplicate detection, train-test separation, label consistency.
    - All 10 tests pass on cleaned data; comprehensive JSON reports in `reports/great_expectations/`.

3. **Deepchecks Validation** (24 checks across 2 suites)
    - Data Integrity Suite (92% score): validates duplicates, label conflicts, nulls, data types, feature correlation.
    - Train-Test Validation Suite (100% score): **zero data leakage**, proper train/test split, feature/label drift analysis.
    - Cleaned data achieved production-ready status (96% overall score).

4. **Behavioral Testing** (36 tests)
    - Invariance tests (9): typo robustness, synonym substitution, case insensitivity, punctuation/URL noise tolerance.
    - Directional tests (10): keyword addition effects, technical detail impact on predictions.
    - Minimum Functionality Tests (17): basic skill predictions on clear examples (bug fixes, database work, API development, testing, DevOps).
    - All tests passed; comprehensive report in `reports/behavioral/`.

5. **Code Quality Analysis**
    - Ruff static analysis: 28 minor issues identified (unsorted imports, unused variables, f-strings), 100% fixable.
    - PEP 8 compliant, Black compatible (line length 88).

6. **Documentation**
    - Comprehensive `docs/testing_and_validation.md` with detailed test descriptions, execution commands, and analysis results.
    - Behavioral testing README with test categories, usage examples, and extension guide.

7. **Tooling**
    - Makefile targets: `validate-gx`, `validate-deepchecks`, `test-behavioral`, `test-complete`.
    - Automated test execution and report generation.
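An invariance test from the behavioral suite follows this shape: apply a label-preserving perturbation (case change, URL noise) and assert the prediction is unchanged. A stdlib-only sketch with a stand-in predictor (the real tests call the trained model):

```python
def predict_skills(text):
    # Stand-in for the real model: flags "database" work on a keyword match.
    return {"database"} if "database" in text.lower() else set()

original = "Migrate the database schema to PostgreSQL"
perturbed = "MIGRATE THE DATABASE SCHEMA TO POSTGRESQL http://example.com"

# Invariance: the perturbation must not change the predicted skill set.
assert predict_skills(original) == predict_skills(perturbed)
```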

### Milestone 4 (API)
We implemented a production-ready FastAPI service for skill prediction with MLflow integration:

#### Features
- **REST API Endpoints**:
  - `POST /predict` - Predict skills for a GitHub issue (logs to MLflow)
  - `GET /predictions/{run_id}` - Retrieve prediction by MLflow run ID
  - `GET /predictions` - List recent predictions with pagination
  - `GET /health` - Health check endpoint
- **Model Management**: Loads trained Random Forest + TF-IDF vectorizer from `models/`
- **MLflow Tracking**: All predictions logged with metadata, probabilities, and timestamps
- **Input Validation**: Pydantic models for request/response validation
- **Interactive Docs**: Auto-generated Swagger UI and ReDoc
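A client request to `POST /predict` can be assembled with the standard library as below; the payload field name is illustrative, and the Swagger UI at `/docs` is the authoritative schema:

```python
import json
import urllib.request

# Hypothetical payload; check /docs for the actual Pydantic request model.
payload = {"text": "App crashes with NullPointerException when logging in"}

req = urllib.request.Request(
    "http://127.0.0.1:8000/predict",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would return the JSON prediction response.
```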

#### API Usage

**1. Start the API Server**
```bash
# Development mode (auto-reload)
make api-dev

# Production mode
make api-run
```
Server starts at: [http://127.0.0.1:8000](http://127.0.0.1:8000)

**2. Test Endpoints**

**Option A: Swagger UI (Recommended)**
- Navigate to: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Interactive interface to test all endpoints
- View request/response schemas

**Option B: Make Commands**
```bash
# Test all endpoints
make test-api-all

# Individual endpoints
make test-api-health        # Health check
make test-api-predict       # Single prediction
make test-api-list          # List predictions
```

#### Prerequisites
- Trained model: `models/random_forest_tfidf_gridsearch.pkl`
- TF-IDF vectorizer: `models/tfidf_vectorizer.pkl` (auto-saved during feature creation)
- Label names: `models/label_names.pkl` (auto-saved during feature creation)

#### MLflow Integration
- All predictions logged to: `https://dagshub.com/se4ai2526-uniba/Hopcroft.mlflow`
- Experiment: `skill_prediction_api`
- Tracked: input text, predictions, probabilities, metadata

#### Docker
Build and run the API in a container:
```bash
docker build -t hopcroft-api .
docker run --rm --name hopcroft-api -p 8080:8080 hopcroft-api
```

Endpoints:
- Swagger UI: [http://localhost:8080/docs](http://localhost:8080/docs)
- Health check: [http://localhost:8080/health](http://localhost:8080/health)

---

## Docker Compose Usage

Docker Compose orchestrates both the **API backend** and **Streamlit GUI** services with proper networking and configuration.

### Prerequisites

1. **Create your environment file:**
   ```bash
   cp .env.example .env
   ```

2. **Edit `.env`** with your actual credentials:
   ```
   MLFLOW_TRACKING_USERNAME=your_dagshub_username
   MLFLOW_TRACKING_PASSWORD=your_dagshub_token
   ```
   
   Get your token from: [https://dagshub.com/user/settings/tokens](https://dagshub.com/user/settings/tokens)

### Quick Start

#### 1. Build and Start All Services
Build both images and start the containers:
```bash
docker-compose up -d --build
```

| Flag | Description |
|------|-------------|
| `-d` | Run in detached mode (background) |
| `--build` | Rebuild images before starting (use when code/Dockerfile changes) |

**Available Services:**
- **API (FastAPI):** [http://localhost:8080/docs](http://localhost:8080/docs)
- **GUI (Streamlit):** [http://localhost:8501](http://localhost:8501)
- **Health Check:** [http://localhost:8080/health](http://localhost:8080/health)

#### 2. Stop All Services
Stop and remove containers and networks:
```bash
docker-compose down
```

| Flag | Description |
|------|-------------|
| `-v` | Also remove named volumes (e.g., `hopcroft-logs`): `docker-compose down -v` |
| `--rmi all` | Also remove images: `docker-compose down --rmi all` |

#### 3. Restart Services
After updating `.env` or configuration files:
```bash
docker-compose restart
```

Or for a full restart with environment reload:
```bash
docker-compose down
docker-compose up -d
```

#### 4. Check Status
View the status of all running services:
```bash
docker-compose ps
```

Or use Docker commands:
```bash
docker ps
```

#### 5. View Logs
Tail logs from both services in real-time:
```bash
docker-compose logs -f
```

View logs from a specific service:
```bash
docker-compose logs -f hopcroft-api
docker-compose logs -f hopcroft-gui
```

| Flag | Description |
|------|-------------|
| `-f` | Follow log output (stream new logs) |
| `--tail 100` | Show only last 100 lines: `docker-compose logs --tail 100` |

#### 6. Execute Commands in Container
Open an interactive shell inside a running container:
```bash
docker-compose exec hopcroft-api /bin/bash
docker-compose exec hopcroft-gui /bin/bash
```

Examples of useful commands inside the API container:
```bash
# Check installed packages
pip list

# Run Python interactively
python

# Check model file exists
ls -la /app/models/

# Verify environment variables
printenv | grep MLFLOW
```

### Architecture Overview

**Docker Compose orchestrates two services:**

```
docker-compose.yml
├── hopcroft-api (FastAPI Backend)
│   ├── Build: ./Dockerfile
│   ├── Port: 8080:8080
│   ├── Network: hopcroft-net
│   ├── Environment: .env (MLflow credentials)
│   ├── Volumes:
│   │   ├── ./hopcroft_skill_classification_tool_competition (hot reload)
│   │   └── hopcroft-logs:/app/logs (persistent logs)
│   └── Health Check: /health endpoint
│
├── hopcroft-gui (Streamlit Frontend)
│   ├── Build: ./Dockerfile.streamlit
│   ├── Port: 8501:8501
│   ├── Network: hopcroft-net
│   ├── Environment: API_BASE_URL=http://hopcroft-api:8080
│   ├── Volumes:
│   │   └── ./hopcroft_skill_classification_tool_competition/streamlit_app.py (hot reload)
│   └── Depends on: hopcroft-api (waits for health check)
│
└── hopcroft-net (bridge network)
```

**External Access:**
- API: http://localhost:8080
- GUI: http://localhost:8501

**Internal Communication:**
- GUI → API: http://hopcroft-api:8080 (via Docker network)

### Services Description

**hopcroft-api (FastAPI Backend)**
- Purpose: FastAPI backend serving the ML model for skill classification
- Image: Built from `Dockerfile`
- Port: 8080 (maps to host 8080)
- Features:
  - Random Forest model with embedding features
  - MLflow experiment tracking
  - Auto-reload in development mode
  - Health check endpoint

**hopcroft-gui (Streamlit Frontend)**
- Purpose: Streamlit web interface for interactive predictions
- Image: Built from `Dockerfile.streamlit`
- Port: 8501 (maps to host 8501)
- Features:
  - User-friendly interface for skill prediction
  - Real-time communication with API
  - Automatic reconnection on API restart
  - Depends on API health before starting

### Development vs Production

**Development (default):**
- Auto-reload enabled (`--reload`)
- Source code mounted with bind mounts
- Custom command with hot reload
- GUI → API via Docker network

**Production:**
- Auto-reload disabled
- Use built image only
- Use Dockerfile's CMD
- GUI → API via Docker network

For **production deployment**, modify `docker-compose.yml` to remove bind mounts and disable reload.

### Troubleshooting

#### Issue: GUI shows "API is not available"
**Solution:**
1. Wait 30-60 seconds for API to fully initialize and become healthy
2. Refresh the GUI page (F5)
3. Check API health: `curl http://localhost:8080/health`
4. Check logs: `docker-compose logs hopcroft-api`

#### Issue: "500 Internal Server Error" on predictions
**Solution:**
1. Verify MLflow credentials in `.env` are correct
2. Restart services: `docker-compose down && docker-compose up -d`
3. Check environment variables: `docker exec hopcroft-api printenv | grep MLFLOW`

#### Issue: Changes to code not reflected
**Solution:**
- For Python code changes: Auto-reload is enabled, wait a few seconds
- For Dockerfile changes: Rebuild with `docker-compose up -d --build`
- For `.env` changes: Restart with `docker-compose down && docker-compose up -d`

#### Issue: Port already in use
**Solution:**
```bash
# Check what's using the port
netstat -ano | findstr :8080
netstat -ano | findstr :8501

# Stop existing containers
docker-compose down

# Or change ports in docker-compose.yml
```


--------

## Hugging Face Spaces Deployment

This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.

### 1. Setup Space
1. Create a new Space on Hugging Face.
2. Select **Docker** as the SDK.
3. Choose the **Blank** template or upload your code.

### 2. Configure Secrets
To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:

| Name | Type | Description |
|------|------|-------------|
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |

> [!IMPORTANT]
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.

### 3. Automated Startup
The deployment follows this automated flow:
1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
2. **scripts/start_space.sh**: 
   - Configures DVC with your secrets.
   - Pulls models from the DagsHub remote.
   - Starts the **FastAPI** backend (port 8000).
   - Starts the **Streamlit** frontend (port 8501).
   - Starts **Nginx** (port 7860) as a reverse proxy to route traffic.

### 4. Direct Access
Once deployed, your Space will be available at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`

The API documentation will be accessible at:
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`

--------

## Demo UI (Streamlit)

The Streamlit GUI provides an interactive web interface for the skill classification API.

### Features
- Real-time skill prediction from GitHub issue text
- Top-5 predicted skills with confidence scores
- Full predictions table with all skills
- API connection status indicator
- Responsive design

### Usage
1. Ensure both services are running: `docker-compose up -d`
2. Open the GUI in your browser: [http://localhost:8501](http://localhost:8501)
3. Enter a GitHub issue description in the text area
4. Click "Predict Skills" to get predictions
5. View results in the predictions table

### Architecture
- **Frontend**: Streamlit (Python web framework)
- **Communication**: HTTP requests to FastAPI backend via Docker network
- **Independence**: GUI and API run in separate containers
- **Auto-reload**: GUI code changes are reflected immediately (bind mount)

> The GUI and the API must run **simultaneously**, in separate terminals or containers.

### Quick Start

1. **Start the FastAPI backend:**
   ```bash
   fastapi dev hopcroft_skill_classification_tool_competition/main.py
   ```

2. **In a new terminal, start Streamlit:**
   ```bash
   streamlit run streamlit_app.py
   ```

3. **Open your browser:**
   - Streamlit UI: http://localhost:8501
   - FastAPI Docs: http://localhost:8000/docs

### Features

- Interactive web interface for skill prediction
- Real-time predictions with confidence scores
- Adjustable confidence threshold
- Multiple input modes (quick/detailed/examples)
- Visual result display
- API health monitoring
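The adjustable threshold works by discarding any skill whose predicted probability falls below the slider value; a stdlib-only sketch (names illustrative):

```python
# Probabilities as returned per skill by the API (illustrative values).
probabilities = {"python": 0.91, "database": 0.64, "devops": 0.12}

def filter_predictions(probs, threshold):
    # Keep only skills at or above the user-chosen confidence threshold.
    return {skill: p for skill, p in probs.items() if p >= threshold}

shown = filter_predictions(probabilities, 0.5)  # {"python": 0.91, "database": 0.64}
```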

### Demo Walkthrough

#### Main Dashboard

![gui_main_dashboard](docs/img/gui_main_dashboard.png)

The main interface provides:
- **Sidebar**: API health status, confidence threshold slider, model info
- **Three input modes**: Quick Input, Detailed Input, Examples
#### Quick Input Mode

![gui_quick_input](docs/img/gui_quick_input.png)
Simply paste your GitHub issue text and click "Predict Skills"!

#### Prediction Results
![gui_detailed](docs/img/gui_detailed.png)
View:
- **Top predictions** with confidence scores
- **Full predictions table** with filtering
- **Processing metrics** (time, model version)
- **Raw JSON response** (expandable)

#### Detailed Input Mode

![gui_detailed_input](docs/img/gui_detailed_input.png)
Add optional metadata:
- Repository name
- PR number
- Detailed description

#### Example Gallery
![gui_ex](docs/img/gui_ex.png)

Test with pre-loaded examples:
- Authentication bugs
- ML features
- Database issues
- UI enhancements
  
