# 🧠 Scikit-learn GitHub Issues – Multilabel Classifier

This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.

The model is suitable for:
- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines

---

## 🔍 Task

**Multilabel Text Classification**

Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`).  
The model predicts **all relevant labels** for a given issue text.

---

## 📦 Dataset

- **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository  
- **Collection method**: Custom GitHub REST API pipeline  
- **Preprocessing steps**:
  - Included **open and closed issues**
  - Excluded **pull requests**
  - Retrieved **all issue comments**
  - Exploded comments so each sample contains:
    - issue title
    - issue body
    - comments
  - Converted labels to **multi-hot vectors**

- **Dataset on Hugging Face**:  
  👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel

**Final dataset size**: ~12,000 samples  
**Number of unique labels**: 20+ (approximate)
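
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration and not the author's actual pipeline: the helper names (`is_pull_request`, `build_text`, `to_multi_hot`), the way title, body, and comments are joined, and the sample records are all assumptions. It relies on the fact that the GitHub REST issues endpoint also returns pull requests, which carry a `pull_request` key.

```python
from __future__ import annotations

# Hypothetical sketch of the preprocessing steps; not the author's pipeline.


def is_pull_request(item: dict) -> bool:
    """Items from the GitHub issues endpoint that are pull requests
    carry a `pull_request` key."""
    return "pull_request" in item


def build_text(title: str, body: str | None, comments: list[str]) -> str:
    """Join title, body, and comments into one input string (assumed layout)."""
    parts = [title, body or "", *comments]
    return "\n\n".join(p.strip() for p in parts if p and p.strip())


def to_multi_hot(labels: list[str], label_names: list[str]) -> list[float]:
    """Encode a label list as a fixed-length 0/1 vector over `label_names`."""
    present = set(labels)
    return [1.0 if name in present else 0.0 for name in label_names]


# Made-up records shaped like GitHub API responses.
raw_items = [
    {"title": "Fit fails", "body": "Traceback ...", "labels": [{"name": "Bug"}]},
    {"title": "Fix typo", "body": None, "labels": [], "pull_request": {"url": "..."}},
]
label_names = ["Bug", "Documentation", "module:linear_model"]  # illustrative subset

# Drop pull requests, then build the text and label vector for one issue.
issues = [item for item in raw_items if not is_pull_request(item)]
sample = issues[0]
text = build_text(sample["title"], sample["body"], ["Can reproduce on main."])
vector = to_multi_hot([lab["name"] for lab in sample["labels"]], label_names)

print(text)
print(vector)  # [1.0, 0.0, 0.0]
```

Float values are used for the label vector because a binary cross-entropy loss typically expects floating-point targets.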

---

## 🧱 Model

- **Base model**: `distilbert-base-uncased`
- **Architecture**: `AutoModelForSequenceClassification`
- **Problem type**: `multi_label_classification`
- **Loss function**: Binary Cross Entropy with Logits
- **Activation**: Sigmoid
- **Prediction threshold**: 0.5
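
To make the loss, activation, and threshold settings concrete, here is a small pure-Python sketch of what they compute for a single sample. The logits, targets, and label subset are made up for illustration, and the helper names are hypothetical; a real training run would use the framework's built-in loss rather than this code.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict_labels(logits, label_names, threshold=0.5):
    """Keep every label whose sigmoid probability exceeds the threshold."""
    return [name for x, name in zip(logits, label_names) if sigmoid(x) > threshold]


label_names = ["Bug", "Documentation", "module:linear_model"]  # illustrative subset
logits = [2.0, -3.0, 0.4]    # made-up model outputs for one issue
targets = [1.0, 0.0, 1.0]    # made-up multi-hot ground truth

predicted = predict_labels(logits, label_names)
loss = bce_with_logits(logits, targets)

print(predicted)  # ['Bug', 'module:linear_model']
print(loss)
```

Because each label gets its own sigmoid, the probabilities do not sum to one, and any number of labels (including none) can pass the threshold.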

---

## 📊 Evaluation Metrics

| Metric     | Score |
|-----------|-------|
| Micro F1  | **0.857** |
| Macro F1  | 0.263 |
| Epochs    | 3 |

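**Notes**:
- Micro F1 pools true positives, false positives, and false negatives across all labels, so the high score mainly reflects good performance on frequent labels.
- Macro F1 averages per-label F1 with equal weight, so rare labels pull it down. A low value is expected under the **severe label imbalance** common in real-world GitHub issue datasets.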
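
The gap between the two scores can be reproduced on a toy example. The sketch below uses made-up data, not the model's predictions: one frequent label is predicted perfectly, and two rare labels are never predicted. The helper names are hypothetical.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for lists of multi-hot vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Label 0 is frequent and always correct; labels 1 and 2 are rare and missed.
y_true = [[1, 0, 0]] * 8 + [[1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0]] * 8 + [[1, 0, 0], [0, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}")  # micro F1 = 0.900
print(f"macro F1 = {macro:.3f}")  # macro F1 = 0.333
```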

---

## 🧪 Training Details

- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256
- Validation split: 10%
- Best model selection: micro-F1
- Trained on GPU

---

## 🚀 Inference Example

```python
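from transformers import pipeline

# top_k=None returns a score for every label (it replaces the deprecated
# return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

# Passing a list returns one list of {"label", "score"} dicts per input text.
outputs = classifier([text])

# Keep every label whose score exceeds the 0.5 threshold.
labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
print(labels)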
```

---

## 🔗 Intended Use

- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines

---

## ⚠️ Limitations

- Rare labels have limited representation
- Threshold-based predictions may require tuning per use case
- Model is domain-specific to scikit-learn GitHub issues
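
One possible way to tune the threshold, shown here only as a sketch, is to sweep candidate values per label on held-out validation data and keep the one with the best F1. The scores and targets below are made up, the helper names are hypothetical, and this procedure is a suggestion rather than something the released model was trained or evaluated with.

```python
def f1_at_threshold(scores, targets, threshold):
    """F1 for one label when predicting positive wherever score > threshold."""
    tp = sum(1 for s, t in zip(scores, targets) if s > threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, targets) if s > threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, targets) if s <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, targets, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda th: f1_at_threshold(scores, targets, th))


# Made-up validation scores for one rare label: the model ranks the two
# positives highest but never pushes them above 0.5.
scores = [0.46, 0.37, 0.22, 0.11, 0.06, 0.02]
targets = [1, 1, 0, 0, 0, 0]

default_f1 = f1_at_threshold(scores, targets, 0.5)
tuned = best_threshold(scores, targets)
tuned_f1 = f1_at_threshold(scores, targets, tuned)

print(default_f1)  # 0.0
print(tuned, tuned_f1)
```

A threshold chosen this way should be checked on a separate test split, since tuning on very few positive examples can overfit.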

---

## 🛣️ Future Work

- Joint semantic search + multilabel prediction

---

## 👤 Author

Talip7