# 🧠 Scikit-learn GitHub Issues – Multilabel Classifier
This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.
The model is suitable for:
- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines
---
## 🔍 Task
**Multilabel Text Classification**
Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`).
The model predicts **all relevant labels** for a given issue text.
---
## 📦 Dataset
- **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository
- **Collection method**: Custom GitHub REST API pipeline
- **Preprocessing steps**:
  - Included **open and closed issues**
  - Excluded **pull requests**
  - Retrieved **all issue comments**
  - Exploded comments so each sample contains:
    - issue title
    - issue body
    - comments
  - Converted labels to **multi-hot vectors**
- **Dataset on Hugging Face**:
  👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel
**Final dataset size**: ~12,000 samples

**Number of unique labels**: 20+
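
The preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: the function `build_samples`, the simplified issue dictionaries (labels as plain strings, comments already attached to the issue), and the choice to emit one sample per comment are all assumptions. The one GitHub-specific fact it relies on is that the REST API's issues endpoint also returns pull requests, which carry a `pull_request` key.

```python
def build_samples(issues, label_names):
    """Turn simplified issue dicts into (text, multi-hot label vector) samples."""
    index = {name: i for i, name in enumerate(label_names)}
    samples = []
    for issue in issues:
        # Pull requests come back from the issues endpoint too; skip them.
        if "pull_request" in issue:
            continue
        # Multi-hot encoding: 1 at the position of every label on the issue.
        vector = [0] * len(label_names)
        for name in issue["labels"]:
            if name in index:
                vector[index[name]] = 1
        # "Explode" (assumed meaning): one sample per comment,
        # or a single sample when the issue has no comments.
        for comment in issue["comments"] or [""]:
            parts = (issue["title"], issue["body"], comment)
            text = "\n".join(part for part in parts if part)
            samples.append({"text": text, "labels": list(vector)})
    return samples


# Hypothetical toy data, not taken from the real dataset.
LABELS = ["Bug", "Documentation", "module:linear_model"]
issues = [
    {"title": "Ridge fails", "body": "Traceback ...",
     "labels": ["Bug", "module:linear_model"],
     "comments": ["Can reproduce.", "Fixed in main."]},
    {"title": "Add feature", "body": "", "labels": [], "comments": [],
     "pull_request": {"url": "..."}},
    {"title": "Typo in docs", "body": "See user guide.",
     "labels": ["Documentation"], "comments": []},
]
samples = build_samples(issues, LABELS)
```

Each resulting sample pairs one text string with a fixed-length 0/1 vector, which is the shape a multilabel classification head expects as targets.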
---
## 🧱 Model
- **Base model**: `distilbert-base-uncased`
- **Architecture**: `AutoModelForSequenceClassification`
- **Problem type**: `multi_label_classification`
- **Loss function**: Binary Cross Entropy with Logits
- **Activation**: Sigmoid
- **Prediction threshold**: 0.5
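
To make the loss, activation, and threshold settings concrete, here is a small dependency-free sketch of how per-label logits become predicted labels and how binary cross entropy with logits is computed. The helper names (`sigmoid`, `decode`, `bce_with_logits`) and the example logits are illustrative assumptions; in practice these steps are handled by the training framework.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode(logits, label_names, threshold=0.5):
    """Keep every label whose sigmoid probability exceeds the threshold."""
    return [name for name, z in zip(label_names, logits) if sigmoid(z) > threshold]


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over labels, computed directly from logits.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)), which equals
    -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))].
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical logits for three labels.
label_names = ["Bug", "Documentation", "module:linear_model"]
predicted = decode([2.0, -1.0, 0.3], label_names)
loss = bce_with_logits([2.0, -1.0, 0.3], [1.0, 0.0, 1.0])
```

Because each label gets its own independent sigmoid, any number of labels (including none) can be predicted for one issue. A threshold of 0.5 on the probability is equivalent to a threshold of 0 on the raw logit.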
---
## 📊 Evaluation Metrics
| Metric | Score |
|-----------|-------|
| Micro F1 | **0.857** |
| Macro F1 | 0.263 |
| Epochs | 3 |
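**Notes**:
- Micro F1 pools every (issue, label) decision, so it is dominated by frequent labels; the 0.857 score indicates strong performance on those.
- Macro F1 averages per-label F1 with equal weight, so labels with few training examples pull it down. The lower score is expected given the **severe label imbalance** common in real-world GitHub issue datasets.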
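
The gap between the two averages can be reproduced on a toy example. The sketch below is illustrative only: the data is made up (it is not this model's output), the function names are assumptions, and a label with no true or predicted positives is scored as 0.0 by convention here.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute (micro F1, macro F1) for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 is frequent and always predicted correctly;
# labels 1 and 2 each occur once and are missed.
y_true = [[1, 0, 0]] * 6 + [[1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0]] * 8
micro, macro = micro_macro_f1(y_true, y_pred)
```

In this toy case the micro average is 16/18 (about 0.89) while the macro average is only 1/3, because two of the three labels score zero even though they account for just two of the ten positive decisions.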
---
## 🧪 Training Details
- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256
- Validation split: 10%
- Best model selection: micro-F1
- Trained on GPU
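
Two of the settings above, the 10% validation split and best-model selection by micro-F1, can be sketched without any framework. This is an assumption-laden illustration: the helper names, the random seed, and the per-epoch scores are all hypothetical and do not come from the actual training run.

```python
import random


def train_val_split(samples, val_fraction=0.10, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seed value is an arbitrary choice
    n_val = int(round(len(samples) * val_fraction))
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val


def best_epoch(micro_f1_by_epoch):
    """Return the epoch whose validation micro-F1 is highest."""
    return max(micro_f1_by_epoch, key=micro_f1_by_epoch.get)


train, val = train_val_split(list(range(100)))

# Hypothetical validation scores for three epochs, for illustration only.
chosen = best_epoch({1: 0.80, 2: 0.84, 3: 0.83})
```

Selecting on micro-F1 favors checkpoints that do well on frequent labels; a different selection metric would be needed to prioritize rare ones.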
---
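## 🚀 Inference Example
```python
from transformers import pipeline

# top_k=None returns a score for every label; it replaces the deprecated
# return_all_scores=True.
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

# Passing a list yields one list of {"label", "score"} dicts per input text.
outputs = classifier([text])

# Keep every label whose score exceeds the 0.5 threshold.
labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
print(labels)
```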
---
## 🔗 Intended Use
- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines
---
## ⚠️ Limitations
- Rare labels have limited representation
- Threshold-based predictions may require tuning per use case
- Model is domain-specific to scikit-learn GitHub issues
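
One way to approach the threshold tuning mentioned above is to sweep candidate thresholds on held-out data and keep the one with the best micro-F1. The sketch below assumes a single global threshold and uses made-up scores; the function names and candidate values are illustrative, and per-label thresholds would follow the same pattern applied column by column.

```python
def micro_f1_at(y_true, scores, threshold):
    """Micro-F1 when every score above the threshold counts as a prediction."""
    tp = fp = fn = 0
    for truth, row in zip(y_true, scores):
        for t, s in zip(truth, row):
            p = 1 if s > threshold else 0
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(y_true, scores, candidates):
    """Return the candidate with the highest micro-F1 (first one wins ties)."""
    return max(candidates, key=lambda th: micro_f1_at(y_true, scores, th))


# Hypothetical validation labels and sigmoid scores for two labels.
y_true = [[1, 0], [1, 1], [0, 1]]
scores = [[0.9, 0.2], [0.8, 0.4], [0.3, 0.35]]
best = pick_threshold(y_true, scores, [0.3, 0.5, 0.7])
```

Here the second label never scores above 0.5, so lowering the threshold recovers it. Tuning should be done on data that is not used for the final evaluation.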
---
## 🛣️ Future Work
- Joint semantic search + multilabel prediction
---
## 👤 Author
Talip7