# Scikit-learn GitHub Issues: Multilabel Classifier
This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.
The model is suitable for:
- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines
---
## Task
**Multilabel Text Classification**
Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`).
The model predicts **all relevant labels** for a given issue text.
---
## Dataset
- **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository
- **Collection method**: Custom GitHub REST API pipeline
- **Preprocessing steps**:
  - Included **open and closed issues**
  - Excluded **pull requests**
  - Retrieved **all issue comments**
  - Exploded comments so each sample contains:
    - issue title
    - issue body
    - comments
  - Converted labels to **multi-hot vectors**
- **Dataset on Hugging Face**: https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel
**Final dataset size**: ~12,000 samples
**Number of unique labels**: 20+
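
The label-to-vector conversion listed in the preprocessing steps can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline actually used to build the dataset; the helper names `build_label_index` and `to_multi_hot` and the three example label sets are hypothetical.

```python
def build_label_index(samples):
    """Map each distinct label to a fixed column index (sorted for determinism)."""
    labels = sorted({label for sample in samples for label in sample})
    return {label: i for i, label in enumerate(labels)}


def to_multi_hot(sample_labels, label_index):
    """Return a 0/1 float vector with 1.0 in the column of every label present."""
    vector = [0.0] * len(label_index)
    for label in sample_labels:
        vector[label_index[label]] = 1.0
    return vector


# Hypothetical label sets for three issues (not taken from the dataset).
issue_labels = [
    ["Bug", "module:linear_model"],
    ["Documentation"],
    [],  # an unlabeled issue becomes an all-zero vector
]

label_index = build_label_index(issue_labels)
vectors = [to_multi_hot(labels, label_index) for labels in issue_labels]
print(label_index)
print(vectors)
```

Float vectors are used here because a binary cross-entropy loss typically expects floating-point targets.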
---
## Model
- **Base model**: `distilbert-base-uncased`
- **Architecture**: `AutoModelForSequenceClassification`
- **Problem type**: `multi_label_classification`
- **Loss function**: Binary Cross Entropy with Logits
- **Activation**: Sigmoid
- **Prediction threshold**: 0.5
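
To make the loss, activation, and threshold settings above concrete, here is a small dependency-free sketch of how they fit together for one sample. It is an illustration of the general technique, not the model's internal code; the logits and targets are made-up values, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict(logits, threshold=0.5):
    """Decide each label independently: sigmoid, then compare to the threshold."""
    return [1 if sigmoid(z) > threshold else 0 for z in logits]


logits = [2.0, -1.0, 0.0]   # hypothetical model outputs for three labels
targets = [1.0, 0.0, 1.0]   # hypothetical multi-hot ground truth

loss = bce_with_logits(logits, targets)
predicted = predict(logits)
print(round(loss, 4), predicted)
```

Because every label gets its own sigmoid, the scores do not sum to one, and any number of labels (including none) can pass the threshold.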
---
## Evaluation Metrics
| Metric | Score |
|-----------|-------|
| Micro F1 | **0.857** |
| Macro F1 | 0.263 |
| Epochs | 3 |
**Notes**:
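- Micro F1 pools every (issue, label) decision, so the 0.857 score mainly reflects strong performance on frequent labels.
- Macro F1 averages per-label F1 with equal weight, so rarely seen labels pull it down. A lower Macro F1 is expected under the **severe label imbalance** common in real-world GitHub issue datasets.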
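
The gap between the two averages can be reproduced on a toy example. The sketch below is plain Python with made-up data (one frequent label that is always predicted correctly, two rare labels that are never predicted); it illustrates the metric definitions and is not the evaluation code used for the scores above.

```python
def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: sum the counts over all labels, then compute a single F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 appears in all 8 samples, labels 1 and 2 once each.
y_true = [[1, 0, 0]] * 6 + [[1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0]] * 8  # the rare labels are never predicted

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the frequent label dominates the pooled counts, while the two missed rare labels each contribute an F1 of zero to the macro average.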
---
## Training Details
- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256
- Validation split: 10%
- Best model selection: micro-F1
- Trained on GPU
---
## Inference Example
```python
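from transformers import pipeline

# top_k=None returns a score for every label
# (it replaces the deprecated return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

outputs = classifier(text)
# A single input may come back as a flat or a nested list,
# depending on the transformers version.
scores = outputs[0] if isinstance(outputs[0], list) else outputs

# Keep every label whose score exceeds the 0.5 threshold.
labels = [o["label"] for o in scores if o["score"] > 0.5]
print(labels)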
```
---
## Intended Use
- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines
---
## Limitations
- Rare labels have limited representation
- Threshold-based predictions may require tuning per use case
- Model is domain-specific to scikit-learn GitHub issues
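
One possible way to tune thresholds, sketched below, is to pick a separate cutoff per label by maximizing F1 on held-out validation data. This is an assumption about how tuning could be done, not something shipped with the model; the function names, candidate grid, scores, and ground truth are all hypothetical.

```python
def f1_at(scores, truths, threshold):
    """F1 for one label when predicting positive for score > threshold."""
    preds = [s > threshold for s in scores]
    tp = sum(1 for p, t in zip(preds, truths) if p and t)
    fp = sum(1 for p, t in zip(preds, truths) if p and not t)
    fn = sum(1 for p, t in zip(preds, truths) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, truths, candidates):
    """Return the candidate with the highest F1; ties keep the earliest candidate."""
    return max(candidates, key=lambda th: f1_at(scores, truths, th))


# Hypothetical validation scores for one rare label: every score is below 0.5,
# so the default threshold would never predict this label.
scores = [0.35, 0.30, 0.10, 0.05, 0.20, 0.40]
truths = [1, 1, 0, 0, 0, 1]
candidates = [0.5, 0.4, 0.3, 0.25, 0.2, 0.1]

threshold = best_threshold(scores, truths, candidates)
print(threshold, f1_at(scores, truths, threshold))
```

Thresholds chosen this way should be selected on validation data and then checked on a separate test split, since tuning on the evaluation set inflates the reported score.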
---
## Future Work
- Joint semantic search + multilabel prediction
---
## Author
Talip7