Scikit-learn GitHub Issues: Multilabel Classifier
This repository contains a multilabel text classification model trained to predict GitHub issue labels for the scikit-learn project based on issue text and comments.
The model is suitable for:
- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines
Task
Multilabel Text Classification
Each GitHub issue can have multiple labels (e.g. Bug, Documentation, module:linear_model).
The model predicts all relevant labels for a given issue text.
Dataset
Source: GitHub issues from the scikit-learn/scikit-learn repository
Collection method: Custom GitHub REST API pipeline
Preprocessing steps:
- Included open and closed issues
- Excluded pull requests
- Retrieved all issue comments
- Exploded comments so each sample contains:
  - issue title
  - issue body
  - comments
- Converted labels to multi-hot vectors
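The filtering and label-encoding steps above can be sketched roughly as follows. This is a minimal illustration, not the actual collection pipeline: the helper names (`is_pull_request`, `build_label_index`, `to_multi_hot`) and the sample records are hypothetical. It relies on the fact that the GitHub REST API issues endpoint also returns pull requests, which are marked by a `pull_request` key.

```python
# Hypothetical records, shaped like trimmed GitHub REST API issue objects.
raw_items = [
    {"title": "Fix typo in docs", "labels": [{"name": "Documentation"}]},
    {"title": "Add feature", "labels": [], "pull_request": {"url": "..."}},
    {"title": "Crash in Ridge",
     "labels": [{"name": "Bug"}, {"name": "module:linear_model"}]},
]


def is_pull_request(item):
    """Pull requests returned by the issues endpoint carry a 'pull_request' key."""
    return "pull_request" in item


def build_label_index(issues):
    """Map every label name seen in the data to a fixed column position."""
    names = sorted({label["name"] for issue in issues for label in issue["labels"]})
    return {name: position for position, name in enumerate(names)}


def to_multi_hot(issue, label_index):
    """Return a float vector with 1.0 at the position of each label on the issue."""
    vector = [0.0] * len(label_index)
    for label in issue["labels"]:
        vector[label_index[label["name"]]] = 1.0
    return vector


issues = [item for item in raw_items if not is_pull_request(item)]
label_index = build_label_index(issues)
multi_hot = [to_multi_hot(issue, label_index) for issue in issues]
```

The vectors use floats rather than integers because a binary cross-entropy loss expects floating-point targets.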
Dataset on Hugging Face:
https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel
Final dataset size: ~12,000 samples
Number of unique labels: 20+ (approximate)
Model
- Base model: distilbert-base-uncased
- Architecture: AutoModelForSequenceClassification
- Problem type: multi_label_classification
- Loss function: Binary Cross Entropy with Logits
- Activation: Sigmoid
- Prediction threshold: 0.5
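To make the loss, activation, and threshold concrete, here is a small dependency-free sketch of what they compute for a single sample. The logits and targets are made-up values for three labels, and the function names are illustrative; in practice the deep learning framework performs these steps.

```python
import math


def sigmoid(z):
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Mark a label as predicted when its sigmoid score exceeds the threshold."""
    return [1 if sigmoid(z) > threshold else 0 for z in logits]


logits = [2.0, -1.0, 0.3]    # hypothetical model outputs for three labels
targets = [1.0, 0.0, 0.0]    # multi-hot ground truth
loss = bce_with_logits(logits, targets)
preds = predict(logits)
```

Because each label gets its own sigmoid, any number of labels (including none) can be predicted for one issue, unlike a softmax over mutually exclusive classes.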
Evaluation Metrics
| Metric | Score |
|---|---|
| Micro F1 | 0.857 |
| Macro F1 | 0.263 |
| Epochs | 3 |
Notes:
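- Micro F1 pools every label decision before computing the score, so the high value mainly reflects strong performance on frequent labels.
- Macro F1 averages per-label F1 with equal weight, so the low value indicates that rare labels are predicted poorly. This is expected under the severe label imbalance common in real-world GitHub issue datasets.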
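The gap between the two averages can be reproduced with a toy calculation. The per-label counts below are invented for illustration and are not taken from this model's evaluation; they only show how one frequent, well-predicted label can dominate micro F1 while rare labels pull macro F1 down.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical (tp, fp, fn) counts: one frequent label and two rare ones.
counts = {
    "Bug": (90, 10, 10),
    "module:linear_model": (1, 0, 4),
    "Documentation": (0, 0, 5),
}

# Macro: compute F1 per label, then average with equal weight.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: sum the counts across labels, then compute a single F1.
tp, fp, fn = (sum(column) for column in zip(*counts.values()))
micro_f1 = f1(tp, fp, fn)
```

With these counts, micro F1 comes out at about 0.86 and macro F1 at about 0.41, even though two of the three labels are handled badly.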
Training Details
- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256
- Validation split: 10%
- Best model selection: micro-F1
- Trained on GPU
Inference Example
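from transformers import pipeline

# Load the fine-tuned model. top_k=None returns a score for every label
# (it replaces the deprecated return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

# Passing a list yields one list of {"label", "score"} dicts per input text.
outputs = classifier([text])

# Keep every label whose score exceeds the 0.5 threshold.
labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
print(labels)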
Intended Use
- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines
Limitations
- Rare labels have limited representation
- Threshold-based predictions may require tuning per use case
- Model is domain-specific to scikit-learn GitHub issues
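One way to approach threshold tuning is to pick, for each label, the threshold that maximizes F1 on held-out validation data. The sketch below does this for a single rare label; the scores, ground truth, candidate grid, and function names are all hypothetical, and this procedure was not necessarily used for the published model.

```python
def f1_at_threshold(scores, truths, threshold):
    """F1 for one label when scores above the threshold count as positive."""
    tp = sum(1 for s, y in zip(scores, truths) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, truths) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, truths) if s <= threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(scores, truths, candidates):
    """Return the candidate with the highest F1 (the first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, truths, t))


candidates = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9

# Hypothetical validation scores and ground truth for one rare label.
scores = [0.35, 0.10, 0.42, 0.05, 0.30, 0.20]
truths = [1, 0, 1, 0, 0, 0]

threshold = best_threshold(scores, truths, candidates)
```

In this toy data the default 0.5 threshold misses every positive example, while a lower threshold recovers them. Repeating the search per label gives one threshold for each label, which should then be checked on data not used for tuning.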
Future Work
- Joint semantic search + multilabel prediction
Author
Talip7