🧠 Scikit-learn GitHub Issues – Multilabel Classifier

This repository contains a multilabel text classification model trained to predict GitHub issue labels for the scikit-learn project based on issue text and comments.

The model is suitable for:

- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines

🔍 Task

Multilabel Text Classification

Each GitHub issue can have multiple labels (e.g. Bug, Documentation, module:linear_model).
The model predicts all relevant labels for a given issue text.

📦 Dataset

Source: GitHub Issues from the scikit-learn/scikit-learn repository
Collection method: Custom GitHub REST API pipeline
Preprocessing steps:
- Included open and closed issues
- Excluded pull requests
- Retrieved all issue comments
- Exploded comments so each sample contains:
  - issue title
  - issue body
  - comments
- Converted labels to multi-hot vectors
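
The filtering and exploding steps above can be sketched roughly as follows. This is not the actual collection pipeline: `build_samples` and the `comments_text` field are hypothetical names, and keeping one empty-comment sample for issues without comments is an assumption. The one detail taken from the GitHub REST API is that the issues endpoint also returns pull requests, which carry a `pull_request` key.

```python
def build_samples(issues):
    """Turn raw issue records into one text sample per comment.

    `issues` is assumed to be a list of dicts shaped like GitHub REST API
    issue objects, plus a hypothetical "comments_text" list holding the
    comment bodies fetched separately.
    """
    samples = []
    for issue in issues:
        # Pull requests returned by the issues endpoint carry this key.
        if "pull_request" in issue:
            continue
        labels = [label["name"] for label in issue.get("labels", [])]
        # Assumption: an issue without comments still yields one sample.
        comments = issue.get("comments_text") or [""]
        for comment in comments:
            samples.append({
                "title": issue.get("title") or "",
                "body": issue.get("body") or "",
                "comment": comment,
                "labels": labels,
            })
    return samples


raw = [
    {"title": "Bug in Ridge", "body": "Fails with sample_weight.",
     "labels": [{"name": "Bug"}],
     "comments_text": ["Confirmed.", "Fix in progress."]},
    {"title": "Fix typo", "body": "", "labels": [],
     "pull_request": {"url": "..."}, "comments_text": []},
]
samples = build_samples(raw)
print(len(samples))  # 2: the pull request is dropped, the issue gives two rows
```
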
Dataset on Hugging Face:
👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel

Final dataset size: ~12,000 samples
Number of unique labels: 20+
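
For readers unfamiliar with multi-hot encoding, here is a minimal sketch of the label conversion mentioned in the preprocessing steps. The helper names `fit_label_index` and `to_multi_hot` are illustrative, not taken from the dataset code; the vectors are floats because binary cross entropy losses expect float targets.

```python
def fit_label_index(label_lists):
    """Map each distinct label name to a fixed column position."""
    names = sorted({name for labels in label_lists for name in labels})
    return {name: i for i, name in enumerate(names)}


def to_multi_hot(labels, label_index):
    """Return a float vector with 1.0 at the position of each present label."""
    vector = [0.0] * len(label_index)
    for name in labels:
        vector[label_index[name]] = 1.0
    return vector


label_lists = [["Bug", "module:linear_model"], ["Documentation"], []]
label_index = fit_label_index(label_lists)
vectors = [to_multi_hot(labels, label_index) for labels in label_lists]
print(label_index)
print(vectors)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

An issue with no labels maps to the all-zero vector, which is a valid target in the multilabel setting.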

🧱 Model

Base model: distilbert-base-uncased
Architecture: AutoModelForSequenceClassification
Problem type: multi_label_classification
Loss function: Binary Cross Entropy with Logits
Activation: Sigmoid
Prediction threshold: 0.5
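
To make the loss, activation, and threshold settings concrete, the sketch below reimplements them in plain Python for a single sample. It is only an illustration of the math (the model itself relies on the PyTorch implementation), and the logits are made-up values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Independent per-label decisions: 1 where sigmoid(logit) > threshold."""
    return [1 if sigmoid(x) > threshold else 0 for x in logits]


logits = [2.0, -1.0, 0.3]   # hypothetical model outputs for three labels
targets = [1.0, 0.0, 0.0]   # multi-hot ground truth
print(predict(logits))      # [1, 0, 1]
print(bce_with_logits(logits, targets))
```

Each label gets its own sigmoid, so the scores do not sum to 1 and any number of labels can pass the threshold.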

📊 Evaluation Metrics

Metric	Score
Micro F1	0.857
Macro F1	0.263
Epochs	3

Notes:

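Micro F1 pools true and false positives across all labels, so the 0.857 score is driven by the frequent labels, which the model predicts well.
Macro F1 averages per-label F1 with equal weight, so the 0.263 score shows that many rare labels are predicted poorly; this gap is expected under severe label imbalance, common in real-world GitHub issue datasets.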
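
The gap between the two averages can be reproduced on toy data. The sketch below is a plain-Python illustration, not the evaluation code used for the scores above; the data is invented so that one frequent label is always predicted correctly and one rare label is always missed.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true and y_pred are lists of multi-hot rows of equal width."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Micro: pool counts across labels, then compute a single F1.
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Label 0 is frequent and always right; label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.889 0.5
```
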

🧪 Training Details

Optimizer: AdamW
Learning rate: 2e-5
Batch size: 16
Max sequence length: 256
Validation split: 10%
Best model selection: micro-F1
Trained on GPU
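
As a rough sketch, the settings above could be collected as shown below. The dictionary keys follow Hugging Face `TrainingArguments` parameter names but are kept as a plain dict here; treating the batch size as per-device, the `micro_f1` metric key, the random split, and the seed are all assumptions, since the training script is not part of this card.

```python
import random

# Values from the training details and metrics sections.
hyperparams = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,   # assumption: batch size is per device
    "num_train_epochs": 3,
    "metric_for_best_model": "micro_f1",  # hypothetical metric key
    "load_best_model_at_end": True,
}
MAX_LENGTH = 256  # maximum tokens per sample after truncation


def train_val_split(samples, val_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out val_fraction for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(samples) * val_fraction))
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val


train, val = train_val_split(list(range(100)))
print(len(train), len(val))  # 90 10
```
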

🚀 Inference Example

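from transformers import pipeline

# top_k=None returns a score for every label
# (it replaces the deprecated return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

# Passing a list returns one list of {"label", "score"} dicts per input text.
outputs = classifier([text])

# Keep every label whose score exceeds the 0.5 threshold.
labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
print(labels)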

🔗 Intended Use

- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines

⚠️ Limitations

- Rare labels have limited representation
- Threshold-based predictions may require tuning per use case
- Model is domain-specific to scikit-learn GitHub issues
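
One possible way to tune the threshold is to pick, for each label, the value that maximizes F1 on held-out data. The sketch below shows this for a single label; the function names and the validation scores are hypothetical, and this procedure was not used to produce the reported metrics.

```python
def f1_at_threshold(scores, targets, threshold):
    """F1 for one label when predicting positive where score > threshold."""
    tp = sum(1 for s, t in zip(scores, targets) if s > threshold and t)
    fp = sum(1 for s, t in zip(scores, targets) if s > threshold and not t)
    fn = sum(1 for s, t in zip(scores, targets) if s <= threshold and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, targets, candidates=None):
    """Pick the candidate threshold with the highest F1 for one label."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda th: f1_at_threshold(scores, targets, th))


# Hypothetical validation scores for one rare label.
scores = [0.42, 0.35, 0.10, 0.05, 0.30, 0.02]
targets = [1, 1, 0, 0, 0, 0]
th = best_threshold(scores, targets)
print(th)
print(f1_at_threshold(scores, targets, 0.5))  # 0.0: nothing passes 0.5
print(f1_at_threshold(scores, targets, th))
```

In this toy case the default 0.5 threshold misses every positive, while a lower threshold recovers them, which is the typical pattern for under-represented labels.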

🛣️ Future Work

Joint semantic search + multilabel prediction

👤 Author

Talip7