
🧠 Scikit-learn GitHub Issues – Multilabel Classifier

This repository contains a multilabel text classification model trained to predict GitHub issue labels for the scikit-learn project based on issue text and comments.

The model is suitable for:

  • automated issue triage
  • label recommendation
  • downstream semantic search and filtering pipelines

🔍 Task

Multilabel Text Classification

Each GitHub issue can have multiple labels (e.g. Bug, Documentation, module:linear_model).
The model predicts all relevant labels for a given issue text.
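
As a minimal sketch of what "multilabel" means for the training targets, each issue's label set can be encoded as a multi-hot vector with one slot per known label. The label vocabulary below is a hypothetical three-label subset built from the examples above, not the model's actual label set, and the helper names are illustrative.

```python
# Hypothetical label vocabulary (illustrative subset; the real dataset has more labels).
LABELS = ["Bug", "Documentation", "module:linear_model"]
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}


def to_multi_hot(issue_labels):
    """Return a 0/1 vector with a 1 at the index of each known label."""
    vector = [0.0] * len(LABELS)
    for name in issue_labels:
        if name in LABEL_TO_INDEX:  # labels outside the vocabulary are ignored
            vector[LABEL_TO_INDEX[name]] = 1.0
    return vector


def from_multi_hot(vector):
    """Map a multi-hot vector back to label names."""
    return [LABELS[i] for i, value in enumerate(vector) if value == 1.0]


print(to_multi_hot(["Bug", "module:linear_model"]))  # [1.0, 0.0, 1.0]
```

Floats are used rather than integers because binary cross-entropy losses typically expect floating-point targets.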


📦 Dataset

  • Source: GitHub Issues from the scikit-learn/scikit-learn repository

  • Collection method: Custom GitHub REST API pipeline

  • Preprocessing steps:

    • Included open and closed issues
    • Excluded pull requests
    • Retrieved all issue comments
    • Exploded comments so each sample contains:
      • issue title
      • issue body
      • comments
    • Converted labels to multi-hot vectors
  • Dataset on Hugging Face:
    👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel
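
The preprocessing steps above might look roughly like the following sketch. It relies on one documented behavior of the GitHub REST API: the issues endpoint also returns pull requests, which carry a `pull_request` key. The sample records, the `build_text` helper, the comment lookup, and the way title, body, and comments are joined are all assumptions for illustration; the actual pipeline may combine the fields differently.

```python
def is_pull_request(item):
    """The GitHub issues endpoint marks pull requests with a 'pull_request' key."""
    return "pull_request" in item


def build_text(title, body, comments):
    """Join the text fields into one input string (hypothetical format)."""
    parts = [title, body or ""] + list(comments)
    return "\n".join(p.strip() for p in parts if p and p.strip())


# Hypothetical records shaped like GitHub REST API issue objects.
raw_items = [
    {"title": "Fix typo", "body": "Small fix.",
     "pull_request": {"url": "placeholder"}, "labels": []},
    {"title": "LinearRegression fails", "body": None,
     "labels": [{"name": "Bug"}]},
]
comments_by_index = {1: ["I can reproduce this."]}  # hypothetical comment lookup

samples = []
for i, item in enumerate(raw_items):
    if is_pull_request(item):
        continue  # pull requests are excluded from the dataset
    samples.append({
        "text": build_text(item["title"], item["body"], comments_by_index.get(i, [])),
        "labels": [label["name"] for label in item["labels"]],
    })

print(samples)
```

Note that an issue body can be empty, so the helper treats a missing body as an empty string.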

Final dataset size: ~12,000 samples
Number of unique labels: 20+ (approximate)


🧱 Model

  • Base model: distilbert-base-uncased
  • Architecture: AutoModelForSequenceClassification
  • Problem type: multi_label_classification
  • Loss function: Binary Cross Entropy with Logits
  • Activation: Sigmoid
  • Prediction threshold: 0.5
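
To make the output head concrete, here is a minimal pure-Python sketch of how sigmoid activation, a 0.5 threshold, and binary cross-entropy with logits fit together. The logits and targets are made-up numbers and the helper names are hypothetical; in practice a framework's built-in loss would be used (for example `torch.nn.BCEWithLogitsLoss` in PyTorch).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, in a numerically stable form:
    max(x, 0) - x * y + log(1 + exp(-|x|))."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict(logits, threshold=0.5):
    """Decide each label independently: 1 if its probability exceeds the threshold."""
    return [1 if sigmoid(x) > threshold else 0 for x in logits]


logits = [2.0, -1.0, 0.3]    # made-up raw model outputs, one per label
targets = [1.0, 0.0, 0.0]    # made-up multi-hot ground truth
print(predict(logits))
print(bce_with_logits(logits, targets))
```

Because each label gets its own sigmoid, the probabilities do not need to sum to 1, which is what allows zero, one, or several labels to be predicted for the same issue.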

📊 Evaluation Metrics

| Metric | Score |
|---|---|
| Micro F1 | 0.857 |
| Macro F1 | 0.263 |
| Epochs | 3 |

Notes:

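  • Micro F1 pools true positives, false positives, and false negatives across all labels, so the 0.857 score mainly reflects performance on the frequent labels.
  • Macro F1 averages the per-label F1 scores with equal weight, so the much lower 0.263 indicates weak performance on rare labels. This is expected under severe label imbalance, which is common in real-world GitHub issue datasets.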
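
The gap between the two averages can be reproduced on toy data. The sketch below uses made-up predictions (not the model's evaluation set) with one frequent label that is always predicted correctly and two rare labels that are never predicted; the function names are illustrative.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over labels, then compute a single F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 is frequent, labels 1 and 2 each occur once and are missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.833 0.333
```

Here the two missed rare labels cost little in the pooled micro score but pull the macro score down to one third, since each of them contributes a per-label F1 of zero.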

🧪 Training Details

  • Optimizer: AdamW
  • Learning rate: 2e-5
  • Batch size: 16
  • Max sequence length: 256
  • Validation split: 10%
  • Best model selection: micro-F1
  • Trained on GPU
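
For readers who want to reproduce a similar setup, the sketch below gathers the listed hyperparameters in one place and shows one simple way to hold out 10% of the samples for validation. The random shuffle, the seed value, and the function name are assumptions; the source does not state how the split was actually drawn (for example, whether it was stratified by label).

```python
import random

# Hyperparameters as listed in this model card.
HPARAMS = {
    "learning_rate": 2e-5,
    "batch_size": 16,
    "max_length": 256,
    "epochs": 3,
    "validation_fraction": 0.10,
}


def train_validation_split(samples, validation_fraction, seed=42):
    """Shuffle indices reproducibly and hold out a fraction for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    n_val = int(round(len(samples) * validation_fraction))
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val


samples = [f"issue {i}" for i in range(100)]  # placeholder data
train, val = train_validation_split(samples, HPARAMS["validation_fraction"])
print(len(train), len(val))  # 90 10
```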

🚀 Inference Example

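from transformers import pipeline

# Load the fine-tuned model. top_k=None returns a score for every label
# (it replaces the deprecated return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
    function_to_apply="sigmoid",  # independent per-label probabilities
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

# Passing a list yields one list of {"label", "score"} dicts per input text.
outputs = classifier([text])

# Keep every label whose probability exceeds the 0.5 threshold.
labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
print(labels)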

🔗 Intended Use

  • Automated GitHub issue labeling
  • Developer productivity tools
  • Search and recommendation systems
  • Foundation for semantic search + classification pipelines


⚠️ Limitations

  • Rare labels have limited representation
  • Threshold-based predictions may require tuning per use case
  • Model is domain-specific to scikit-learn GitHub issues
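
One common way to tune thresholds is to pick, for each label, the cutoff that maximizes F1 on held-out data. The sketch below does this for a single label; the validation scores, targets, and candidate grid are hypothetical values chosen for illustration, and this is not necessarily how the model's 0.5 threshold was chosen.

```python
def f1_at_threshold(scores, targets, threshold):
    """F1 for one label when predicting positive for scores above the threshold."""
    tp = sum(1 for s, t in zip(scores, targets) if s > threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, targets) if s > threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, targets) if s <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, targets, candidates):
    """Return the candidate threshold with the highest F1 (first wins on ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(scores, targets, th))


# Hypothetical validation scores for one rare label.
scores = [0.45, 0.30, 0.15, 0.05, 0.35, 0.02]
targets = [1, 1, 0, 0, 0, 0]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

print(best_threshold(scores, targets, candidates))  # 0.2
print(f1_at_threshold(scores, targets, 0.5))        # 0.0
```

In this toy case the default 0.5 cutoff never fires for the rare label, while a lower cutoff recovers both positives at the cost of one false positive. Thresholds tuned this way should be validated on data not used for tuning.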


🛣️ Future Work

  • Joint semantic search + multilabel prediction


👤 Author

Talip7