| # 🧠 Scikit-learn GitHub Issues - Multilabel Classifier |
|
|
| This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments. |
|
|
| The model is suitable for: |
| - automated issue triage |
| - label recommendation |
| - downstream semantic search and filtering pipelines |
|
|
| --- |
|
|
| ## Task |
|
|
| **Multilabel Text Classification** |
|
|
| Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`). |
| The model predicts **all relevant labels** for a given issue text. |
|
|
| --- |
|
|
| ## 📦 Dataset |
|
|
| - **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository |
| - **Collection method**: Custom GitHub REST API pipeline |
| - **Preprocessing steps**: |
|   - Included **open and closed issues** |
|   - Excluded **pull requests** |
|   - Retrieved **all issue comments** |
|   - Exploded comments so each sample contains: |
|     - issue title |
|     - issue body |
|     - comments |
|   - Converted labels to **multi-hot vectors** |
|
|
| - **Dataset on Hugging Face**: https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel |
|
|
| **Final dataset size**: ~12,000 samples |
| **Number of unique labels**: 20 or more (approximate) |
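
The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration and not the author's pipeline. The `title`, `body`, `labels`, and `pull_request` fields follow the GitHub REST API issue objects, where a `pull_request` key marks a pull request. The `comments_text` field, the helper names, and the choice to join all comments into one text per issue are assumptions made for the example; the original "exploded" layout may instead use one row per comment.

```python
# Hypothetical records shaped like GitHub REST API issue objects.
# "comments_text" is an illustrative field holding already-fetched comment bodies.
raw_items = [
    {
        "title": "LinearRegression fails with sample_weight",
        "body": "Traceback attached.",
        "labels": [{"name": "Bug"}, {"name": "module:linear_model"}],
        "comments_text": ["Can reproduce on main."],
    },
    {
        "title": "Fix typo in docs",
        "body": "",
        "labels": [{"name": "Documentation"}],
        "pull_request": {},  # presence of this key marks a pull request
        "comments_text": [],
    },
    {
        "title": "Clarify the Ridge docstring",
        "body": None,
        "labels": [{"name": "Documentation"}],
        "comments_text": [],
    },
]


def is_issue(item):
    """Keep plain issues and drop pull requests."""
    return "pull_request" not in item


def build_text(item):
    """Join title, body, and comments into one input string (assumed layout)."""
    parts = [item["title"], item.get("body") or "", *item["comments_text"]]
    return "\n".join(part for part in parts if part)


def multi_hot(label_names, label_list):
    """Return one float per known label: 1.0 if present on the issue, else 0.0."""
    present = set(label_names)
    return [1.0 if name in present else 0.0 for name in label_list]


issues = [item for item in raw_items if is_issue(item)]

# A fixed, sorted label vocabulary keeps vector positions stable across samples.
label_list = sorted({lab["name"] for item in issues for lab in item["labels"]})

samples = [
    {
        "text": build_text(item),
        "labels": multi_hot([lab["name"] for lab in item["labels"]], label_list),
    }
    for item in issues
]

for sample in samples:
    print(sample["labels"], repr(sample["text"]))
```

Float-valued vectors are used here because binary cross-entropy losses generally expect floating-point targets.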
|
|
| --- |
|
|
| ## 🧱 Model |
|
|
| - **Base model**: `distilbert-base-uncased` |
| - **Architecture**: `AutoModelForSequenceClassification` |
| - **Problem type**: `multi_label_classification` |
| - **Loss function**: Binary Cross Entropy with Logits |
| - **Activation**: Sigmoid |
| - **Prediction threshold**: 0.5 |
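
To make the loss, activation, and threshold above concrete, here is a dependency-free sketch of what they compute for a single sample. It is an illustration under assumptions, not the training code: the logits, targets, and label names are invented, and the loss is averaged over labels in the way a mean-reduced binary cross-entropy with logits would be.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)) per label,
    which equals -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
    """
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict_labels(logits, label_names, threshold=0.5):
    """Each label is decided independently: keep it if its sigmoid score exceeds the threshold."""
    return [name for z, name in zip(logits, label_names) if sigmoid(z) > threshold]


# Illustrative values only; a real model emits one logit per label in its config.
label_names = ["Bug", "Documentation", "module:linear_model"]
logits = [2.0, -3.0, 0.4]
targets = [1.0, 0.0, 1.0]

predicted = predict_labels(logits, label_names)
loss = bce_with_logits(logits, targets)
print(predicted, round(loss, 4))
```

Because each label gets its own sigmoid rather than sharing a softmax, any number of labels (including none) can pass the threshold for a given issue.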
|
|
| --- |
|
|
| ## Evaluation Metrics |
|
|
| | Metric | Score | |
| |-----------|-------| |
| | Micro F1 | **0.857** | |
| | Macro F1 | 0.263 | |
| | Epochs | 3 | |
|
|
| **Notes**: |
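| - Micro F1 pools true and false positives across all labels, so it is dominated by frequent labels; 0.857 indicates strong performance on those. |
| - Macro F1 averages per-label F1 with equal weight, so the low score (0.263) is expected under **severe label imbalance**, which is common in real-world GitHub issue datasets: rare labels with few training examples pull the average down. |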
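
The gap between the two scores can be reproduced on a toy example. The sketch below computes both averages from scratch on invented data where one frequent label is predicted perfectly and two rare labels are never predicted. The data and the convention of scoring an undefined F1 as 0.0 are assumptions for illustration; the reported metrics were not necessarily computed this way.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for multi-hot rows of 0/1 values."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 is frequent and always right; labels 1 and 2 are rare and missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Two missed rare labels barely move the pooled score but contribute two zeros out of three to the macro average.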
|
|
| --- |
|
|
| ## 🧪 Training Details |
|
|
| - Optimizer: AdamW |
| - Learning rate: 2e-5 |
| - Batch size: 16 |
| - Max sequence length: 256 |
| - Validation split: 10% |
| - Best model selection: micro-F1 |
| - Trained on GPU |
|
|
| --- |
|
|
| ## Inference Example |
|
|
| ```python |
| from transformers import pipeline |
| |
| classifier = pipeline( |
|     "text-classification", |
|     model="Talip7/scikit-learn-multilabel-classifier", |
| ) |
| |
| text = """ |
| Bug occurs in LinearRegression when sample_weight is used. |
| The issue happens after upgrading numpy. |
| """ |
| |
| # top_k=None returns a score for every label; it replaces the deprecated |
| # return_all_scores=True. Passing a list yields one list of results per text. |
| outputs = classifier([text], top_k=None) |
| |
| # Keep every label whose score exceeds the 0.5 threshold. |
| labels = [o["label"] for o in outputs[0] if o["score"] > 0.5] |
| print(labels) |
| ``` |
|
|
| --- |
|
|
| ## Intended Use |
|
|
| - Automated GitHub issue labeling |
| - Developer productivity tools |
| - Search and recommendation systems |
| - Foundation for semantic search + classification pipelines |
|
|
| --- |
|
|
| ## ⚠️ Limitations |
|
|
| - Rare labels have limited representation |
| - Threshold-based predictions may require tuning per use case |
| - Model is domain-specific to scikit-learn GitHub issues |
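
One possible way to tune thresholds per use case is to pick a separate cutoff for each label on held-out data. The sketch below is a hypothetical approach, not something the model card describes: it chooses the candidate threshold that maximizes each label's F1. The scores, ground truth, and candidate grid are invented for illustration.

```python
def f1_at(scores, truths, threshold):
    """F1 for one label when predicting positive wherever score > threshold."""
    tp = sum(1 for s, t in zip(scores, truths) if s > threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, truths) if s > threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, truths) if s <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(val_scores, val_truths, candidates):
    """Pick, per label, the candidate threshold with the best validation F1.

    max() keeps the earliest candidate on ties, so listing 0.5 first keeps
    the default unless another value is strictly better.
    """
    n_labels = len(val_scores[0])
    best = []
    for j in range(n_labels):
        col_scores = [row[j] for row in val_scores]
        col_truths = [row[j] for row in val_truths]
        best.append(
            max(candidates, key=lambda th: f1_at(col_scores, col_truths, th))
        )
    return best


# Invented validation data: rows are samples, columns are labels.
# Label 1 mimics a rare label whose scores stay below 0.5 even when it applies.
val_scores = [[0.90, 0.32], [0.80, 0.12], [0.25, 0.35], [0.15, 0.05]]
val_truths = [[1, 1], [1, 0], [0, 1], [0, 0]]
candidates = [0.5, 0.4, 0.3, 0.2, 0.1]

thresholds = tune_thresholds(val_scores, val_truths, candidates)
print(thresholds)
```

Thresholds tuned this way should be chosen on a validation split and checked on separate data, since a cutoff fitted to a handful of rare-label examples can overfit.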
|
|
| --- |
|
|
| ## 🛣️ Future Work |
|
|
| Joint semantic search + multilabel prediction |
|
|
| --- |
|
|
| ## 👤 Author |
|
|
| Talip7 |