# 🧠 Scikit-learn GitHub Issues – Multilabel Classifier

This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.

The model is suitable for:

- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines

---

## 🔍 Task

**Multilabel Text Classification**

Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`). The model predicts **all relevant labels** for a given issue text.

---

## 📦 Dataset

- **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository
- **Collection method**: Custom GitHub REST API pipeline
- **Preprocessing steps**:
  - Included **open and closed issues**
  - Excluded **pull requests**
  - Retrieved **all issue comments**
  - Exploded comments so each sample contains:
    - issue title
    - issue body
    - comments
  - Converted labels to **multi-hot vectors**
- **Dataset on Hugging Face**: 👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel

**Final dataset size**: ~12,000 samples

**Number of unique labels**: ~20+

---

## 🧱 Model

- **Base model**: `distilbert-base-uncased`
- **Architecture**: `AutoModelForSequenceClassification`
- **Problem type**: `multi_label_classification`
- **Loss function**: Binary Cross Entropy with Logits
- **Activation**: Sigmoid
- **Prediction threshold**: 0.5

---

## 📊 Evaluation Metrics

| Metric   | Score     |
|----------|-----------|
| Micro F1 | **0.857** |
| Macro F1 | 0.263     |
| Epochs   | 3         |

**Notes**:

- Micro F1 pools every label decision before computing the score, so it is dominated by frequent labels; the 0.857 value reflects strong overall performance.
- Macro F1 averages per-label F1 with equal weight, so poorly predicted rare labels pull it down. The lower score is expected given the **severe label imbalance** common in real-world GitHub issue datasets.
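The gap between the two averages can be reproduced on a toy example. The sketch below is plain Python and is not the evaluation code used for this model: the logits and labels are made up for illustration, and scoring a label with no correct predictions as F1 = 0 is an assumed convention. It applies the same sigmoid and 0.5 threshold listed in the Model section, then computes both averages.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_multi_hot(logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold every label independently."""
    return [[1 if sigmoid(z) > threshold else 0 for z in row] for row in logits]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    # Assumed convention: a label with no true positives, false positives,
    # or false negatives scores 0.0 instead of raising a division error.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for lists of multi-hot vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts across labels, then compute a single F1.
    micro = f1_from_counts(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    return micro, macro


# Hypothetical data. Columns: Bug, Documentation, module:linear_model.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
logits = [
    [2.0, -3.0, -2.0],
    [1.5, -2.0, -3.0],
    [3.0, -0.5, -2.0],  # misses the rare Documentation label
    [2.5, -1.0, -0.2],  # misses the rare module:linear_model label
]

y_pred = predict_multi_hot(logits)
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the frequent label is always predicted correctly and the two rare labels are always missed, which gives a micro F1 of 0.800 and a macro F1 of 0.333. The pattern is the same as in the table above: a high micro score alongside a much lower macro score.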
---

## 🧪 Training Details

- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256
- Validation split: 10%
- Best model selection: micro-F1
- Trained on GPU

---

## 🚀 Inference Example

```python
from transformers import pipeline

# top_k=None returns a score for every label instead of only the top one
# (it replaces the deprecated return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

outputs = classifier(text)
# Some transformers versions nest single-input results one level deep.
scores = outputs[0] if isinstance(outputs[0], list) else outputs

# Keep every label whose score exceeds the 0.5 threshold.
labels = [s["label"] for s in scores if s["score"] > 0.5]
print(labels)
```

---

## 🔗 Intended Use

- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines

---

## ⚠️ Limitations

- Rare labels have limited representation in the training data, so predictions for them are less reliable
- The fixed 0.5 prediction threshold may require tuning per use case
- The model is domain-specific to scikit-learn GitHub issues

---

## 🛣️ Future Work

- Joint semantic search + multilabel prediction

---

## 👤 Author

Talip7