Talip7/scikit-learn-multilabel-classifier

Talip7 committed on Jan 4

Commit 0331aa3 (verified) · 1 parent: e0d6c35

Update README.md

Files changed (1)

README.md +115 -43


@@ -1,62 +1,134 @@
 ---
-library_name: transformers
-license: apache-2.0
-base_model: distilbert-base-uncased
-tags:
-- generated_from_trainer
-model-index:
-- name: results
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# results
-This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.0219
-- Micro F1: 0.8570
-- Macro F1: 0.2635
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 2e-05
-- train_batch_size: 16
-- eval_batch_size: 16
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- num_epochs: 3
-### Training results
-| Training Loss | Epoch | Step | Validation Loss | Micro F1 | Macro F1 |
-|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|
-| 0.0383        | 1.0   | 703  | 0.0328          | 0.7591   | 0.1626   |
-| 0.0279        | 2.0   | 1406 | 0.0245          | 0.8379   | 0.2323   |
-| 0.0237        | 3.0   | 2109 | 0.0219          | 0.8570   | 0.2635   |
-### Framework versions
-- Transformers 4.57.3
-- Pytorch 2.9.0+cu126
-- Datasets 4.0.0
-- Tokenizers 0.22.1

+# 🧠 Scikit-learn GitHub Issues – Multilabel Classifier
+This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.
+The model is suitable for:
+- automated issue triage
+- label recommendation
+- downstream semantic search and filtering pipelines
+---
+## 🔍 Task
+**Multilabel Text Classification**
+Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`).
+The model predicts **all relevant labels** for a given issue text.
+---
+## 📦 Dataset
+- **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository
+- **Collection method**: Custom GitHub REST API pipeline
+- **Preprocessing steps**:
+  - Included **open and closed issues**
+  - Excluded **pull requests**
+  - Retrieved **all issue comments**
+  - Exploded comments so each sample contains:
+    - issue title
+    - issue body
+    - comments
+  - Converted labels to **multi-hot vectors**
+- **Dataset on Hugging Face**:
+  👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel
+**Final dataset size**: ~12,000 samples
+**Number of unique labels**: more than 20 (approximate)
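The multi-hot conversion mentioned in the preprocessing steps can be sketched as follows. This is a minimal illustration, not the author's pipeline: the three-label vocabulary is hypothetical (borrowed from the examples in the Task section), and the helper name `to_multi_hot` is made up for this sketch. Floats are used because binary cross-entropy losses in PyTorch expect float targets.

```python
# Hypothetical label vocabulary; the real label set comes from the dataset.
LABELS = ["Bug", "Documentation", "module:linear_model"]
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}


def to_multi_hot(issue_labels):
    """Return a float vector with 1.0 at the position of every label present."""
    vector = [0.0] * len(LABELS)
    for name in issue_labels:
        index = LABEL_TO_INDEX.get(name)
        if index is not None:  # labels outside the vocabulary are ignored
            vector[index] = 1.0
    return vector


example = to_multi_hot(["Bug", "module:linear_model"])
print(example)  # [1.0, 0.0, 1.0]
```

An issue with no known labels maps to an all-zero vector, which is a valid target in the multilabel setting.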
+---
+## 🧱 Model
+- **Base model**: `distilbert-base-uncased`
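+- **Architecture**: DistilBERT with a sequence-classification head, loaded via `AutoModelForSequenceClassification`
+- **Problem type**: `multi_label_classification`
+- **Loss function**: Binary cross-entropy with logits (`BCEWithLogitsLoss` in PyTorch)
+- **Output activation**: Sigmoid, applied to each label independently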
+- **Prediction threshold**: 0.5
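How the loss, the sigmoid activation, and the 0.5 threshold fit together can be shown with a small pure-Python sketch. The logits, targets, and label names below are made up for illustration, and the function names are assumptions of this sketch rather than anything taken from the training code.

```python
import math


def sigmoid(logit):
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def decode(logits, label_names, threshold=0.5):
    """Keep every label whose sigmoid score exceeds the threshold."""
    return [name for name, z in zip(label_names, logits) if sigmoid(z) > threshold]


label_names = ["Bug", "Documentation", "module:linear_model"]  # hypothetical
logits = [2.0, -3.0, 0.4]   # made-up model outputs for one issue
targets = [1.0, 0.0, 1.0]   # made-up multi-hot ground truth

print(decode(logits, label_names))  # ['Bug', 'module:linear_model']
print(bce_with_logits(logits, targets))
```

Because each label gets its own sigmoid, the scores do not sum to one, and any number of labels (including none) can pass the threshold.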
 ---
+## 📊 Evaluation Metrics
+| Metric     | Score |
+|-----------|-------|
+| Micro F1  | **0.857** |
+| Macro F1  | 0.263 |
+| Epochs    | 3 |
+**Notes**:
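+- Micro F1 pools every label decision before computing F1, so it is dominated by frequent labels; 0.857 indicates strong performance on those.
+- Macro F1 averages per-label F1 with equal weight, so rare labels that the model seldom predicts pull it down to 0.263. This gap is expected under **severe label imbalance**, which is common in real-world GitHub issue datasets.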
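The gap between the two averages can be reproduced on a toy example. The sketch below uses made-up data (one frequent label, two rare labels that are never predicted) and hypothetical helper names; it is not the evaluation code used for the scores above.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for multi-hot truth and prediction rows."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 appears on every issue, labels 1 and 2 appear once each.
y_true = [[1, 0, 0]] * 8 + [[1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0]] * 10  # the model only ever predicts the frequent label

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.909 macro=0.333
```

Two missed rare labels barely move the pooled score but zero out two of the three per-label scores, which is the same pattern as the 0.857 versus 0.263 result.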
 ---
+## 🧪 Training Details
+- Optimizer: AdamW
+- Learning rate: 2e-5
+- Batch size: 16
+- Max sequence length: 256
+- Validation split: 10%
+- Best model selection: micro-F1
+- Trained on GPU
+---
+## 🚀 Inference Example
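+```python
+from transformers import pipeline
+
+# Load the fine-tuned checkpoint from the Hub.
+classifier = pipeline(
+    "text-classification",
+    model="Talip7/scikit-learn-multilabel-classifier",
+)
+
+text = """
+Bug occurs in LinearRegression when sample_weight is used.
+The issue happens after upgrading numpy.
+"""
+
+# top_k=None returns a score for every label (it replaces the deprecated
+# return_all_scores=True); sigmoid scores each label independently.
+outputs = classifier([text], top_k=None, function_to_apply="sigmoid")
+
+# outputs holds one list of {"label", "score"} dicts per input text.
+labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
+print(labels)
+```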
+---
+## 🔗 Intended Use
+- Automated GitHub issue labeling
+- Developer productivity tools
+- Search and recommendation systems
+- Foundation for semantic search + classification pipelines
+---
+## ⚠️ Limitations
+- Rare labels have limited representation
+- Threshold-based predictions may require tuning per use case
+- Model is domain-specific to scikit-learn GitHub issues
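One common way to tune the threshold is to pick, per label, the value that maximizes F1 on held-out validation scores. The sketch below assumes that approach; the scores, ground truth, grid, and function names are all made up for illustration and do not come from this model.

```python
def f1_at(scores, truths, threshold):
    """F1 for one label when predicting positive for scores above threshold."""
    tp = sum(1 for s, t in zip(scores, truths) if s > threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, truths) if s > threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, truths) if s <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, truths, grid=None):
    """Return the grid value with the highest F1 (the lowest one on ties)."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda th: f1_at(scores, truths, th))


# Made-up validation scores for a single rare label that the model under-scores.
val_scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.80]
val_truths = [0, 0, 1, 1, 1, 1]

default_f1 = f1_at(val_scores, val_truths, 0.5)
tuned = best_threshold(val_scores, val_truths)
tuned_f1 = f1_at(val_scores, val_truths, tuned)
print(f"F1 at 0.5: {default_f1:.2f}, tuned threshold: {tuned:.2f}, F1: {tuned_f1:.2f}")
```

Thresholds chosen this way should be selected on validation data only and then checked on a separate test split, since a small number of positive examples makes the choice noisy.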
+---
+## 🛣️ Future Work
+- Joint semantic search + multilabel prediction
+---
+## 👤 Author
+Talip7