Talip7 committed on
Commit 0331aa3 · verified · 1 Parent(s): e0d6c35

Update README.md

  1. README.md +115 -43
README.md CHANGED
@@ -1,62 +1,134 @@
  ---
- library_name: transformers
- license: apache-2.0
- base_model: distilbert-base-uncased
- tags:
- - generated_from_trainer
- model-index:
- - name: results
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # results

- This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0219
- - Micro F1: 0.8570
- - Macro F1: 0.2635

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 16
- - eval_batch_size: 16
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 3

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss | Micro F1 | Macro F1 |
- |:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|
- | 0.0383 | 1.0 | 703 | 0.0328 | 0.7591 | 0.1626 |
- | 0.0279 | 2.0 | 1406 | 0.0245 | 0.8379 | 0.2323 |
- | 0.0237 | 3.0 | 2109 | 0.0219 | 0.8570 | 0.2635 |


- ### Framework versions

- - Transformers 4.57.3
- - Pytorch 2.9.0+cu126
- - Datasets 4.0.0
- - Tokenizers 0.22.1

+ # 🧠 Scikit-learn GitHub Issues – Multilabel Classifier
+
+ This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.
+
+ The model is suitable for:
+ - automated issue triage
+ - label recommendation
+ - downstream semantic search and filtering pipelines
+
+ ---
+
+ ## 🔍 Task
+
+ **Multilabel Text Classification**
+
+ Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`).
+ The model predicts **all relevant labels** for a given issue text.
+
+ ---
+
+ ## 📦 Dataset
+
+ - **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository
+ - **Collection method**: Custom GitHub REST API pipeline
+ - **Preprocessing steps**:
+   - Included **open and closed issues**
+   - Excluded **pull requests**
+   - Retrieved **all issue comments**
+   - Exploded comments so each sample contains:
+     - issue title
+     - issue body
+     - comments
+   - Converted labels to **multi-hot vectors**
+
+ - **Dataset on Hugging Face**:
+   👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel
+
+ **Final dataset size**: ~12,000 samples
+ **Number of unique labels**: ~20+
+
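The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's pipeline: the helper names (`is_pull_request`, `build_text`, `to_multi_hot`), the sample records, the `comments_text` field, and the way title, body, and comments are joined are all hypothetical. It relies on one documented behavior of the GitHub REST API: the issues endpoint also returns pull requests, and those items carry a `pull_request` key.

```python
# Hypothetical records shaped like trimmed GitHub REST API issue objects.
# "comments_text" is an assumed field holding already-fetched comment bodies.
raw_items = [
    {"title": "LinearRegression fails with sample_weight",
     "body": "Traceback attached.",
     "labels": [{"name": "Bug"}, {"name": "module:linear_model"}],
     "comments_text": ["Can reproduce on main."]},
    {"title": "Fix typo in docs",
     "body": "Small fix.",
     "labels": [{"name": "Documentation"}],
     "pull_request": {},  # presence of this key marks a pull request
     "comments_text": []},
    {"title": "Clarify Ridge docstring",
     "body": None,
     "labels": [{"name": "Documentation"}],
     "comments_text": []},
]


def is_pull_request(item):
    """Pull requests returned by the issues endpoint have a 'pull_request' key."""
    return "pull_request" in item


def build_text(item):
    """Join title, body, and comments into one input string (assumed format)."""
    parts = [item["title"], item.get("body") or "", *item["comments_text"]]
    return "\n".join(p for p in parts if p)


def to_multi_hot(label_names, label_index):
    """Encode a list of label names as a float multi-hot vector."""
    vec = [0.0] * len(label_index)  # floats, as BCE-style losses expect
    for name in label_names:
        vec[label_index[name]] = 1.0
    return vec


issues = [it for it in raw_items if not is_pull_request(it)]
all_labels = sorted({lab["name"] for it in issues for lab in it["labels"]})
label_index = {name: i for i, name in enumerate(all_labels)}

samples = [
    {"text": build_text(it),
     "labels": to_multi_hot([lab["name"] for lab in it["labels"]], label_index)}
    for it in issues
]
print(all_labels)
print(samples[0]["labels"])
```

Keeping the label order in a single sorted list means the same index maps to the same label at training and inference time.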
+ ---
+
+ ## 🧱 Model
+
+ - **Base model**: `distilbert-base-uncased`
+ - **Architecture**: `AutoModelForSequenceClassification`
+ - **Problem type**: `multi_label_classification`
+ - **Loss function**: Binary Cross Entropy with Logits
+ - **Activation**: Sigmoid
+ - **Prediction threshold**: 0.5
+
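To make the last three bullets concrete, here is a minimal pure-Python sketch of how per-label logits become predictions and how binary cross-entropy with logits scores them. The logits, targets, and label names are made-up values for illustration; in practice these operations run inside PyTorch rather than in hand-written code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict(logits, label_names, threshold=0.5):
    """Each label is decided independently: keep it if sigmoid(logit) > threshold."""
    return [name for z, name in zip(logits, label_names) if sigmoid(z) > threshold]


label_names = ["Bug", "Documentation", "module:linear_model"]  # illustrative
logits = [2.0, -3.0, 0.4]    # hypothetical model outputs for one issue
targets = [1.0, 0.0, 1.0]    # multi-hot ground truth

print(predict(logits, label_names))  # ['Bug', 'module:linear_model']
print(round(bce_with_logits(logits, targets), 4))
```

Because each label gets its own sigmoid, the probabilities do not sum to one, and any number of labels (including none) can pass the threshold.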
  ---
+
+ ## 📊 Evaluation Metrics
+
+ | Metric | Score |
+ |-----------|-------|
+ | Micro F1 | **0.857** |
+ | Macro F1 | 0.263 |
+ | Epochs | 3 |
+
+ **Notes**:
+ - Micro F1 pools every label decision, so it mostly reflects performance on the frequent labels, which is strong.
+ - The much lower Macro F1 weights every label equally and is expected under the **severe label imbalance** common in real-world GitHub issue datasets: rare labels tend to score poorly and pull the average down.
+
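The gap between the two scores follows from how each is averaged. The sketch below computes both by hand on a toy multi-hot example (the arrays are invented for illustration, not taken from the evaluation set). Micro F1 pools true positives, false positives, and false negatives over all labels, while macro F1 averages the per-label F1 scores, so rare labels that are never predicted drag macro F1 down while moving micro F1 much less.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for lists of multi-hot rows."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Label 0 is frequent and always predicted correctly;
# labels 1 and 2 are rare and never predicted.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.833 0.333
```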
  ---

+ ## 🧪 Training Details
+
+ - Optimizer: AdamW
+ - Learning rate: 2e-5
+ - Batch size: 16
+ - Max sequence length: 256
+ - Validation split: 10%
+ - Best model selection: micro-F1
+ - Trained on GPU
+
+ ---
+
+ ## 🚀 Inference Example
+
+ ```python
+ from transformers import pipeline

+ # return_all_scores=True returns a score for every label, not only the top one
+ classifier = pipeline(
+     "text-classification",
+     model="Talip7/scikit-learn-multilabel-classifier",
+     return_all_scores=True
+ )

+ text = """
+ Bug occurs in LinearRegression when sample_weight is used.
+ The issue happens after upgrading numpy.
+ """

+ outputs = classifier(text)

+ # Keep every label whose score exceeds the 0.5 prediction threshold
+ labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
+ print(labels)
+ ```

+ ---
+
+ ## 🔗 Intended Use

+ - Automated GitHub issue labeling
+ - Developer productivity tools
+ - Search and recommendation systems
+ - Foundation for semantic search + classification pipelines

+ ---

+ ## ⚠️ Limitations

+ - Rare labels have limited representation in the training data, so predictions for them are less reliable (reflected in the low Macro F1)
+ - Threshold-based predictions may require tuning per use case
+ - The model is domain-specific to scikit-learn GitHub issues
+
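As a sketch of what threshold tuning could look like, the snippet below picks, for each label, the threshold from a small grid that maximizes F1 on held-out data. The function names, the grid, and the toy probabilities are assumptions for illustration; the model card itself only reports a fixed threshold of 0.5.

```python
def f1_at(probs, truths, threshold):
    """F1 for one label when predicting positive iff prob > threshold."""
    tp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, truths) if p <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(val_probs, val_true, grid=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Pick one threshold per label; ties go to the first (lowest) grid value."""
    n_labels = len(val_true[0])
    best = []
    for j in range(n_labels):
        probs = [row[j] for row in val_probs]
        truths = [row[j] for row in val_true]
        best.append(max(grid, key=lambda th: f1_at(probs, truths, th)))
    return best


# Hypothetical validation scores: label 0 is confident, label 1 is a rare
# label whose probabilities never reach 0.5.
val_probs = [[0.9, 0.35], [0.8, 0.05], [0.2, 0.25], [0.1, 0.02]]
val_true = [[1, 1], [1, 0], [0, 1], [0, 0]]

print(tune_thresholds(val_probs, val_true))  # [0.2, 0.1]
```

In this toy case the rare label has an F1 of 0 at the default 0.5 threshold but is recovered at a lower one. Thresholds chosen this way should be fitted on validation data and checked on a separate test split.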
+ ---
+
+ ## 🛣️ Future Work
+
+ Joint semantic search + multilabel prediction
+
+ ---

+ ## 👤 Author

+ Talip7