# 🧠 Scikit-learn GitHub Issues – Multilabel Classifier

This repository contains a **multilabel text classification model** trained to predict GitHub issue labels for the **scikit-learn** project based on issue text and comments.

The model is suitable for:
- automated issue triage
- label recommendation
- downstream semantic search and filtering pipelines

---

## 🔍 Task

**Multilabel Text Classification**

Each GitHub issue can have **multiple labels** (e.g. `Bug`, `Documentation`, `module:linear_model`).  
The model predicts **all relevant labels** for a given issue text.

---

## 📦 Dataset

- **Source**: GitHub Issues from the `scikit-learn/scikit-learn` repository  
- **Collection method**: Custom GitHub REST API pipeline  
- **Preprocessing steps**:
  - Included **open and closed issues**
  - Excluded **pull requests**
  - Retrieved **all issue comments**
  - Exploded comments so each sample contains:
    - issue title
    - issue body
    - comments
  - Converted labels to **multi-hot vectors**

- **Dataset on Hugging Face**:  
  👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel

**Final dataset size**: ~12,000 samples  
**Number of unique labels**: ~20+
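
The label conversion step listed above can be sketched as follows. This is a minimal illustration, not the pipeline actually used for this dataset: the helper names `build_label_index` and `to_multi_hot` are hypothetical, and the sample issues are invented. Floats are used because binary cross-entropy targets are expected to be floating point.

```python
from typing import Iterable


def build_label_index(label_lists: Iterable[list[str]]) -> dict[str, int]:
    """Map every distinct label name to a stable column index (sorted by name)."""
    unique = sorted({label for labels in label_lists for label in labels})
    return {label: i for i, label in enumerate(unique)}


def to_multi_hot(labels: list[str], label_index: dict[str, int]) -> list[float]:
    """Return a vector with 1.0 at each active label's index and 0.0 elsewhere."""
    vector = [0.0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1.0
    return vector


# Invented issues; the label names are the examples from the Task section.
issue_labels = [
    ["Bug", "module:linear_model"],
    ["Documentation"],
    [],  # an unlabeled issue becomes an all-zero vector
]

label_index = build_label_index(issue_labels)
vectors = [to_multi_hot(labels, label_index) for labels in issue_labels]
print(label_index)
print(vectors)
```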

---

## 🧱 Model

- **Base model**: `distilbert-base-uncased`
- **Architecture**: `AutoModelForSequenceClassification`
- **Problem type**: `multi_label_classification`
- **Loss function**: Binary Cross Entropy with Logits
- **Activation**: Sigmoid
- **Prediction threshold**: 0.5
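
To make the loss, activation, and threshold entries concrete, here is a dependency-free sketch of what they compute for a single sample. It is an illustration only: the actual training relies on the framework's own implementation, and the function names and logit values below are assumptions.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_with_logits(logits: list[float], targets: list[float]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x)
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def predict(logits: list[float], threshold: float = 0.5) -> list[int]:
    """Each label is decided independently: 1 if its probability exceeds the threshold."""
    return [1 if sigmoid(x) > threshold else 0 for x in logits]


# Invented logits for three labels of one issue.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 0.0]
print(predict(logits))
print(bce_with_logits(logits, targets))
```

Because every label gets its own sigmoid, the predicted probabilities do not sum to 1, which is what allows several labels to be active for the same issue.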

---

## 📊 Evaluation Metrics

| Metric     | Score |
|-----------|-------|
| Micro F1  | **0.857** |
| Macro F1  | 0.263 |
| Epochs    | 3 |

**Notes**:
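- Micro F1 pools true positives, false positives, and false negatives across all labels, so it is dominated by the frequent labels; the score of 0.857 indicates strong performance on those.
- Macro F1 averages the per-label F1 scores with equal weight, so rare labels that the model seldom predicts pull it down. The lower value is expected given the **severe label imbalance** common in real-world GitHub issue datasets.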
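
The gap between the two averages can be reproduced on a toy example. The sketch below is plain Python with invented data, and is not the evaluation code behind the table above. It assumes the convention that a label with no true positives, false positives, or false negatives gets an F1 of 0.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> tuple[float, float]:
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Label 0 is frequent and always predicted correctly;
# label 1 is rare (one positive) and the model misses it.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

One missed rare label barely moves the micro average but halves the macro average, which is the same pattern as in the reported scores.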

---

## 🧪 Training Details

- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 256
- Validation split: 10%
- Best model selection: micro-F1
- Trained on GPU

---

## 🚀 Inference Example

```python
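from transformers import pipeline

# top_k=None returns a score for every label; it replaces the deprecated
# return_all_scores=True argument.
classifier = pipeline(
    "text-classification",
    model="Talip7/scikit-learn-multilabel-classifier",
    top_k=None,
)

text = """
Bug occurs in LinearRegression when sample_weight is used.
The issue happens after upgrading numpy.
"""

# Passing a list returns one list of per-label scores for each input text.
outputs = classifier([text])

# Keep every label whose score exceeds the 0.5 threshold.
labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
print(labels)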
```

---

## 🔗 Intended Use

- Automated GitHub issue labeling
- Developer productivity tools
- Search and recommendation systems
- Foundation for semantic search + classification pipelines

---

## ⚠️ Limitations

- Rare labels have limited representation
- Threshold-based predictions may require tuning per use case
- Model is domain-specific to scikit-learn GitHub issues
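
One possible way to tune thresholds is to pick, for each label, the cutoff that maximizes F1 on held-out validation scores. The sketch below shows this for a single label. The function names, the candidate grid, and the scores are all assumptions made for illustration, and this is not a procedure used for the published model.

```python
def f1_at(scores: list[float], truths: list[int], threshold: float) -> float:
    """F1 for one label when predicting positive for scores above the threshold."""
    tp = sum(1 for s, t in zip(scores, truths) if s > threshold and t)
    fp = sum(1 for s, t in zip(scores, truths) if s > threshold and not t)
    fn = sum(1 for s, t in zip(scores, truths) if s <= threshold and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    scores: list[float],
    truths: list[int],
    candidates: tuple[float, ...] = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
) -> float:
    """Return the candidate with the highest F1; ties keep the lowest candidate."""
    best_t, best_score = candidates[0], -1.0
    for t in candidates:
        score = f1_at(scores, truths, t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Invented validation scores for one rare label: the positives score low,
# so the default 0.5 threshold never predicts this label.
scores = [0.35, 0.30, 0.15, 0.05]
truths = [1, 1, 0, 0]

print(f1_at(scores, truths, 0.5))
print(best_threshold(scores, truths))
```

Repeating this per label gives a vector of thresholds to use in place of the single 0.5 cutoff. With very few positive examples the chosen value can overfit the validation set, so it is worth checking on separate data.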

---

## 🛣️ Future Work

- Joint semantic search + multilabel prediction

---

## 👤 Author

Talip7