	# 🧠 Scikit-learn GitHub Issues – Multilabel Classifier

	This repository contains a multilabel text classification model trained to predict GitHub issue labels for the scikit-learn project based on issue text and comments.

	The model is suitable for:
	- automated issue triage
	- label recommendation
	- downstream semantic search and filtering pipelines

	---

	## 🔍 Task

	Multilabel Text Classification

	Each GitHub issue can have multiple labels (e.g. `Bug`, `Documentation`, `module:linear_model`).
	The model predicts all relevant labels for a given issue text.

	---

	## 📦 Dataset

	- Source: GitHub Issues from the `scikit-learn/scikit-learn` repository
	- Collection method: Custom GitHub REST API pipeline
	- Preprocessing steps:
	  - Included open and closed issues
	  - Excluded pull requests
	  - Retrieved all issue comments
	  - Exploded comments so each sample contains:
	    - issue title
	    - issue body
	    - comments
	  - Converted labels to multi-hot vectors

	- Dataset on Hugging Face:
	👉 https://huggingface.co/datasets/Talip7/scikit-learn-issues-multilabel

	- Final dataset size: ~12,000 samples
	- Number of unique labels: 20+
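
The preprocessing steps listed above can be sketched roughly as follows. This is an illustrative sketch, not the actual collection pipeline: the `comments_text` field, the helper names, and the choice to keep comment-less issues as one sample with an empty comment are all assumptions. The `pull_request` key and the `labels[*].name` layout follow the GitHub REST API issue format.

```python
def explode_issues(issues):
    """Turn raw issue records into one sample per (issue, comment) pair."""
    samples = []
    for issue in issues:
        # The GitHub issues endpoint also returns pull requests,
        # which carry a "pull_request" key; skip them.
        if "pull_request" in issue:
            continue
        labels = [label["name"] for label in issue.get("labels", [])]
        # Assumption: an issue without comments still yields one sample.
        comments = issue.get("comments_text") or [""]
        for comment in comments:
            samples.append({
                "title": issue["title"],
                "body": issue.get("body") or "",
                "comment": comment,
                "labels": labels,
            })
    return samples


def build_label_index(samples):
    """Map every label name seen in the samples to a fixed column index."""
    names = sorted({name for s in samples for name in s["labels"]})
    return {name: i for i, name in enumerate(names)}


def to_multi_hot(labels, label_index):
    """Encode a list of label names as a float multi-hot vector."""
    vector = [0.0] * len(label_index)
    for name in labels:
        vector[label_index[name]] = 1.0
    return vector


# Hypothetical records, shaped loosely like GitHub API responses.
issues = [
    {"title": "LinearRegression fails with sample_weight", "body": "Traceback ...",
     "labels": [{"name": "Bug"}, {"name": "module:linear_model"}],
     "comments_text": ["Can reproduce.", "Fixed on main."]},
    {"title": "Fix typo", "body": "", "labels": [{"name": "Documentation"}],
     "pull_request": {"url": "..."}, "comments_text": []},
    {"title": "Clarify docs for Ridge", "body": None,
     "labels": [{"name": "Documentation"}], "comments_text": []},
]

samples = explode_issues(issues)
label_index = build_label_index(samples)
vectors = [to_multi_hot(s["labels"], label_index) for s in samples]
```

The vectors are floats rather than integers because a binary cross-entropy loss expects floating-point targets.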

	---

	## 🧱 Model

	- Base model: `distilbert-base-uncased`
	- Architecture: `AutoModelForSequenceClassification`
	- Problem type: `multi_label_classification`
	- Loss function: Binary Cross Entropy with Logits
	- Activation: Sigmoid
	- Prediction threshold: 0.5
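
To make the loss, activation, and threshold settings above concrete, here is a dependency-free sketch of how they fit together for a single sample. The logits and targets are made up for illustration, and the function names are hypothetical; the real model computes the same quantities on tensors.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Each label is decided independently: 1 if its score exceeds the threshold."""
    return [1 if sigmoid(x) > threshold else 0 for x in logits]


# Made-up logits for three labels and their multi-hot targets.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]

loss = bce_with_logits(logits, targets)
predicted = predict(logits)
```

Because every label gets its own sigmoid, the scores do not sum to one, and any number of labels (including none) can pass the threshold.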

	---

	## 📊 Evaluation Metrics

	| Metric   | Score |
	|----------|-------|
	| Micro F1 | 0.857 |
	| Macro F1 | 0.263 |
	| Epochs   | 3     |

	Notes:
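	- Micro F1 pools the decisions for all labels before computing the score, so 0.857 reflects strong overall performance that is dominated by the frequent labels.
	- Macro F1 averages the per-label F1 scores with equal weight, so rare labels pull it down. The lower value is expected due to severe label imbalance, common in real-world GitHub issue datasets.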
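
The gap between the two scores can be reproduced on a toy example. The sketch below uses invented predictions, not the model's outputs, and scores an undefined F1 (no true or predicted positives) as 0, which is one common convention.

```python
def f1(tp, fp, fn):
    """F1 from confusion counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute micro and macro F1 for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: sum the counts over labels, then compute one F1.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 is frequent and always right, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

A single missed rare label barely moves the micro score but halves the macro score here, which is the same pattern as in the table above.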

	---

	## 🧪 Training Details

	- Optimizer: AdamW
	- Learning rate: 2e-5
	- Batch size: 16
	- Max sequence length: 256
	- Validation split: 10%
	- Best model selection: micro-F1
	- Trained on GPU

	---

	## 🚀 Inference Example

	```python
	from transformers import pipeline

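	# top_k=None returns a score for every label; it replaces the
	# deprecated return_all_scores=True argument.
	classifier = pipeline(
	    "text-classification",
	    model="Talip7/scikit-learn-multilabel-classifier",
	    top_k=None,
	    function_to_apply="sigmoid",  # score each label independently
	)

	text = """
	Bug occurs in LinearRegression when sample_weight is used.
	The issue happens after upgrading numpy.
	"""

	# Passing a list returns one list of {"label", "score"} dicts per input.
	outputs = classifier([text])

	# Keep every label whose score exceeds the 0.5 threshold.
	labels = [o["label"] for o in outputs[0] if o["score"] > 0.5]
	print(labels)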
	```

	---

	## 🔗 Intended Use

	- Automated GitHub issue labeling
	- Developer productivity tools
	- Search and recommendation systems
	- Foundation for semantic search + classification pipelines

	---

	## ⚠️ Limitations

	- Rare labels have limited representation
	- Threshold-based predictions may require tuning per use case
	- Model is domain-specific to scikit-learn GitHub issues
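
One possible way to tune thresholds is to pick, for each label, the cutoff that maximizes F1 on held-out data. The sketch below is an assumption about how this could be done, not something the model ships with: the validation scores are invented, the function names are hypothetical, and the default of 0.5 is kept unless a candidate is strictly better.

```python
def f1_at(scores, truths, threshold):
    """F1 for one label when predicting positive for scores above threshold."""
    tp = sum(1 for s, t in zip(scores, truths) if s > threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, truths) if s > threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, truths) if s <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(val_scores, val_truths, candidates, default=0.5):
    """Return one threshold per label, chosen on validation data."""
    n_labels = len(val_scores[0])
    thresholds = []
    for j in range(n_labels):
        scores = [row[j] for row in val_scores]
        truths = [row[j] for row in val_truths]
        best_th = default
        best_f1 = f1_at(scores, truths, default)
        for th in candidates:
            score = f1_at(scores, truths, th)
            # Only move away from the default for a strict improvement.
            if score > best_f1:
                best_th, best_f1 = th, score
        thresholds.append(best_th)
    return thresholds


# Invented sigmoid scores: label 1 is rare and systematically under-scored.
val_scores = [[0.9, 0.30], [0.8, 0.10], [0.2, 0.35], [0.7, 0.05]]
val_truths = [[1, 1], [1, 0], [0, 1], [1, 0]]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

thresholds = tune_thresholds(val_scores, val_truths, candidates)
print(thresholds)
```

In this toy case the frequent label keeps the default, while the rare label gets a lower cutoff because its scores never reach 0.5.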

	---

	## 🛣️ Future Work

	- Joint semantic search + multilabel prediction

	---

	## 👤 Author

	Talip7