---
language: en
datasets:
- jigsaw-toxic-comment-classification-challenge
tags:
- text-classification
- multi-label-classification
- toxicity-detection
- bert
- transformers
- pytorch
license: apache-2.0
model-index:
- name: BERT Multi-label Toxic Comment Classifier
  results:
  - task:
      name: Multi-label Text Classification
      type: multi-label-classification
    dataset:
      name: Jigsaw Toxic Comment Classification Challenge
      type: jigsaw-toxic-comment-classification-challenge
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9187 # Replace with your actual score
---
# BERT Multi-label Toxic Comment Classifier
This model is a fine-tuned [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) transformer for **multi-label classification** on the [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) dataset.
It predicts seven independent toxicity-related labels per comment:
- toxicity
- severe toxicity
- obscene
- threat
- insult
- identity attack
- sexual explicit
## Model Details
- **Base Model**: `bert-base-uncased`
- **Task**: Multi-label text classification
- **Dataset**: Jigsaw Toxic Comment Classification Challenge (processed version)
- **Labels**: 7 toxicity-related categories
- **Training Epochs**: 2
- **Batch Size**: 16 (train), 64 (eval)
- **Metrics**: Accuracy, Macro F1, Precision, Recall
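
The Macro F1 reported above averages the per-label F1 scores without weighting by label frequency, which matters here because labels like `threat` are much rarer than `toxicity`. A minimal pure-Python sketch of the computation (the actual training pipeline likely used a library such as scikit-learn; that is an assumption):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per label column over binary
    0/1 matrices, then take the unweighted mean across labels."""
    n_labels = len(y_true[0])
    f1s = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / n_labels

# Tiny 2-label example: each label column is scored separately
y_true = [[1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1]]
print(macro_f1(y_true, y_pred))
```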
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
model.eval()

text = "You are a wonderful person!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Multi-label setup: apply a sigmoid to each logit independently
# (not a softmax) to get one probability per label
probs = torch.sigmoid(outputs.logits)
print(probs)
```
## Labels
| Index | Label |
| ----- | ---------------- |
| 0 | toxicity |
| 1 | severe_toxicity |
| 2 | obscene |
| 3 | threat |
| 4 | insult |
| 5 | identity_attack |
| 6 | sexual_explicit |
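
Using the index order above, the sigmoid probabilities can be mapped back to label names. A small sketch, where the `decode_predictions` helper and the 0.5 threshold are illustrative choices, not part of the released model:

```python
# Label order as given in the table above
LABELS = ["toxicity", "severe_toxicity", "obscene", "threat",
          "insult", "identity_attack", "sexual_explicit"]

def decode_predictions(probs, threshold=0.5):
    """Map per-label sigmoid probabilities to {label: probability},
    keeping only labels at or above the threshold."""
    return {label: p for label, p in zip(LABELS, probs) if p >= threshold}

# In practice: probs = torch.sigmoid(outputs.logits)[0].tolist()
probs = [0.91, 0.12, 0.74, 0.03, 0.66, 0.08, 0.02]
print(decode_predictions(probs))
# {'toxicity': 0.91, 'obscene': 0.74, 'insult': 0.66}
```

Because the labels are predicted independently, a single comment can trigger several labels at once, as in the example above.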
## Training Details
* Training Set: Full dataset (160k+ samples)
* Loss Function: Binary Cross Entropy (via `BertForSequenceClassification` with `problem_type="multi_label_classification"`)
* Optimizer: AdamW
* Learning Rate: 2e-5
* Evaluation Strategy: Epoch-based evaluation with early stopping on F1 score
* Model Framework: PyTorch with Hugging Face Transformers
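
Binary cross-entropy treats each of the 7 labels as an independent yes/no decision. A minimal pure-Python sketch of the per-example loss that `BCEWithLogitsLoss` computes (mean over labels), shown here only to illustrate the formula:

```python
import math

def bce_loss(logits, targets):
    """Binary cross-entropy over independent labels for one example:
    -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))], averaged over labels."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    losses = [
        -(t * math.log(sigmoid(z)) + (1 - t) * math.log(1 - sigmoid(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

# One comment with "toxicity" (index 0) and "insult" (index 4) active
logits  = [2.0, -3.0, -1.0, -4.0, 1.5, -2.5, -3.5]
targets = [1.0,  0.0,  0.0,  0.0, 1.0,  0.0,  0.0]
print(round(bce_loss(logits, targets), 4))
```

Setting `problem_type="multi_label_classification"` on the model config is what makes `BertForSequenceClassification` apply this loss instead of the default cross-entropy.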
## Repository Contents
* `pytorch_model.bin` - trained model weights
* `config.json` - model configuration
* `tokenizer.json`, `vocab.txt` - tokenizer files
* `README.md` - this file
## How to Fine-tune or Train
You can fine-tune this model using the Hugging Face `Trainer` API with your own dataset or the original Jigsaw dataset.
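
A minimal fine-tuning sketch with the `Trainer` API, mirroring the hyperparameters listed under Training Details. Dataset loading is elided: `train_ds` and `eval_ds` stand in for your tokenized datasets (note that `labels` must be float-valued for the BCE loss), and the `evaluation_strategy` argument name varies across `transformers` versions:

```python
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "Koushim/bert-multilabel-jigsaw-toxic-classifier",
    num_labels=7,
    problem_type="multi_label_classification",  # selects BCEWithLogitsLoss
)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # assumes your compute_metrics returns "f1"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # your tokenized dataset
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
trainer.train()
```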
## Citation
If you use this model in your research or project, please cite:
```bibtex
@article{devlin2019bert,
title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
journal={arXiv preprint arXiv:1810.04805},
year={2019}
}
```
## License
This model is released under the Apache 2.0 License.