---
language: en
datasets:
- jigsaw-toxic-comment-classification-challenge
tags:
- multi-label-classification
- toxicity-detection
- bert
- transformers
- pytorch
license: apache-2.0
model-index:
- name: BERT Multi-label Toxic Comment Classifier
  results:
  - task:
      name: Multi-label Text Classification
      type: multi-label-classification
    dataset:
      name: Jigsaw Toxic Comment Classification Challenge
      type: jigsaw-toxic-comment-classification-challenge
    metrics:
    - name: F1 Score (Macro)
      type: f1
      value: 0.XX # Replace with your actual score
    - name: Accuracy
      type: accuracy
      value: 0.XX # Replace with your actual score
---

# BERT Multi-label Toxic Comment Classifier

This model is a fine-tuned [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) transformer for **multi-label classification** on the [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) dataset.

It predicts the following toxicity-related labels for each comment:
- toxicity
- severe toxicity
- obscene
- threat
- insult
- identity attack
- sexual explicit

## Model Details

- **Base Model**: `bert-base-uncased`
- **Task**: Multi-label text classification
- **Dataset**: Jigsaw Toxic Comment Classification Challenge (processed version)
- **Labels**: 7 toxicity-related categories
- **Training Epochs**: 2
- **Batch Size**: 16 (train), 64 (eval)
- **Metrics**: Accuracy, Macro F1, Precision, Recall

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")

text = "You are a wonderful person!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
outputs = model(**inputs)

# Sigmoid converts the raw logits into an independent probability per label
probs = torch.sigmoid(outputs.logits)
print(probs)
```

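To turn the per-label probabilities into yes/no predictions, a common approach is to threshold each probability independently (0.5 is a typical starting point; per-label thresholds tuned on a validation set often work better). A minimal sketch, using made-up logits so it runs without downloading the model:

```python
import torch

LABELS = ["toxicity", "severe_toxicity", "obscene", "threat",
          "insult", "identity_attack", "sexual_explicit"]

def predict_labels(logits: torch.Tensor, threshold: float = 0.5) -> list[str]:
    """Return the label names whose sigmoid probability exceeds the threshold."""
    probs = torch.sigmoid(logits)
    return [label for label, p in zip(LABELS, probs.tolist()) if p > threshold]

# Simulated logits for one comment (in practice: model(**inputs).logits[0])
logits = torch.tensor([2.1, -3.0, 1.2, -4.5, 0.8, -2.2, -3.7])
print(predict_labels(logits))  # labels with positive, high logits fire
```

Because the labels are not mutually exclusive, several labels (or none) can fire for the same comment.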
## Labels

| Index | Label           |
| ----- | --------------- |
| 0     | toxicity        |
| 1     | severe_toxicity |
| 2     | obscene         |
| 3     | threat          |
| 4     | insult          |
| 5     | identity_attack |
| 6     | sexual_explicit |

## Training Details

* Loss Function: Binary cross-entropy (via `BertForSequenceClassification` with `problem_type="multi_label_classification"`)
* Optimizer: AdamW
* Learning Rate: 2e-5
* Evaluation Strategy: Epoch-based evaluation with early stopping on F1 score
* Model Framework: PyTorch with Hugging Face Transformers

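With `problem_type="multi_label_classification"`, the model computes `BCEWithLogitsLoss`: an independent sigmoid cross-entropy per label against 0/1 float targets, averaged over all elements. A small sketch of that equivalence in plain PyTorch (the logits and targets below are made up for illustration):

```python
import torch
import torch.nn.functional as F

# One comment, 7 labels; targets must be floats for multi-label BCE
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0, 1.5, -2.0, -0.5]])
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]])

# What the model computes internally for multi-label classification
loss = F.binary_cross_entropy_with_logits(logits, targets)

# The same quantity written out: per-label sigmoid cross-entropy, averaged
probs = torch.sigmoid(logits)
manual = -(targets * probs.log() + (1 - targets) * (1 - probs).log()).mean()

print(loss.item(), manual.item())  # the two values agree
```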
## Repository Contents

* `pytorch_model.bin` - trained model weights
* `config.json` - model configuration
* `tokenizer.json`, `vocab.txt` - tokenizer files
* `README.md` - this file

## How to Fine-tune or Train

You can fine-tune this model using the Hugging Face `Trainer` API with your own dataset or the original Jigsaw dataset.

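The main multi-label-specific step is preparing examples whose `labels` are 7-dimensional float vectors; `Trainer` can then be used as usual. A hypothetical minimal dataset wrapper (the token ids and label vectors below are made up for illustration; in real code the encodings come from `tokenizer(texts, truncation=True, padding=True)`):

```python
import torch
from torch.utils.data import Dataset

class ToxicCommentDataset(Dataset):
    """Pairs tokenized comments with 7-dim float label vectors for multi-label training."""

    def __init__(self, encodings: dict, labels: list[list[float]]):
        self.encodings = encodings  # output of tokenizer(texts, ...)
        self.labels = labels        # one 0/1 entry per label column

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # Float labels are required so the model applies BCE-with-logits loss
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

# Made-up tokenizer output for two short comments
encodings = {"input_ids": [[101, 2023, 102], [101, 2062, 102]],
             "attention_mask": [[1, 1, 1], [1, 1, 1]]}
labels = [[1, 0, 1, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0]]

ds = ToxicCommentDataset(encodings, labels)
print(ds[0]["labels"].dtype, ds[0]["labels"].shape)  # torch.float32 torch.Size([7])
```

Instances of this dataset can be passed directly as `train_dataset`/`eval_dataset` to `Trainer`.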
## Citation

If you use this model in your research or project, please cite:

```bibtex
@article{devlin2019bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2019}
}
```

## License

Apache 2.0 License