AnnyNguyen committed on
Commit 27688b9 · verified · 1 Parent(s): 724f566

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +157 -0
README.md ADDED
@@ -0,0 +1,157 @@
+ ---
+ license: mit
+ base_model: unknown
+ tags:
+ - vietnamese
+ - hate-speech-detection
+ - text-classification
+ - offensive-language-detection
+ datasets:
+ - visolex/vihsd
+ metrics:
+ - accuracy
+ - macro-f1
+ - weighted-f1
+ model-index:
+ - name: bilstm-hsd
+   results:
+   - task:
+       type: text-classification
+       name: Hate Speech Detection
+     dataset:
+       name: ViHSD
+       type: hate-speech-detection
+     metrics:
+     - type: accuracy
+       value: 0.8388
+     - type: macro-f1
+       value: 0.3041
+     - type: weighted-f1
+       value: 0.7652
+     - type: macro-precision
+       value: 0.2796
+     - type: macro-recall
+       value: 0.3333
+ ---
+
+ # BiLSTM: Hate Speech Detection for Vietnamese Text
+
+ This BiLSTM model was trained on the **ViHSD (Vietnamese Hate Speech Detection Dataset)** to classify Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE.
+ Its base model is listed as [unknown](https://huggingface.co/unknown).
+
+ ## Model Details
+
+ * **Base Model**: unknown
+ * **Description**: BiLSTM fine-tuned for Vietnamese hate speech detection
+ * **Architecture**: BiLSTM (further details are not specified)
+ * **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
+ * **Fine-tuning Framework**: Hugging Face Transformers + PyTorch
+ * **Task**: Hate Speech Classification (3 classes)
+
+ ### Hyperparameters
+
+ * **Batch size**: `32`
+ * **Learning rate**: `2e-5`
+ * **Epochs**: `100`
+ * **Max sequence length**: `256`
+ * **Weight decay**: `0.01`
+ * **Warmup steps**: `500`
+ * **Early stopping patience**: `5`
+ * **Optimizer**: AdamW
+ * **Learning rate scheduler**: Cosine with warmup
+
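To make the "cosine with warmup" schedule concrete, here is a minimal pure-Python sketch of one common formulation: linear warmup to the base learning rate over 500 steps, then cosine decay to zero. The function name and `total_steps=10_000` are assumptions for illustration; the card does not state the total number of optimizer steps, and the actual training script may differ.

```python
import math


def cosine_with_warmup_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given optimizer step.

    total_steps is a hypothetical value; the model card does not report it.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points along the schedule.
for step in (0, 250, 500, 5_250, 10_000):
    print(step, cosine_with_warmup_lr(step))
```

The rate peaks at the end of warmup and is halved at the midpoint of the decay phase, which is why the warmup length matters most for short runs that stop early.
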
+ ## Dataset
+
+ The model was trained on the **ViHSD (Vietnamese Hate Speech Detection Dataset)**, which contains ~10,000 Vietnamese comments from social media.
+
+ ### Label Descriptions
+
+ * **CLEAN (0)**: Normal content without offensive language
+ * **OFFENSIVE (1)**: Mildly offensive or inappropriate content
+ * **HATE (2)**: Hate speech, extremist language, severe threats
+
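+ ## Evaluation Results
+
+ The model was evaluated on the test set with the following metrics:
+
+ * **Accuracy**: `0.8388`
+ * **Macro-F1**: `0.3041`
+ * **Weighted-F1**: `0.7652`
+ * **Macro-Precision**: `0.2796`
+ * **Macro-Recall**: `0.3333`
+
+ These values are consistent with a model that predicts a single class (presumably the majority class, CLEAN) for every test example: macro-recall is 1/3, macro-precision equals accuracy divided by 3, and macro-F1 matches one class with F1 ≈ 0.912 averaged over three classes. Accuracy alone therefore likely overstates how well OFFENSIVE and HATE are detected.
+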
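The arithmetic behind the reported metrics can be checked with a short sketch. It computes what a classifier that always predicts the most frequent class would score, assuming that class makes up 83.88% of the test set and that undefined per-class precision is counted as 0. Both are assumptions, as the card reports neither the class distribution nor the metric implementation. The function name is illustrative.

```python
def majority_class_metrics(majority_share, num_classes=3):
    """Metrics for a classifier that always predicts the most frequent class.

    Classes that are never predicted get precision, recall, and F1 of 0
    (an assumption about how the reported metrics were computed).
    """
    precision_major = majority_share  # every prediction is the majority class
    recall_major = 1.0                # every majority example is recovered
    f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)
    return {
        "accuracy": majority_share,
        "macro_precision": precision_major / num_classes,
        "macro_recall": recall_major / num_classes,
        "macro_f1": f1_major / num_classes,
        "weighted_f1": majority_share * f1_major,
    }


# Values reported on the model card.
reported = {
    "accuracy": 0.8388,
    "macro_precision": 0.2796,
    "macro_recall": 0.3333,
    "macro_f1": 0.3041,
    "weighted_f1": 0.7652,
}

baseline = majority_class_metrics(0.8388)
for name, value in reported.items():
    print(f"{name}: reported={value:.4f} majority-baseline={baseline[name]:.4f}")
```

Each baseline value agrees with the reported one to within the rounding of the four-digit accuracy figure, so per-class metrics or a confusion matrix would be needed to tell whether the model detects the minority classes at all.
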
+ ### Basic Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer.
+ # Caveat: per the note on vocab-based models below, this repository may not
+ # ship a Hugging Face tokenizer, in which case AutoTokenizer will fail here.
+ model_name = "visolex/bilstm-hsd"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Classify text, truncating to the 256-token training length.
+ text = "Văn bản tiếng Việt cần phân loại"  # "Vietnamese text to be classified"
+ inputs = tokenizer(
+     text, return_tensors="pt", padding=True, truncation=True, max_length=256
+ )
+
+ # Run inference without tracking gradients.
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     predicted_label = torch.argmax(predictions, dim=-1).item()
+
+ # Label mapping
+ label_names = {
+     0: "CLEAN",
+     1: "OFFENSIVE",
+     2: "HATE"
+ }
+
+ print(f"Predicted label: {label_names[predicted_label]}")
+ print(f"Confidence scores: {predictions[0].tolist()}")
+ ```
+
+ **⚠️ Note for Vocab-based Models**: This model (`bilstm`) uses custom vocabulary-based tokenization and does not include a Hugging Face tokenizer, so the `AutoTokenizer.from_pretrained(model_name)` call above may fail for this repository. You will need to implement custom tokenization or load a tokenizer from a compatible base model. The model expects word-level tokenized input.
+
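As an illustration of what "custom vocabulary-based tokenization" can look like, the sketch below builds a word-to-id vocabulary and encodes text into fixed-length id sequences. Everything here is hypothetical: the special tokens, the function names, and the lowercase-plus-whitespace `tokenize` step are stand-ins, not the model's actual preprocessing. Vietnamese words often span several syllables, so the original pipeline may well have used a dedicated word segmenter, and the ids only match the model if its original vocabulary file is used.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(text):
    # Stand-in for a real Vietnamese word segmenter: lowercase + whitespace split.
    return text.lower().split()


def build_vocab(texts, min_freq=1):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(token for text in texts for token in tokenize(text))
    vocab = {PAD: 0, UNK: 1}
    for token in sorted(counts):
        if counts[token] >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_length=256):
    """Convert text to ids, truncating or padding to max_length."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokenize(text)][:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))


# Tiny demonstration corpus; unseen words map to the <unk> id.
vocab = build_vocab(["Xin chào", "chào bạn"])
print(encode("chào bạn ơi", vocab, max_length=5))
```
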
+ ## Training Details
+
+ ### Training Data
+ - **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
+ - **Total samples**: ~10,000 Vietnamese comments from social media
+ - **Training split**: ~70%
+ - **Validation split**: ~15%
+ - **Test split**: ~15%
+
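The approximate 70/15/15 split can be sketched as a seeded shuffle of example indices. The function name, the seed, and the use of a purely random (non-stratified) split are assumptions; the card does not say how the split was produced.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder, roughly 15%
    return train, val, test


train, val, test = split_indices(10_000)
print(len(train), len(val), len(test))
```

With a heavily imbalanced label distribution, a stratified split that preserves class proportions in each part is usually a safer choice than the plain shuffle shown here.
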
+ ### Training Configuration
+ - **Framework**: PyTorch + Hugging Face Transformers
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 32
+ - **Max Length**: 256 tokens
+ - **Epochs**: up to 100 (early stopping with patience 5)
+ - **Weight Decay**: 0.01
+ - **Warmup Steps**: 500
+
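Early stopping with patience 5 means training halts once the monitored validation score has failed to improve for five consecutive epochs, so the 100-epoch limit is only an upper bound. The sketch below shows that logic on a list of per-epoch scores. The function name and the "higher is better" score are assumptions, since the card does not say which metric was monitored.

```python
def early_stopping_epoch(val_scores, patience=5, max_epochs=100):
    """Return (best_epoch, last_epoch_run) for a sequence of validation scores.

    val_scores stands in for a real training loop; higher is assumed better.
    """
    best_score = float("-inf")
    best_epoch = 0
    stale_epochs = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if epoch > max_epochs:
            break
        last_epoch = epoch
        if score > best_score:
            # New best score: remember it and reset the patience counter.
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, last_epoch


# Hypothetical validation scores that plateau after the second epoch.
scores = [0.30, 0.31, 0.31, 0.30, 0.31, 0.30, 0.29, 0.32]
print(early_stopping_epoch(scores))
```

In practice the checkpoint from `best_epoch` is the one to keep, since the later epochs did not improve on it.
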
+ ## Contact & Support
+
+ - **GitHub**: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
+ - **Issues**: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
+ - **Questions**: Open a discussion on the model's Hugging Face page
+
+ ## License
+
+ This model is distributed under the MIT License.
+
+ ## Acknowledgments
+
+ - Base model: [unknown](https://huggingface.co/unknown)
+ - Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
+ - Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
+ - ViSoLex Toolkit
+
+ ---