AnnyNguyen committed
Commit a81bea1 · verified
1 Parent(s): 4cb7aff

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +151 -0
README.md ADDED
@@ -0,0 +1,151 @@
---
license: mit
base_model: unknown
tags:
- vietnamese
- hate-speech-detection
- text-classification
- offensive-language-detection
datasets:
- visolex/vihsd
metrics:
- accuracy
- macro-f1
- weighted-f1
model-index:
- name: bilstm-hsd
  results:
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      name: ViHSD
      type: hate-speech-detection
    metrics:
    - type: accuracy
      value: 0.8388
    - type: macro-f1
      value: 0.3041
    - type: weighted-f1
      value: 0.7652
    - type: macro-precision
      value: 0.2796
    - type: macro-recall
      value: 0.3333
---

# BiLSTM: Hate Speech Detection for Vietnamese Text

This BiLSTM model was fine-tuned on the **ViHSD (Vietnamese Hate Speech Detection Dataset)** to classify Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE. The base model is not specified (the card lists it as [unknown](https://huggingface.co/unknown)).

## Model Details

* **Base Model**: unknown
* **Description**: BiLSTM fine-tuned for Vietnamese hate speech detection
* **Architecture**: Not specified (the model name suggests a bidirectional LSTM)
* **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
* **Fine-tuning Framework**: Hugging Face Transformers + PyTorch
* **Task**: Hate Speech Classification (3 classes)

### Hyperparameters

* **Batch size**: `32`
* **Learning rate**: `2e-5`
* **Epochs**: `100`
* **Max sequence length**: `256`
* **Weight decay**: `0.01`
* **Warmup steps**: `500`
* **Early stopping patience**: `5`
* **Optimizer**: AdamW
* **Learning rate scheduler**: Cosine with warmup

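The "cosine with warmup" schedule listed above can be sketched in plain Python. This is a minimal illustration, not the training code: the helper name `lr_at` and the total step count are assumptions. The total is estimated from the card's approximate figures (about 7,000 training samples, batch size 32, up to 100 epochs), and early stopping would end training sooner. The curve has the same shape as `get_cosine_schedule_with_warmup` in Transformers, though the source does not say which scheduler implementation was used.

```python
import math

BASE_LR = 2e-5        # learning rate from the card
WARMUP_STEPS = 500    # warmup steps from the card

# Assumption: ~7,000 training samples (70% of ~10,000), batch size 32, 100 epochs.
STEPS_PER_EPOCH = math.ceil(7000 / 32)
TOTAL_STEPS = STEPS_PER_EPOCH * 100


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Cosine decay from BASE_LR down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 250, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With these assumed totals, warmup covers only the first two or three epochs, so most of training runs on the cosine decay.
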
## Dataset

The model was trained on **ViHSD (Vietnamese Hate Speech Detection Dataset)**, which contains ~10,000 Vietnamese comments from social media.

### Label Descriptions

* **CLEAN (0)**: Normal content without offensive language
* **OFFENSIVE (1)**: Mildly offensive or inappropriate content
* **HATE (2)**: Hate speech, extremist language, severe threats

## Evaluation Results

The model was evaluated on the test set with the following metrics:

* **Accuracy**: `0.8388`
* **Macro-F1**: `0.3041`
* **Weighted-F1**: `0.7652`
* **Macro-Precision**: `0.2796`
* **Macro-Recall**: `0.3333`

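As a worked check on how these metrics relate, the sketch below computes macro- and weighted-averaged scores for a hypothetical three-class classifier that assigns every test example to a single class whose share of the test set equals the reported accuracy. The scenario and variable names are assumptions for illustration; the source includes no confusion matrix or per-class scores.

```python
# Assumption: every example is predicted as one class (for example CLEAN),
# and that class makes up 83.88% of the test set.
majority_share = 0.8388
num_classes = 3

# Scores for the always-predicted class.
precision_major = majority_share   # correct predictions / all predictions
recall_major = 1.0                 # every true member of the class is recovered
f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)

# The never-predicted classes score 0 (precision conventionally set to 0),
# so each macro average is the majority-class score divided by the class count.
macro_precision = precision_major / num_classes
macro_recall = recall_major / num_classes
macro_f1 = f1_major / num_classes

# Weighted F1 weights each class by its support; only the majority class contributes.
weighted_f1 = majority_share * f1_major

print(f"macro-precision: {macro_precision:.4f}")
print(f"macro-recall:    {macro_recall:.4f}")
print(f"macro-F1:        {macro_f1:.4f}")
print(f"weighted-F1:     {weighted_f1:.4f}")
```

Under this assumption the arithmetic matches the reported values to within 0.0001, so the reported scores are consistent with a single-class predictor. Per-class recall for OFFENSIVE and HATE would confirm or rule this out.
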
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "visolex/bilstm-hsd"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify text (the sample string means "Vietnamese text to classify")
text = "Văn bản tiếng Việt cần phân loại"
# max_length matches the training max sequence length listed in this card
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1).item()

# Label mapping
label_names = {
    0: "CLEAN",
    1: "OFFENSIVE",
    2: "HATE",
}

print(f"Predicted label: {label_names[predicted_label]}")
print(f"Confidence scores: {predictions[0].tolist()}")
```

## Training Details

### Training Data
- **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
- **Total samples**: ~10,000 Vietnamese comments from social media
- **Training split**: ~70%
- **Validation split**: ~15%
- **Test split**: ~15%

### Training Configuration
- **Framework**: PyTorch + Hugging Face Transformers
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 32
- **Max Length**: 256 tokens
- **Epochs**: 100 (with early stopping patience: 5)
- **Weight Decay**: 0.01
- **Warmup Steps**: 500

## Contact & Support

- **GitHub**: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
- **Issues**: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
- **Questions**: Open a discussion on the model's Hugging Face page

## License

This model is distributed under the MIT License.

## Acknowledgments

- Base model: [unknown](https://huggingface.co/unknown)
- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
- Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
- ViSoLex Toolkit

---