AnnyNguyen committed on
Commit a039170 · verified · 1 Parent(s): 0904f51

Delete README.md with huggingface_hub

Files changed (1)
  1. README.md +0 -155
README.md DELETED
@@ -1,155 +0,0 @@
- ---
- license: mit
- base_model: vinai/bartpho-syllable-base
- tags:
- - vietnamese
- - hate-speech-detection
- - text-classification
- - offensive-language-detection
- datasets:
- - visolex/vihsd
- metrics:
- - accuracy
- - macro-f1
- - weighted-f1
- model-index:
- - name: bartpho-hsd
-   results:
-   - task:
-       type: text-classification
-       name: Hate Speech Detection
-     dataset:
-       name: ViHSD
-       type: hate-speech-detection
-     metrics:
-     - type: accuracy
-       value: 0.8985
-     - type: macro-f1
-       value: 0.6791
-     - type: weighted-f1
-       value: 0.8886
-     - type: macro-precision
-       value: 0.7664
-     - type: macro-recall
-       value: 0.6289
- ---
-
- # BARTpho: Hate Speech Detection for Vietnamese Text
-
- This model is a fine-tuned version of [vinai/bartpho-syllable-base](https://huggingface.co/vinai/bartpho-syllable-base)
- on the **ViHSD (Vietnamese Hate Speech Detection Dataset)** for classifying Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE.
-
- ## Model Details
-
- * **Base Model**: vinai/bartpho-syllable-base
- * **Description**: BARTpho fine-tuned for Vietnamese hate speech classification
- * **Architecture**: BARTpho (Bidirectional and Auto-Regressive Transformer for Vietnamese)
- * **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
- * **Fine-tuning Framework**: Hugging Face Transformers + PyTorch
- * **Task**: Hate Speech Classification (3 classes)
-
- ### Hyperparameters
-
- * **Batch size**: `32`
- * **Learning rate**: `2e-5`
- * **Epochs**: `100`
- * **Max sequence length**: `256`
- * **Weight decay**: `0.01`
- * **Warmup steps**: `500`
- * **Early stopping patience**: `5`
- * **Optimizer**: AdamW
- * **Learning rate scheduler**: Cosine with warmup
-
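The removed hyperparameter list names a "Cosine with warmup" scheduler with a `2e-5` learning rate and `500` warmup steps. The following is a minimal sketch of what such a schedule typically looks like, assuming linear warmup from zero followed by a half-cosine decay to zero (the shape used by `get_cosine_schedule_with_warmup` in Transformers). The README does not confirm which implementation was used, and `total_steps` is a hypothetical value because the total number of optimizer steps is not stated.

```python
import math

BASE_LR = 2e-5       # learning rate listed in the README
WARMUP_STEPS = 500   # warmup steps listed in the README


def lr_at_step(step: int, total_steps: int,
               base_lr: float = BASE_LR,
               warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half cosine: factor 1.0 at progress 0, factor 0.0 at progress 1.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 5_000  # hypothetical: the README does not give the total step count
    for s in (0, 250, 500, 2_750, 5_000):
        print(s, f"{lr_at_step(s, total):.2e}")
```

Because training used early stopping, the decay may never have reached its end point in practice; the sketch only illustrates the shape of the curve.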
- ## Dataset
-
- The model was trained on **ViHSD (Vietnamese Hate Speech Detection Dataset)**, which contains ~10,000 Vietnamese comments from social media.
-
- ### Label Descriptions
-
- * **CLEAN (0)**: Normal content without offensive language
- * **OFFENSIVE (1)**: Mildly offensive or inappropriate content
- * **HATE (2)**: Hate speech, extremist language, severe threats
-
- ## Evaluation Results
-
- The model was evaluated on the test set with the following metrics:
-
- * **Accuracy**: `0.8985`
- * **Macro-F1**: `0.6791`
- * **Weighted-F1**: `0.8886`
- * **Macro-Precision**: `0.7664`
- * **Macro-Recall**: `0.6289`
-
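The reported accuracy (`0.8985`) and weighted-F1 (`0.8886`) are much higher than the macro-F1 (`0.6791`). A gap like this usually appears when a majority class dominates the test set, although the README does not report the class distribution. The sketch below computes the five metric types from scratch on made-up labels, assuming the standard definitions (macro = unweighted mean over classes, weighted = mean weighted by class support). The README does not name the scoring tool, and the toy data is purely illustrative.

```python
from collections import Counter

LABELS = (0, 1, 2)  # CLEAN, OFFENSIVE, HATE, as in the README label list


def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def summarize(y_true, y_pred, labels=LABELS):
    """Accuracy plus macro and support-weighted averages of per-class scores."""
    n = len(y_true)
    support = Counter(y_true)
    scores = {c: per_class_prf(y_true, y_pred, c) for c in labels}
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "macro_precision": sum(s[0] for s in scores.values()) / len(labels),
        "macro_recall": sum(s[1] for s in scores.values()) / len(labels),
        "macro_f1": sum(s[2] for s in scores.values()) / len(labels),
        "weighted_f1": sum(scores[c][2] * support[c] for c in labels) / n,
    }


if __name__ == "__main__":
    # Toy, imbalanced example: 8 CLEAN, 1 OFFENSIVE (missed), 1 HATE.
    y_true = [0] * 8 + [1] + [2]
    y_pred = [0] * 8 + [0] + [2]
    for name, value in summarize(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Under these definitions macro-F1 is the mean of the per-class F1 scores, not the harmonic mean of macro-precision and macro-recall. For the reported numbers that harmonic mean would be about 0.691 rather than `0.6791`, which is consistent with the per-class averaging assumed here.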
- ### Basic Usage
-
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
-
- # Load model and tokenizer
- model_name = "visolex/bartpho-hsd"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForSequenceClassification.from_pretrained(
-     model_name
- )
-
- # Classify text
- text = "Văn bản tiếng Việt cần phân loại"
- inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
-
- # Run inference without tracking gradients
- with torch.no_grad():
-     outputs = model(**inputs)
-     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-     predicted_label = torch.argmax(predictions, dim=-1).item()
-
- # Label mapping
- label_names = {
-     0: "CLEAN",
-     1: "OFFENSIVE",
-     2: "HATE"
- }
-
- print(f"Predicted label: {label_names[predicted_label]}")
- print(f"Confidence scores: {predictions[0].tolist()}")
- ```
-
- ## Training Details
-
- ### Training Data
- - **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
- - **Total samples**: ~10,000 Vietnamese comments from social media
- - **Training split**: ~70%
- - **Validation split**: ~15%
- - **Test split**: ~15%
-
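To make the approximate proportions above concrete, here is a minimal sketch of a seeded 70/15/15 index split. The README gives only rough percentages and does not say whether the split was random or taken from the dataset's published partition, so the function name `split_indices`, the shuffling, and the seed are all assumptions for illustration.

```python
import random


def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle indices deterministically and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test split
    return train, val, test


if __name__ == "__main__":
    # 10_000 stands in for the "~10,000" comments mentioned in the README.
    train, val, test = split_indices(10_000)
    print(len(train), len(val), len(test))
```

With 10,000 samples this yields 7,000 training, 1,500 validation, and 1,500 test indices; the actual ViHSD counts may differ.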
- ### Training Configuration
- - **Framework**: PyTorch + Hugging Face Transformers
- - **Optimizer**: AdamW
- - **Learning Rate**: 2e-5
- - **Batch Size**: 32
- - **Max Length**: 256 tokens
- - **Epochs**: 100 (with early stopping patience: 5)
- - **Weight Decay**: 0.01
- - **Warmup Steps**: 500
-
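The configuration above sets an upper bound of 100 epochs with an early stopping patience of 5. The sketch below shows how those two settings typically interact, assuming training stops once a validation score has failed to improve for `patience` consecutive epochs. The README does not say which metric was monitored, so the scores here are hypothetical, and `run_early_stopping` is an illustrative helper rather than the training code that was used.

```python
def run_early_stopping(val_scores, max_epochs=100, patience=5):
    """Return (best_epoch, last_epoch_run) for per-epoch validation scores.

    Epochs are 1-indexed. Higher scores are assumed to be better.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            # New best: remember it and reset the patience counter.
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted, stop before max_epochs
    return best_epoch, last_epoch


if __name__ == "__main__":
    # Hypothetical validation scores, one per epoch.
    scores = [0.60, 0.65, 0.67, 0.66, 0.67, 0.66, 0.65, 0.64, 0.70]
    print(run_early_stopping(scores))
```

In this toy run the best score is first reached at epoch 3, and training stops after epoch 8, so the later improvement at epoch 9 is never seen. That is the usual trade-off of a small patience value.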
- ## Contact & Support
-
- - **GitHub**: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
- - **Issues**: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
- - **Questions**: Open a discussion on the model's Hugging Face page
-
- ## License
-
- This model is distributed under the MIT License.
-
- ## Acknowledgments
-
- - Base model: [vinai/bartpho-syllable-base](https://huggingface.co/vinai/bartpho-syllable-base)
- - Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
- - Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
- - ViSoLex Toolkit
-
- ---