AnnyNguyen committed · Commit 8e2a8d7 · verified · 1 Parent(s): 864ba07

Delete README.md with huggingface_hub

Files changed (1)
  1. README.md +0 -172
README.md DELETED
@@ -1,172 +0,0 @@
1
- ---
2
- license: mit
3
- base_model: unknown
4
- tags:
5
- - vietnamese
6
- - hate-speech-detection
7
- - text-classification
8
- - offensive-language-detection
9
- datasets:
10
- - visolex/vihsd
11
- metrics:
12
- - accuracy
13
- - macro-f1
14
- - weighted-f1
15
- model-index:
16
- - name: bilstm-hsd
17
- results:
18
- - task:
19
- type: text-classification
20
- name: Hate Speech Detection
21
- dataset:
22
- name: ViHSD
23
- type: hate-speech-detection
24
- metrics:
25
- - type: accuracy
26
- value: 0.8388
27
- - type: macro-f1
28
- value: 0.3041
29
- - type: weighted-f1
30
- value: 0.7652
31
- - type: macro-precision
32
- value: 0.2796
33
- - type: macro-recall
34
- value: 0.3333
35
- ---
36
-
37
- # BILSTM: Hate Speech Detection for Vietnamese Text
38
-
39
- This model is a fine-tuned version of [unknown](https://huggingface.co/unknown)
40
- on the **ViHSD (Vietnamese Hate Speech Detection Dataset)** for classifying Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE.
41
-
42
- ## Model Details
43
-
44
- * **Base Model**: unknown
45
- * **Description**: BiLSTM fine-tuned for Vietnamese hate speech detection
46
- * **Architecture**: Unknown
47
- * **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
48
- * **Fine-tuning Framework**: HuggingFace Transformers + PyTorch
49
- * **Task**: Hate Speech Classification (3 classes)
50
-
51
- ### Hyperparameters
52
-
53
- * **Batch size**: `32`
54
- * **Learning rate**: `2e-5`
55
- * **Epochs**: `100`
56
- * **Max sequence length**: `256`
57
- * **Weight decay**: `0.01`
58
- * **Warmup steps**: `500`
59
- * **Early stopping patience**: `5`
60
- * **Optimizer**: AdamW
61
- * **Learning rate scheduler**: Cosine with warmup
62
-
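The warmup and scheduler settings listed above can be sketched as a learning-rate curve. This is a minimal sketch of the usual linear-warmup-then-cosine-decay shape, not the author's training code. The base rate and warmup length come from the README; `TOTAL_STEPS` and the `lr_at` helper are assumptions, since the README does not state the total number of optimizer steps.

```python
import math

BASE_LR = 2e-5       # from the README
WARMUP_STEPS = 500   # from the README
TOTAL_STEPS = 5000   # assumption: the README does not state this


def lr_at(step: int, base_lr: float = BASE_LR,
          warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Linear warmup to base_lr, then half-cosine decay towards zero."""
    if step < warmup:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup)
    # Decay phase: progress runs from 0 to 1 over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 2750, 5000):
    print(step, lr_at(step))
```

With a PyTorch optimizer, `transformers.get_cosine_schedule_with_warmup` provides a schedule of this shape; whether the original training used it is not stated.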
63
- ## Dataset
64
-
65
- The model was trained on **ViHSD (Vietnamese Hate Speech Detection Dataset)**, which contains ~10,000 Vietnamese comments from social media.
66
-
67
- ### Label Descriptions:
68
-
69
- * **CLEAN (0)**: Normal content without offensive language
70
- * **OFFENSIVE (1)**: Mildly offensive or inappropriate content
71
- * **HATE (2)**: Hate speech, extremist language, severe threats
72
-
73
- ## Evaluation Results
74
-
75
- The model was evaluated on the test set with the following metrics:
76
-
77
- * **Accuracy**: `0.8388`
78
- * **Macro-F1**: `0.3041`
79
- * **Weighted-F1**: `0.7652`
80
- * **Macro-Precision**: `0.2796`
81
- * **Macro-Recall**: `0.3333`
82
-
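The gap between accuracy and the macro-averaged scores can be sanity-checked by hand. The sketch below computes the metrics for a hypothetical three-class classifier that always predicts the majority class, given only the reported accuracy. The `majority_only_metrics` helper and the single-class assumption are illustrative; the README does not say that the model behaves this way.

```python
def majority_only_metrics(accuracy: float, num_classes: int = 3) -> dict:
    """Metrics for a classifier that predicts only the majority class.

    In that case the majority class has precision == accuracy and
    recall == 1.0, and every other class scores 0 on all three metrics.
    """
    f1_major = 2 * accuracy * 1.0 / (accuracy + 1.0)
    return {
        "macro_precision": accuracy / num_classes,
        "macro_recall": 1.0 / num_classes,
        "macro_f1": f1_major / num_classes,
        # Weighted F1 weights each class by its share of the test set;
        # the majority class share equals the accuracy here.
        "weighted_f1": accuracy * f1_major,
    }


metrics = majority_only_metrics(0.8388)  # accuracy reported in the README
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption the outputs land within rounding of the reported values, so a per-class breakdown or confusion matrix would be worth checking alongside the accuracy figure.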
83
- ### Basic Usage
84
-
85
- **⚠️ Important**: This model uses a custom architecture. You must pass `trust_remote_code=True` when loading it.
86
-
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
-
- # Load model and tokenizer
- model_name = "visolex/bilstm-hsd"
- model_type = "bilstm"
-
- # Load tokenizer
- # Note: vocab-based models (bilstm, textcnn) use a custom vocabulary, so the
- # base model's tokenizer may not work and custom tokenization is needed.
- if model_type in ("bilstm", "textcnn"):
-     print("⚠️ Note: This model uses custom vocabulary-based tokenization")
-     print("   Please refer to the model's documentation for tokenization details")
-     tokenizer = None
- else:
-     # Load tokenizer from the model repo (it will use base model's tokenizer)
-     tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- # Load model with trust_remote_code=True (REQUIRED for custom models)
- model = AutoModelForSequenceClassification.from_pretrained(
-     model_name,
-     trust_remote_code=True,  # ⚠️ REQUIRED: Allows loading custom model classes from models.py
- )
-
- # Classify text
- if tokenizer is not None:
-     text = "Văn bản tiếng Việt cần phân loại"  # Vietnamese text to classify
-     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
-
-     with torch.no_grad():
-         outputs = model(**inputs)
-     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-     predicted_label = torch.argmax(predictions, dim=-1).item()
-
-     # Label mapping
-     label_names = {
-         0: "CLEAN",
-         1: "OFFENSIVE",
-         2: "HATE",
-     }
-
-     print(f"Predicted label: {label_names[predicted_label]}")
-     print(f"Confidence scores: {predictions[0].tolist()}")
- else:
-     print("Please implement custom tokenization for this vocab-based model")
- ```
134
-
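The usage snippet leaves tokenization for vocabulary-based models to the reader. A minimal sketch of one common approach is whitespace tokenization, lookup in a word-to-id table, then truncation and padding to the maximum length of 256. The toy vocabulary, the `<pad>` and `<unk>` token names and ids, and the `encode` helper are all hypothetical; the model's actual vocabulary and preprocessing are not documented in the README.

```python
import unicodedata
from typing import Dict, List

MAX_LEN = 256              # max sequence length from the README
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

# Hypothetical toy vocabulary; a real one would ship with the model.
vocab: Dict[str, int] = {PAD: 0, UNK: 1, "văn": 2, "bản": 3, "tiếng": 4, "việt": 5}


def encode(text: str, vocab: Dict[str, int], max_len: int = MAX_LEN) -> List[int]:
    """Map text to a fixed-length list of token ids."""
    # Normalize to NFC so composed Vietnamese characters compare consistently.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = text.split()
    # Unknown words fall back to the <unk> id; long inputs are truncated.
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    # Short inputs are right-padded with the <pad> id.
    ids += [vocab[PAD]] * (max_len - len(ids))
    return ids


ids = encode("Văn bản tiếng Việt cần phân loại", vocab)
print(len(ids), ids[:8])
```

The resulting ids would still need to be wrapped in a tensor and passed to the model in whatever form its custom `forward` expects, which the README does not specify.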
135
- ## Training Details
136
-
137
- ### Training Data
138
- - **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
139
- - **Total samples**: ~10,000 Vietnamese comments from social media
140
- - **Training split**: ~70%
141
- - **Validation split**: ~15%
142
- - **Test split**: ~15%
143
-
144
- ### Training Configuration
145
- - **Framework**: PyTorch + HuggingFace Transformers
146
- - **Optimizer**: AdamW
147
- - **Learning Rate**: 2e-5
148
- - **Batch Size**: 32
149
- - **Max Length**: 256 tokens
150
- - **Epochs**: 100 (with early stopping patience: 5)
151
- - **Weight Decay**: 0.01
152
- - **Warmup Steps**: 500
153
-
154
-
155
- ## Contact & Support
156
-
157
- - **GitHub**: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
158
- - **Issues**: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
159
- - **Questions**: Open a discussion on the model's Hugging Face page
160
-
161
- ## License
162
-
163
- This model is distributed under the MIT License.
164
-
165
- ## Acknowledgments
166
-
167
- - Base model: [unknown](https://huggingface.co/unknown)
168
- - Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
169
- - Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
170
- - ViSoLex Toolkit
171
-
172
- ---