+---
+license: mit
+base_model: unknown
+tags:
+- vietnamese
+- hate-speech-detection
+- text-classification
+- offensive-language-detection
+datasets:
+- visolex/vihsd
+metrics:
+- accuracy
+- macro-f1
+- weighted-f1
+model-index:
+- name: bilstm-hsd
+  results:
+  - task:
+      type: text-classification
+      name: Hate Speech Detection
+    dataset:
+      name: ViHSD
+      type: hate-speech-detection
+    metrics:
+    - type: accuracy
+      value: 0.8388
+    - type: macro-f1
+      value: 0.3041
+    - type: weighted-f1
+      value: 0.7652
+    - type: macro-precision
+      value: 0.2796
+    - type: macro-recall
+      value: 0.3333
+---
+# BiLSTM: Hate Speech Detection for Vietnamese Text
+This model is a BiLSTM classifier fine-tuned on the **ViHSD (Vietnamese Hate Speech Detection Dataset)** to classify Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE. The base model is not specified (the card lists it as `unknown`).
+## Model Details
+* **Base Model**: unknown
+* **Description**: BiLSTM fine-tuned for Vietnamese hate speech detection
+* **Architecture**: BiLSTM (bidirectional LSTM); further details are not specified
+* **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
+* **Fine-tuning Framework**: Hugging Face Transformers + PyTorch
+* **Task**: Hate Speech Classification (3 classes)
+### Hyperparameters
+* **Batch size**: `32`
+* **Learning rate**: `2e-5`
+* **Epochs**: `100`
+* **Max sequence length**: `256`
+* **Weight decay**: `0.01`
+* **Warmup steps**: `500`
+* **Early stopping patience**: `5`
+* **Optimizer**: AdamW
+* **Learning rate scheduler**: Cosine with warmup
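The warmup and cosine settings above can be sketched as a plain function of the training step. This is a minimal sketch, not the authors' training code: `total_steps` is a hypothetical value (the card does not state the total number of optimizer steps), and the shape is the common linear-warmup-then-cosine-decay rule.

```python
import math


def cosine_lr_with_warmup(step, base_lr=2e-5, warmup_steps=500, total_steps=5000):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    base_lr and warmup_steps come from the card; total_steps is an assumption.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine period: base_lr at progress 0, zero at progress 1.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 250, 500, 2750, 5000):
    print(step, cosine_lr_with_warmup(step))
```

With a learning rate of `2e-5`, the rate peaks at step 500 and then decays smoothly; how quickly it decays depends entirely on the assumed `total_steps`.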
+## Dataset
+The model was trained on **ViHSD (Vietnamese Hate Speech Detection Dataset)**, which contains ~10,000 Vietnamese comments from social media.
+### Label Descriptions
+* **CLEAN (0)**: Normal content without offensive language
+* **OFFENSIVE (1)**: Mildly offensive or inappropriate content
+* **HATE (2)**: Hate speech, extremist language, severe threats
+## Evaluation Results
+The model was evaluated on the test set with the following metrics:
+* **Accuracy**: `0.8388`
+* **Macro-F1**: `0.3041`
+* **Weighted-F1**: `0.7652`
+* **Macro-Precision**: `0.2796`
+* **Macro-Recall**: `0.3333`
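When reading these numbers, note that a macro-recall of exactly `0.3333` on three classes, next to a much higher accuracy, is the pattern a classifier produces when it predicts one class for every example. The sketch below checks that reading arithmetically. It is an interpretation, not something the card states: it assumes a single class makes up about 83.88% of the test set and that precision for never-predicted classes is counted as 0.

```python
def constant_predictor_metrics(majority_share, num_classes=3):
    """Metrics of a model that always predicts the most frequent class.

    Assumption: classes that are never predicted get precision, recall,
    and F1 of 0; the predicted class has precision = majority_share and
    recall = 1.
    """
    p = majority_share
    f1_majority = 2 * p / (1 + p)  # harmonic mean of precision p and recall 1
    return {
        "accuracy": p,
        "macro_precision": p / num_classes,
        "macro_recall": 1 / num_classes,
        "macro_f1": f1_majority / num_classes,
        "weighted_f1": p * f1_majority,  # other classes contribute 0
    }


# Values reported in this card.
reported = {
    "accuracy": 0.8388,
    "macro_precision": 0.2796,
    "macro_recall": 0.3333,
    "macro_f1": 0.3041,
    "weighted_f1": 0.7652,
}

computed = constant_predictor_metrics(0.8388)
for name, value in reported.items():
    print(f"{name}: reported={value:.4f} computed={computed[name]:.4f}")
```

All five computed values agree with the reported ones to within about `1e-4`, which is the precision the card reports. If this reading is right, the accuracy mostly reflects class imbalance, and macro-F1 is the more informative figure.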
+### Basic Usage
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Load model and tokenizer.
+# NOTE: this assumes the repository ships a Hugging Face tokenizer; for
+# vocab-based checkpoints, AutoTokenizer.from_pretrained may fail.
+model_name = "visolex/bilstm-hsd"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+# Classify text (the sample means "Vietnamese text to classify")
+text = "Văn bản tiếng Việt cần phân loại"
+inputs = tokenizer(
+    text,
+    return_tensors="pt",
+    padding=True,
+    truncation=True,
+    max_length=256,  # matches the max sequence length used in training
+)
+with torch.no_grad():
+    outputs = model(**inputs)
+    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_label = torch.argmax(probabilities, dim=-1).item()
+
+# Label mapping
+label_names = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}
+print(f"Predicted label: {label_names[predicted_label]}")
+print(f"Confidence scores: {probabilities[0].tolist()}")
+```
+**⚠️ Note for Vocab-based Models**: This model (`bilstm`) uses custom vocabulary-based tokenization and does not include a Hugging Face tokenizer, so the `AutoTokenizer` call in the snippet above is not expected to work as written. You will need to implement custom tokenization or load a tokenizer from a compatible base model. The model expects word-level tokenized input.
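As an illustration of what word-level, vocabulary-based tokenization could look like, here is a minimal sketch. Everything in it is hypothetical: the card does not publish the vocabulary, the special-token ids, or the preprocessing, so `PAD_ID`, `UNK_ID`, `build_vocab`, `encode`, the lower-casing, and the whitespace splitting are all assumptions rather than the model's actual pipeline.

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # hypothetical ids for padding and unknown tokens


def build_vocab(texts, min_freq=1):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Sort by descending frequency, then alphabetically, for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for token, freq in ordered:
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_length=256):
    """Turn text into a fixed-length list of ids (truncate, then pad)."""
    ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()]
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


vocab = build_vocab(["xin chào", "chào bạn"])
print(encode("xin chào lạ", vocab, max_length=5))
```

Whitespace splitting yields syllables rather than full words for much Vietnamese text, so a real pipeline would likely run a word segmenter first and must reuse the exact vocabulary the model was trained with.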
+## Training Details
+### Training Data
+- **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
+- **Total samples**: ~10,000 Vietnamese comments from social media
+- **Training split**: ~70%
+- **Validation split**: ~15%
+- **Test split**: ~15%
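The approximate 70/15/15 split above can be reproduced on any list of examples with a seeded shuffle. This is a sketch only: the card does not say how the split was drawn, and the function name `split_indices`, the seed, and the plain (non-stratified) shuffle are assumptions.

```python
import random


def split_indices(n, train_pct=70, val_pct=15, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/val/test parts.

    Percentages are integers so the cut points use exact integer math;
    the test split receives whatever remains.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 7000 1500 1500
```

With imbalanced labels, a stratified split that preserves class proportions in each part may be preferable to the plain shuffle used here.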
+### Training Configuration
+- **Framework**: PyTorch + Hugging Face Transformers
+- **Optimizer**: AdamW
+- **Learning Rate**: 2e-5
+- **Batch Size**: 32
+- **Max Length**: 256 tokens
+- **Epochs**: 100 (with early stopping patience: 5)
+- **Weight Decay**: 0.01
+- **Warmup Steps**: 500
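The combination of up to 100 epochs with an early stopping patience of 5 can be sketched as follows. The card does not say which metric is monitored, so this sketch assumes a validation score where higher is better; the `EarlyStopping` class and the example scores are hypothetical.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True to stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores, one per epoch, for illustration only.
val_scores = [0.30, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.35]

stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if stopper.step(score):
        stopped_at = epoch
        break

print(f"stopped at epoch index {stopped_at}, best score {stopper.best}")
```

In this run the score stops improving after the second epoch, so training halts five epochs later and never reaches the later, better score; that is the trade-off a small patience value makes.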
+## Contact & Support
+- **GitHub**: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
+- **Issues**: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
+- **Questions**: Open a discussion on the model's Hugging Face page
+## License
+This model is distributed under the MIT License.
+## Acknowledgments
+- Base model: [unknown](https://huggingface.co/unknown)
+- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
+- Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
+- ViSoLex Toolkit
+---