visolex/textcnn-hsd
+---
+language:
+- vi
+tags:
+- hate-speech-detection
+- vietnamese-nlp
+- text-classification
+- offensive-speech
+license: mit
+datasets:
+- vihsd
+base_model: Unknown
+---
+# TEXTCNN
+TextCNN model fine-tuned for hate speech classification.
+## Model Details
+- **Model type**: Fine-tuned text classification model (the name suggests a TextCNN, i.e. a convolutional architecture, rather than a transformer)
+- **Architecture**: Unknown
+- **Base model**: Unknown (not specified)
+- **Task**: Hate Speech Classification
+- **Language**: Vietnamese
+- **Labels**: CLEAN (0), OFFENSIVE (1), HATE (2)
+## Model Performance
+| Metric | Score |
+|--------|-------|
+| Accuracy | 0.8388 |
+| F1 Macro | 0.3041 |
+| F1 Weighted | 0.7652 |
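A quick arithmetic check on these scores: they are numerically consistent with a classifier that assigns CLEAN to every test sample. This is an inference from the reported numbers, not something the card states. The sketch below assumes that CLEAN accounts for a share of the test set equal to the reported accuracy, which is known only to four decimals.

```python
# Check: do the reported scores match an "always predict CLEAN" baseline?
# Assumption: CLEAN makes up a fraction p of the test set, with p taken
# from the reported accuracy.
p = 0.8388

# Always predicting CLEAN gives precision p and recall 1.0 on CLEAN,
# and an F1 of 0 on OFFENSIVE and HATE because they are never predicted.
f1_clean = 2 * p * 1.0 / (p + 1.0)
f1_macro = (f1_clean + 0.0 + 0.0) / 3   # unweighted mean over the 3 labels
f1_weighted = p * f1_clean              # the other two labels contribute 0

print(f"F1 (CLEAN):  {f1_clean:.4f}")
print(f"F1 macro:    {f1_macro:.4f}")
print(f"F1 weighted: {f1_weighted:.4f}")
```

Both averages land within rounding distance of the reported 0.3041 and 0.7652, so per-class metrics or a confusion matrix would be worth checking before relying on the OFFENSIVE and HATE predictions.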
+## Model Description
+This model was fine-tuned from `Unknown` on the ViHSD dataset (Vietnamese Hate Speech Dataset) for hate speech classification.
+## How to Use
+### Basic Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer.
+# Model id as written in this card; the repository header reads
+# "visolex/textcnn-hsd", so adjust the id if loading fails.
+model_name = "visolex/hate-speech-textcnn"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+model.eval()  # inference mode: disables dropout
+# Classify text (Vietnamese for "Vietnamese text to classify")
+text = "Văn bản tiếng Việt cần phân loại"
+# Truncate to the max sequence length used in training (256)
+inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
+with torch.no_grad():
+    outputs = model(**inputs)
+# Convert logits to probabilities and pick the most likely label
+predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+predicted_label = torch.argmax(predictions, dim=-1).item()
+# Label mapping
+label_names = {
+    0: "CLEAN",
+    1: "OFFENSIVE",
+    2: "HATE"
+}
+print(f"Predicted label: {label_names[predicted_label]}")
+print(f"Confidence scores: {predictions[0].tolist()}")
+```
+### Using the Pipeline
+```python
+from transformers import pipeline
+classifier = pipeline(
+    "text-classification",
+    model="visolex/hate-speech-textcnn",
+    tokenizer="visolex/hate-speech-textcnn"
+)
+result = classifier("Văn bản tiếng Việt cần phân loại")
+print(result)
+```
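The pipeline returns a list of dictionaries with `label` and `score` keys. If the model config defines no custom label names, Transformers falls back to generic names such as `LABEL_0`. The helper below is a hypothetical post-processing sketch (`readable_label` and `LABEL_NAMES` are not part of this repository) that maps such names onto the label ids listed in this card.

```python
# Hypothetical helper: turn one text-classification pipeline result into a
# readable label. Assumes generic labels look like "LABEL_<id>" (the
# Transformers default when the config has no custom id2label mapping).
LABEL_NAMES = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}


def readable_label(prediction):
    """Map one {'label': ..., 'score': ...} dict to a (name, score) tuple."""
    raw = prediction["label"]
    if raw.startswith("LABEL_"):
        name = LABEL_NAMES[int(raw.split("_", 1)[1])]
    else:
        name = raw  # the config already provides a human-readable name
    return name, prediction["score"]


# Hand-written result in the pipeline's output format
# (illustrative values, not actual model output).
sample_result = [{"label": "LABEL_2", "score": 0.91}]
print(readable_label(sample_result[0]))
```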
+## Training Details
+### Training Data
+- Dataset: ViHSD (Vietnamese Hate Speech Dataset)
+- Training samples: ~8,000 samples
+- Validation samples: ~1,000 samples
+- Test samples: ~1,000 samples
+### Training Procedure
+- Framework: PyTorch + Transformers
+- Optimizer: AdamW
+- Learning Rate: 2e-5
+- Batch Size: 32
+- Epochs: Varies by model
+- Max Sequence Length: 256
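For reference, the hyperparameters above can be collected into a small config object. This is a sketch only: `TrainingConfig` and `steps_per_epoch` are illustrative names, not taken from the training code, and the epoch count stays unset because the card does not state it.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingConfig:
    """Hyperparameters as listed in this card (class name is illustrative)."""
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    batch_size: int = 32
    max_seq_length: int = 256
    epochs: Optional[int] = None  # "Varies by model"; not stated in the card


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


config = TrainingConfig()
approx_train_samples = 8_000  # the card gives "~8,000", so this is approximate
print(steps_per_epoch(approx_train_samples, config.batch_size))
```

With the approximate training set size from this card, one epoch comes to roughly 250 optimizer steps.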
+### Label Definitions
+- CLEAN (0): Normal content without offensive language
+- OFFENSIVE (1): Mildly offensive content
+- HATE (2): Hate speech and extremist language
+## Evaluation
+The model is evaluated on the ViHSD test set with the following metrics:
+- Accuracy: Overall classification accuracy
+- F1 Macro: Macro-averaged F1 score across all labels
+- F1 Weighted: Weighted F1 score based on label frequency
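To make the difference between the two F1 averages concrete, here is a minimal pure-Python sketch of all three metrics. The function name and the toy labels are illustrative assumptions, not the evaluation code or data behind this card.

```python
from collections import Counter


def classification_scores(y_true, y_pred, labels=(0, 1, 2)):
    """Return (accuracy, macro F1, weighted F1) for integer class labels."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    support = Counter(y_true)  # number of true samples per label
    f1 = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
    # Macro: every label counts equally, however rare it is.
    f1_macro = sum(f1.values()) / len(labels)
    # Weighted: each label's F1 is scaled by its share of the true labels.
    f1_weighted = sum(f1[label] * support[label] for label in labels) / n
    return accuracy, f1_macro, f1_weighted


# Toy labels (0 = CLEAN, 1 = OFFENSIVE, 2 = HATE), made up for illustration.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0]
accuracy, f1_macro, f1_weighted = classification_scores(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro={f1_macro:.4f} weighted={f1_weighted:.4f}")
```

Because macro F1 gives rare labels the same weight as the majority label, it drops sharply when a minority class is never predicted, while accuracy and weighted F1 can stay high.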
+## Limitations and Bias
+- The model was trained only on Vietnamese social media data
+- Performance may degrade on other domains (email, documents, etc.)
+- The model may carry biases from the training data
+- Further evaluation on real-world data is needed
+## Citation
+## Contact
+## License
+This model is distributed under the MIT License.