visolex/bilstm-hsd

@@ -1,172 +0,0 @@
----
-license: mit
-base_model: unknown
-tags:
-- vietnamese
-- hate-speech-detection
-- text-classification
-- offensive-language-detection
-datasets:
-- visolex/vihsd
-metrics:
-- accuracy
-- macro-f1
-- weighted-f1
-model-index:
-- name: bilstm-hsd
-  results:
-  - task:
-      type: text-classification
-      name: Hate Speech Detection
-    dataset:
-      name: ViHSD
-      type: hate-speech-detection
-    metrics:
-    - type: accuracy
-      value: 0.8388
-    - type: macro-f1
-      value: 0.3041
-    - type: weighted-f1
-      value: 0.7652
-    - type: macro-precision
-      value: 0.2796
-    - type: macro-recall
-      value: 0.3333
----
-# BiLSTM: Hate Speech Detection for Vietnamese Text
-This model is a BiLSTM classifier trained on the **ViHSD (Vietnamese Hate Speech Detection Dataset)** to classify Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE.
-No base model is recorded for it (the metadata lists `unknown`).
-## Model Details
-* **Base Model**: unknown
-* **Description**: BiLSTM trained for Vietnamese hate speech detection
-* **Architecture**: BiLSTM (custom architecture; further details are not recorded in this card)
-* **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
-* **Fine-tuning Framework**: HuggingFace Transformers + PyTorch
-* **Task**: Hate Speech Classification (3 classes)
-### Hyperparameters
-* **Batch size**: `32`
-* **Learning rate**: `2e-5`
-* **Epochs**: `100`
-* **Max sequence length**: `256`
-* **Weight decay**: `0.01`
-* **Warmup steps**: `500`
-* **Early stopping patience**: `5`
-* **Optimizer**: AdamW
-* **Learning rate scheduler**: Cosine with warmup
-## Dataset
-The model was trained on **ViHSD (Vietnamese Hate Speech Detection Dataset)**, which contains ~10,000 Vietnamese comments from social media.
-### Label Descriptions
-* **CLEAN (0)**: Normal content without offensive language
-* **OFFENSIVE (1)**: Mildly offensive or inappropriate content
-* **HATE (2)**: Hate speech, extremist language, severe threats
-## Evaluation Results
-The model was evaluated on the test set, with the following results:
-* **Accuracy**: `0.8388`
-* **Macro-F1**: `0.3041`
-* **Weighted-F1**: `0.7652`
-* **Macro-Precision**: `0.2796`
-* **Macro-Recall**: `0.3333`
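-## Basic Usage
-**⚠️ Important**: This model uses a custom architecture, so you must pass `trust_remote_code=True` when loading it. Because the model is vocabulary-based, the snippet below sets `tokenizer = None` and skips classification until custom tokenization is supplied.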
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-model_name = "visolex/bilstm-hsd"
-model_type = "bilstm"  # architecture family of this checkpoint
-# Vocab-based models (bilstm, textcnn) use a custom vocabulary, so the
-# base model's tokenizer may not work and AutoTokenizer is skipped for them.
-if model_type in ("bilstm", "textcnn"):
-    # Custom tokenization based on the model's vocabulary must be implemented
-    print("⚠️  Note: This model uses custom vocabulary-based tokenization")
-    print("   Please refer to the model's documentation for tokenization details")
-    tokenizer = None
-else:
-    # Load the tokenizer from the model repo (it uses the base model's tokenizer)
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-# Load the model; trust_remote_code=True is REQUIRED for custom models
-model = AutoModelForSequenceClassification.from_pretrained(
-    model_name,
-    trust_remote_code=True  # ⚠️ REQUIRED: allows loading custom model classes from models.py
-)
-model.eval()  # disable dropout for inference
-# Label mapping
-label_names = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}
-# Classify text (runs only when a tokenizer is available)
-if tokenizer is not None:
-    text = "Văn bản tiếng Việt cần phân loại"
-    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
-    with torch.no_grad():
-        outputs = model(**inputs)
-    # Convert logits to class probabilities and pick the most likely class
-    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
-    predicted_label = torch.argmax(predictions, dim=-1).item()
-    print(f"Predicted label: {label_names[predicted_label]}")
-    print(f"Confidence scores: {predictions[0].tolist()}")
-else:
-    print("Please implement custom tokenization for this vocab-based model")
-```
-## Training Details
-### Training Data
-- **Dataset**: ViHSD (Vietnamese Hate Speech Detection Dataset)
-- **Total samples**: ~10,000 Vietnamese comments from social media
-- **Training split**: ~70%
-- **Validation split**: ~15%
-- **Test split**: ~15%
-### Training Configuration
-- **Framework**: PyTorch + HuggingFace Transformers
-- **Optimizer**: AdamW
-- **Learning Rate**: 2e-5
-- **Batch Size**: 32
-- **Max Length**: 256 tokens
-- **Epochs**: 100 (with early stopping patience: 5)
-- **Weight Decay**: 0.01
-- **Warmup Steps**: 500
-## Contact & Support
-- **GitHub**: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
-- **Issues**: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
-- **Questions**: Open a discussion on the model's Hugging Face page
-## License
-This model is distributed under the MIT License.
-## Acknowledgments
-- Base model: unknown (none recorded)
-- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
-- Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
-- ViSoLex Toolkit
----
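
The evaluation figures in the card above merit a sanity check: a macro-recall of exactly 0.3333 on a three-class task is what a classifier produces when it assigns the same class to every input. The sketch below assumes, hypothetically, that the model always predicts CLEAN and that CLEAN's share of the test set equals the reported accuracy (0.8388), then recomputes the macro and weighted metrics for comparison. Both assumptions are illustrative and are not stated anywhere in the card; this is an arithmetic check, not a description of how the model was actually evaluated.

```python
# Hypothetical scenario: the classifier predicts CLEAN for every test sample.
# Assumption: CLEAN's share of the test set equals the reported accuracy.
clean_share = 0.8388
num_classes = 3

# Per-class metrics for CLEAN under the always-CLEAN assumption.
clean_precision = clean_share  # correct CLEAN predictions / all predictions
clean_recall = 1.0             # every CLEAN sample is labeled CLEAN
clean_f1 = 2 * clean_precision * clean_recall / (clean_precision + clean_recall)

# OFFENSIVE and HATE are never predicted, so their precision, recall and F1
# are taken as 0 (the usual zero-division convention).
computed = {
    "macro_precision": clean_precision / num_classes,
    "macro_recall": clean_recall / num_classes,
    "macro_f1": clean_f1 / num_classes,
    "weighted_f1": clean_share * clean_f1,  # other classes contribute 0
}

# Values reported in the model card.
reported = {
    "macro_precision": 0.2796,
    "macro_recall": 0.3333,
    "macro_f1": 0.3041,
    "weighted_f1": 0.7652,
}

for name, value in reported.items():
    print(f"{name}: computed={computed[name]:.4f} reported={value:.4f}")
```

Under these assumptions, all four computed values agree with the reported ones to within 0.0001. That suggests the high accuracy may reflect class imbalance rather than any ability to detect OFFENSIVE or HATE content; a per-class report or confusion matrix on the test set would confirm or rule this out.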