---
license: mit
base_model: unknown
tags:
- vietnamese
- hate-speech-detection
- text-classification
- offensive-language-detection
datasets:
- visolex/vihsd
metrics:
- accuracy
- macro-f1
- weighted-f1
model-index:
- name: bilstm-hsd
  results:
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      name: ViHSD
      type: hate-speech-detection
    metrics:
    - type: accuracy
      value: 0.8388
    - type: macro-f1
      value: 0.3041
    - type: weighted-f1
      value: 0.7652
    - type: macro-precision
      value: 0.2796
    - type: macro-recall
      value: 0.3333
---

# BiLSTM: Hate Speech Detection for Vietnamese Text

This BiLSTM model was trained on ViHSD (Vietnamese Hate Speech Detection Dataset) to classify Vietnamese text into three categories: CLEAN, OFFENSIVE, and HATE. The model card metadata does not record a base model (`base_model: unknown`).

	## Model Details

* Base Model: unknown (none recorded)
* Description: BiLSTM fine-tuned for Vietnamese hate speech detection
* Architecture: BiLSTM (bidirectional LSTM), as indicated by the model name; further details are not recorded
* Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
* Fine-tuning Framework: Hugging Face Transformers + PyTorch
* Task: Hate speech classification (3 classes)

	### Hyperparameters

	* Batch size: `32`
	* Learning rate: `2e-5`
	* Epochs: `100`
	* Max sequence length: `256`
	* Weight decay: `0.01`
	* Warmup steps: `500`
	* Early stopping patience: `5`
	* Optimizer: AdamW
	* Learning rate scheduler: Cosine with warmup
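
The "cosine with warmup" schedule listed above can be sketched in a few lines. This is a minimal illustration, assuming the common formulation (linear warmup from zero, then a half-cosine decay to zero); the card does not say which implementation was used, and `TOTAL_STEPS` is a hypothetical value because the number of training steps is not reported.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 2e-5,
          warmup_steps: int = 500) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL_STEPS = 5000  # hypothetical: the card does not report the step count

for step in (0, 250, 500, 2750, 5000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

With a warmup of 500 steps, the peak rate of `2e-5` is reached at step 500 and falls back to half of that midway through the remaining steps.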

	## Dataset

The model was trained on ViHSD (Vietnamese Hate Speech Detection Dataset), which contains ~10,000 Vietnamese comments from social media.

### Label Descriptions

	* CLEAN (0): Normal content without offensive language
	* OFFENSIVE (1): Mildly offensive or inappropriate content
	* HATE (2): Hate speech, extremist language, severe threats

	## Evaluation Results

The model was evaluated on the test set with the following metrics:

	* Accuracy: `0.8388`
	* Macro-F1: `0.3041`
	* Weighted-F1: `0.7652`
	* Macro-Precision: `0.2796`
	* Macro-Recall: `0.3333`
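
When reading these numbers, note that a macro-recall of exactly 1/3 on a three-class task, together with a macro-precision equal to one third of the accuracy, is what a classifier produces when it assigns a single label to every example. The arithmetic below is a sanity check, not a statement from the authors: it assumes a hypothetical predictor that always outputs CLEAN, that CLEAN makes up 83.88% of the test set, and that never-predicted classes score 0 (the usual zero-division convention).

```python
# Hypothetical always-CLEAN predictor; clean_share is assumed to equal
# the reported accuracy, i.e. the fraction of CLEAN comments in the test set.
clean_share = 0.8388
num_classes = 3

# Per-class scores for CLEAN under that predictor.
clean_precision = clean_share  # all predictions are CLEAN; this share is correct
clean_recall = 1.0             # every true CLEAN comment is labelled CLEAN
clean_f1 = 2 * clean_precision * clean_recall / (clean_precision + clean_recall)

# OFFENSIVE and HATE are never predicted, so their precision, recall and F1
# are all taken as 0.
derived = {
    "macro-precision": clean_precision / num_classes,
    "macro-recall": clean_recall / num_classes,
    "macro-f1": clean_f1 / num_classes,
    "weighted-f1": clean_share * clean_f1,
}

# Figures reported in this model card.
reported = {
    "macro-precision": 0.2796,
    "macro-recall": 0.3333,
    "macro-f1": 0.3041,
    "weighted-f1": 0.7652,
}

for name, value in derived.items():
    print(f"{name}: derived {value:.4f}, reported {reported[name]:.4f}")
```

All four derived values agree with the reported ones to within rounding, so the accuracy figure may largely reflect the share of CLEAN comments in the test set rather than the ability to detect OFFENSIVE or HATE content. Per-class metrics or a confusion matrix would be needed to confirm this.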

## Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer.
# NOTE: AutoTokenizer only works if the repository ships tokenizer files;
# see the note on vocab-based models below.
model_name = "visolex/bilstm-hsd"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Classify text (the example string means "Vietnamese text to classify")
text = "Văn bản tiếng Việt cần phân loại"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1).item()

# Label mapping
label_names = {
    0: "CLEAN",
    1: "OFFENSIVE",
    2: "HATE",
}

print(f"Predicted label: {label_names[predicted_label]}")
print(f"Confidence scores: {predictions[0].tolist()}")
```


⚠️ Note for vocab-based models: This model (`bilstm`) uses custom vocabulary-based tokenization and does not include a Hugging Face tokenizer, so the `AutoTokenizer` call above may fail. You will need to implement custom tokenization or load a tokenizer from a compatible base model. The model expects word-level tokenized input.
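
As a rough illustration of what custom word-level tokenization could look like, the sketch below maps whitespace-separated words to integer ids and pads or truncates to a fixed length. Everything in it is an assumption: the special tokens `<pad>` and `<unk>`, lowercasing, whitespace splitting, and the helper names `build_vocab` and `encode` are hypothetical and are not taken from the ViSoLex code. For real inference, the vocabulary must be the one used at training time, not one rebuilt from new text.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct lowercased word."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_length: int = 256) -> List[int]:
    """Convert text to ids, truncating or right-padding to max_length."""
    ids = [vocab.get(word, vocab[UNK]) for word in text.lower().split()]
    ids = ids[:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))


# Toy vocabulary for illustration only.
vocab = build_vocab(["xin chào", "chào bạn"])
print(encode("xin chào mọi", vocab, max_length=5))  # → [2, 3, 1, 0, 0]
```

Whitespace splitting yields Vietnamese syllables rather than segmented words, so a pipeline that relies on word segmentation would need a segmenter before this step.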


	## Training Details

	### Training Data
	- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
	- Total samples: ~10,000 Vietnamese comments from social media
	- Training split: ~70%
	- Validation split: ~15%
	- Test split: ~15%
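
For orientation, the approximate split above corresponds to about 7,000 / 1,500 / 1,500 comments. The sketch below shows one way to produce such a split with a seeded shuffle; the function name `split_indices`, the seed, and the use of a plain random (non-stratified) split are assumptions, since the card does not describe how the split was made.

```python
import random
from typing import List, Tuple


def split_indices(n: int, train: float = 0.70, val: float = 0.15,
                  seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n * train)
    n_val = round(n * val)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])  # the remainder is the test split


train_idx, val_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # → 7000 1500 1500
```

Given how imbalanced the labels appear to be, a stratified split that preserves class proportions in each part may be preferable in practice.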

	### Training Configuration
	- Framework: PyTorch + HuggingFace Transformers
	- Optimizer: AdamW
	- Learning Rate: 2e-5
	- Batch Size: 32
	- Max Length: 256 tokens
	- Epochs: 100 (with early stopping patience: 5)
	- Weight Decay: 0.01
	- Warmup Steps: 500
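
Early stopping with a patience of 5 means training halts once the monitored validation score has failed to improve for five consecutive evaluations, so the 100-epoch budget is an upper bound. A minimal sketch follows, assuming a higher-is-better score evaluated once per epoch; the class name `EarlyStopping`, the monitored metric, and the example scores are hypothetical and not taken from the training code.

```python
class EarlyStopping:
    """Signal a stop after `patience` evaluations without improvement."""

    def __init__(self, patience: int = 5) -> None:
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def step(self, score: float) -> bool:
        """Record one validation score; return True if training should stop."""
        if score > self.best:
            self.best = score
            self.bad_evals = 0  # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical per-epoch validation scores (not from the model card).
scores = [0.30, 0.31, 0.31, 0.30, 0.31, 0.30, 0.29, 0.28]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, score in enumerate(scores, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break

print(stopped_at)  # → 7
```

In this toy run the best score is reached at epoch 2 and the next five epochs bring no improvement, so training stops at epoch 7.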


	## Contact & Support

	- GitHub: [ViSoLex Hate Speech Detection](https://github.com/visolex/hate-speech-detection)
	- Issues: [Report Issues](https://github.com/visolex/hate-speech-detection/issues)
	- Questions: Open a discussion on the model's Hugging Face page

	## License

	This model is distributed under the MIT License.

	## Acknowledgments

- Base model: unknown (none recorded)
	- Dataset: ViHSD (Vietnamese Hate Speech Detection Dataset)
	- Framework: [Hugging Face Transformers](https://huggingface.co/transformers)
	- ViSoLex Toolkit

	---