---
license: apache-2.0
base_model: vinai/bartpho-syllable
tags:
- vietnamese
- spam-detection
- text-classification
- e-commerce
datasets:
- ViSpamReviews
metrics:
- accuracy
- macro-f1
- macro-precision
- macro-recall
model-index:
- name: bartpho-spam-binary
  results:
  - task:
      type: text-classification
      name: Spam Review Detection
    dataset:
      name: ViSpamReviews
      type: ViSpamReviews
    metrics:
    - type: accuracy
      value: 0.8751
    - type: macro-f1
      value: 0.8358
---

# bartpho-spam-binary: Spam Review Detection for Vietnamese Text

This model is a fine-tuned version of [vinai/bartpho-syllable](https://huggingface.co/vinai/bartpho-syllable) on the **ViSpamReviews** dataset for spam review detection in Vietnamese e-commerce reviews.

## Model Details

* **Base Model**: `vinai/bartpho-syllable`
* **Description**: BARTpho (syllable-level), a BART model pre-trained for Vietnamese
* **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
* **Fine-tuning Framework**: HuggingFace Transformers
* **Task**: Spam Review Detection (binary)
* **Number of Classes**: 2

### Hyperparameters

* Max sequence length: `256`
* Learning rate: `5e-5`
* Batch size: `32`
* Epochs: `100`
* Early stopping patience: `5`

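With an epoch budget of 100 and a patience of 5, training can end before the epoch limit once the validation metric stops improving. The following is a minimal sketch of patience-based early stopping, not the authors' training code; the function name `stopping_epoch`, the "higher is better" metric, and the example scores are illustrative assumptions.

```python
from typing import Sequence

MAX_EPOCHS = 100  # epoch budget listed in this card
PATIENCE = 5      # early stopping patience listed in this card


def stopping_epoch(val_scores: Sequence[float],
                   max_epochs: int = MAX_EPOCHS,
                   patience: int = PATIENCE) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation score (higher is better) has not
    improved for `patience` consecutive epochs, or when the epoch budget
    or the supplied scores run out.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return last_epoch


# Made-up validation scores: the best value is first reached at epoch 3,
# so training stops 5 epochs later, at epoch 8.
scores = [0.70, 0.78, 0.81, 0.80, 0.81, 0.79, 0.80, 0.80, 0.82]
print(stopping_epoch(scores))  # 8
```

Note that the later score of `0.82` is never seen in this sketch, which is the usual trade-off of a small patience value.
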
## Dataset

The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce review samples, split as follows:

* **Train set**: 14,299 samples (72%)
* **Validation set**: 1,590 samples (8%)
* **Test set**: 3,971 samples (20%)

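As a quick consistency check, the split sizes can be summed and converted to percentages. This is plain arithmetic on the numbers reported in this card; the dictionary name `splits` is only illustrative.

```python
# Split sizes as reported in this model card.
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}

total = sum(splits.values())
print(total)  # 19860

for name, size in splits.items():
    # Share of the full dataset, rounded to the nearest percent.
    print(f"{name}: {size / total:.0%}")
```
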
### Label Distribution

* **Non-spam** (0): Genuine product reviews
* **Spam** (1): Fake or promotional reviews

## Results

The model was evaluated on the test set with the following metrics:

* **Accuracy**: `0.8751`
* **Macro-F1**: `0.8358`

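Macro-F1 is the unweighted mean of the per-class F1 scores, so the spam and non-spam classes count equally regardless of how many samples each has. The sketch below computes accuracy and macro-F1 for the two labels used here; the helper names and the toy labels are illustrative assumptions and do not reproduce the reported numbers.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (0 = Non-spam, 1 = Spam); not real model output.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(f"accuracy = {accuracy(y_true, y_pred):.4f}")  # accuracy = 0.6667
print(f"macro-F1 = {macro_f1(y_true, y_pred):.4f}")  # macro-F1 = 0.6250
```

In the toy example the minority class has a lower F1 (0.50) than the majority class (0.75), which pulls macro-F1 below accuracy.
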
## Usage

You can use this model for spam review detection in Vietnamese text. Below is an example:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "visolex/bartpho-spam-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example review text ("This product is very good, the shop delivers fast!")
text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"

# Tokenize, truncating to the 256-token limit used for fine-tuning
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    probabilities = torch.softmax(outputs.logits, dim=-1)

# Map to label
label_map = {0: "Non-spam", 1: "Spam"}
predicted_label = label_map[predicted_class]
confidence = probabilities[0][predicted_class].item()

print(f"Text: {text}")
print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
```

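Picking the class with the highest probability is, for two classes, the same as labelling a review as spam when its spam probability exceeds 0.5. If false positives are costly, a stricter threshold can be applied to the softmax output instead. The sketch below is pure Python and works on a pair of logits such as `outputs.logits[0].tolist()`; the function name `classify_with_threshold` and the threshold value `0.8` are illustrative assumptions, not settings recommended by the model authors.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_threshold(logits: Sequence[float],
                            spam_threshold: float = 0.5) -> Tuple[str, float]:
    """Label a review as Spam only if P(spam) exceeds `spam_threshold`.

    `logits` is [non_spam_logit, spam_logit], matching labels 0 and 1.
    """
    spam_prob = softmax(logits)[1]
    label = "Spam" if spam_prob > spam_threshold else "Non-spam"
    return label, spam_prob


# Hypothetical logits for one review; P(spam) is about 0.71.
example_logits = [0.2, 1.1]
print(classify_with_threshold(example_logits))       # labelled Spam at 0.5
print(classify_with_threshold(example_logits, 0.8))  # labelled Non-spam at 0.8
```
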
## Citation

If you use this model, please cite:

```bibtex
@misc{{model_key}_spam_detection,
  title={{description}},
  author={ViSoLex Team},
  year={2025},
  howpublished={\url{https://huggingface.co/visolex/bartpho-spam-binary}}
}
```

## License

This model is released under the Apache-2.0 license.

## Acknowledgments

* Base model: [vinai/bartpho-syllable](https://huggingface.co/vinai/bartpho-syllable)
* Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
* ViSoLex Toolkit for Vietnamese NLP