AnnyNguyen committed · Commit 6799311 · verified · 1 Parent(s): 4969edb

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +139 -0
README.md ADDED
@@ -0,0 +1,139 @@
+ ---
+ license: apache-2.0
+ base_model: vinai/phobert-base
+ tags:
+ - vietnamese
+ - spam-detection
+ - text-classification
+ - e-commerce
+ datasets:
+ - ViSpamReviews
+ metrics:
+ - accuracy
+ - macro-f1
+ - macro-precision
+ - macro-recall
+ model-index:
+ - name: textcnn-spam-multi-class
+   results:
+   - task:
+       type: text-classification
+       name: Spam Review Detection
+     dataset:
+       name: ViSpamReviews
+       type: ViSpamReviews
+     metrics:
+     - type: accuracy
+       value: 0.7220
+     - type: macro-f1
+       value: 0.2096
+ ---
+ # textcnn-spam-multi-class: Spam Review Detection for Vietnamese Text
+
+ This model is a fine-tuned version of [vinai/phobert-base](https://huggingface.co/vinai/phobert-base) on the **ViSpamReviews** dataset for spam review detection in Vietnamese e-commerce reviews.
+
+ ## Model Details
+
+ * **Base Model**: `vinai/phobert-base`
+ * **Description**: TextCNN - Convolutional Neural Network for text
+ * **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
+ * **Fine-tuning Framework**: HuggingFace Transformers
+ * **Task**: Spam Review Detection (multi-class)
+ * **Number of Classes**: 4
+
+ ### Hyperparameters
+
+ * Max sequence length: `256`
+ * Learning rate: `5e-5`
+ * Batch size: `32`
+ * Epochs: `100`
+ * Early stopping patience: `5`
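
The interaction between the 100-epoch cap and the patience of 5 can be sketched as follows. This is a minimal illustration, not the training code used for this model: the function name `epochs_run`, the choice of a higher-is-better validation score, and the score values are all assumptions.

```python
def epochs_run(val_scores, max_epochs=100, patience=5):
    """Return (epochs actually run, best epoch) under patience-based early stopping.

    val_scores: hypothetical per-epoch validation scores (higher is better).
    """
    best_score = float("-inf")
    best_epoch = 0
    bad_epochs = 0
    epochs = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        epochs = epoch
        if score > best_score:
            # Strict improvement: remember it and reset the patience counter.
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs, best_epoch


# Made-up validation curve: improves for three epochs, then plateaus.
scores = [0.15, 0.18, 0.21, 0.20, 0.21, 0.19, 0.20, 0.20, 0.18, 0.17]
print(epochs_run(scores))
```

With a patience of 5, training on this made-up curve stops well before the epoch cap, so the `100` above is best read as an upper bound rather than the number of epochs actually run.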
+
+ ## Dataset
+
+ The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce reviews. The data is split as follows:
+
+ * **Train set**: 14,299 samples (72%)
+ * **Validation set**: 1,590 samples (8%)
+ * **Test set**: 3,971 samples (20%)
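
As a quick sanity check, the split sizes above can be summed and converted to percentages. The snippet only uses the counts stated in this section; the variable names are illustrative.

```python
# Split sizes as reported above.
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}

total = sum(splits.values())
# Share of each split, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)
print(shares)
```

The counts add up to the stated 19,860 samples, and the rounded shares match the 72% / 8% / 20% figures.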
+
+ ### Label Distribution
+
+ * **NO-SPAM** (0): Genuine product reviews
+ * **SPAM-1** (1): Fake reviews (synthetic or manipulated)
+ * **SPAM-2** (2): Brand-only reviews (mention the brand without product details)
+ * **SPAM-3** (3): Irrelevant reviews (unrelated content)
+
+ ## Results
+
+ The model was evaluated on the test set with the following metrics:
+
+ * **Accuracy**: `0.7220`
+ * **Macro-F1**: `0.2096`
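
Macro-F1 averages the per-class F1 scores with equal weight, so it can sit far below accuracy when some classes are rarely predicted correctly. The sketch below shows one way such a gap can arise, using made-up labels (not ViSpamReviews data and not this model's predictions). The helper `macro_f1` is a hypothetical pure-Python implementation written for this illustration.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Made-up, imbalanced 4-class labels: 70 of class 0, 10 each of classes 1-3.
y_true = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels=[0, 1, 2, 3])
print(f"accuracy = {accuracy:.4f}")
print(f"macro-F1 = {score:.4f}")
```

In this toy setup only one of four classes contributes a non-zero F1, which is why the macro average is low even though most predictions are correct. Per-class metrics or a confusion matrix would be needed to tell whether something similar is happening for the reported results.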
+
+ ## Usage
+
+ You can use this model for spam review detection in Vietnamese text. Below is an example:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "visolex/textcnn-spam-multiclass"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example review text
+ # (English: "This product is very good, the shop delivered quickly!")
+ text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"
+
+ # Tokenize, truncating to the max sequence length used in training
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+
+ # Predict without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predicted_class = outputs.logits.argmax(dim=-1).item()
+     probabilities = torch.softmax(outputs.logits, dim=-1)
+
+ # Map the class index to its label
+ label_map = {
+     0: "NO-SPAM",
+     1: "SPAM-1 (fake review)",
+     2: "SPAM-2 (brand-only)",
+     3: "SPAM-3 (irrelevant)",
+ }
+ predicted_label = label_map[predicted_class]
+ confidence = probabilities[0][predicted_class].item()
+
+ print(f"Text: {text}")
+ print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
+ ```
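
The post-processing at the end of the example (softmax, argmax, label lookup) can be tried without downloading the model. The sketch below reproduces those steps in plain Python on a made-up logits vector. The function name `decode_logits` and the logits are assumptions for illustration only, not part of the model's API.

```python
import math

LABEL_MAP = {
    0: "NO-SPAM",
    1: "SPAM-1 (fake review)",
    2: "SPAM-2 (brand-only)",
    3: "SPAM-3 (irrelevant)",
}


def decode_logits(logits):
    """Return (label, confidence) for one example's raw class scores."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability, i.e. the argmax.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


# Hypothetical logits for a single review, not real model output.
label, confidence = decode_logits([2.0, 0.5, -1.0, 0.1])
print(f"Predicted: {label} (confidence: {confidence:.2%})")
```

The reported confidence is simply the softmax probability of the winning class, so it should be read as a relative score rather than a calibrated probability.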
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{textcnn_spam_detection,
+   title={TextCNN - Convolutional Neural Network for text},
+   author={ViSoLex Team},
+   year={2025},
+   howpublished={\url{https://huggingface.co/visolex/textcnn-spam-multiclass}}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Apache-2.0 license.
+
+ ## Acknowledgments
+
+ * Base model: [vinai/phobert-base](https://huggingface.co/vinai/phobert-base)
+ * Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
+ * ViSoLex Toolkit for Vietnamese NLP