AnnyNguyen committed on
Commit
774112f
·
verified ·
1 Parent(s): 927da28

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
+ ---
+ language:
+ - vi
+ tags:
+ - hate-speech-detection
+ - vietnamese-nlp
+ - text-classification
+ - offensive-speech
+ license: mit
+ datasets:
+ - vihsd
+ base_model: Unknown
+ ---
+
+ # TEXTCNN
+
+ TextCNN fine-tuned for hate speech classification.
+
+ ## Model Details
+
+ - **Model type**: TextCNN text classifier (a convolutional network, not a transformer)
+ - **Architecture**: TextCNN (layer configuration not specified)
+ - **Base model**: Unknown
+ - **Task**: Hate Speech Classification
+ - **Language**: Vietnamese
+ - **Labels**: CLEAN (0), OFFENSIVE (1), HATE (2)
+
+ ## 📊 Model Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Accuracy | 0.8388 |
+ | F1 Macro | 0.3041 |
+ | F1 Weighted | 0.7652 |
+
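The gap between accuracy (0.8388) and macro F1 (0.3041) in the table above is large for a three-class task. The arithmetic below is a minimal check, under the assumption (not stated in the card) that the model predicts CLEAN for every test example. Under that assumption the macro and weighted F1 values come out within rounding of the reported scores, which suggests the OFFENSIVE and HATE classes may be detected rarely or not at all.

```python
# Reported accuracy from the performance table.
accuracy = 0.8388

# Assumption for this check only: every test example is predicted as CLEAN,
# so accuracy equals the share of CLEAN examples in the test set.
clean_share = accuracy
precision_clean = clean_share  # all predictions are CLEAN; this share is correct
recall_clean = 1.0             # every true CLEAN example is predicted CLEAN
f1_clean = 2 * precision_clean * recall_clean / (precision_clean + recall_clean)

# OFFENSIVE and HATE are never predicted under this assumption, so their F1 is 0.
f1_macro = (f1_clean + 0.0 + 0.0) / 3
f1_weighted = clean_share * f1_clean

print(f"F1 macro    = {f1_macro:.4f}")
print(f"F1 weighted = {f1_weighted:.4f}")
```

A per-class confusion matrix on the test set would confirm or rule out this reading.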
+ ## Model Description
+
+ TextCNN fine-tuned for hate speech classification. This model was fine-tuned from `Unknown` on the ViHSD dataset (Vietnamese Hate Speech Dataset).
+
+ ## How to Use
+
+ ### Basic Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer.
+ # Note: TextCNN is not a built-in Transformers architecture, so loading through
+ # the Auto classes assumes the repository ships a compatible config and weights.
+ model_name = "visolex/hate-speech-textcnn"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Classify text (the sample string means "Vietnamese text to classify")
+ text = "Văn bản tiếng Việt cần phân loại"
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+
+ # Run inference without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     predicted_label = torch.argmax(predictions, dim=-1).item()
+
+ # Label mapping
+ label_names = {
+     0: "CLEAN",
+     1: "OFFENSIVE",
+     2: "HATE"
+ }
+
+ print(f"Predicted label: {label_names[predicted_label]}")
+ print(f"Confidence scores: {predictions[0].tolist()}")
+ ```
+
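The post-processing step in the basic usage snippet (softmax over the logits, then argmax, then label lookup) can be illustrated without loading the model. This is a sketch in plain Python; the logits are hypothetical values chosen for illustration, and `softmax` and `decode` are helper names introduced here, not part of any library.

```python
import math

# Label ids as listed in the model card.
LABEL_NAMES = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


# Hypothetical logits for one input, in label order CLEAN, OFFENSIVE, HATE.
example_logits = [2.0, 0.5, -1.0]
label, prob = decode(example_logits)
print(label, round(prob, 4))
```

The probabilities always sum to 1, so the reported "confidence" is relative to the other two classes rather than an absolute measure of certainty.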
+ ### Using the Pipeline
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline(
+     "text-classification",
+     model="visolex/hate-speech-textcnn",
+     tokenizer="visolex/hate-speech-textcnn"
+ )
+
+ result = classifier("Văn bản tiếng Việt cần phân loại")
+ print(result)
+ ```
+
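A text-classification pipeline returns a list of dictionaries with `label` and `score` keys. When a model config does not define its own label names, Transformers falls back to generic names of the form `LABEL_<id>`. Whether this repository's config defines names is not stated in the card, so the sketch below assumes the generic form may appear and maps it to the class names above. `readable_label` is a hypothetical helper, and the `result` value is made-up sample output, not a real prediction.

```python
# Label ids as listed in the model card.
LABEL_NAMES = {0: "CLEAN", 1: "OFFENSIVE", 2: "HATE"}


def readable_label(raw_label):
    """Map a generic 'LABEL_<id>' string to its class name; pass other labels through."""
    prefix = "LABEL_"
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        return LABEL_NAMES.get(int(suffix), raw_label)
    return raw_label


# Hypothetical pipeline output in the usual list-of-dicts shape.
result = [{"label": "LABEL_2", "score": 0.91}]
named = [{**r, "label": readable_label(r["label"])} for r in result]
print(named)
```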
+ ## Training Details
+
+ ### Training Data
+ - Dataset: ViHSD (Vietnamese Hate Speech Dataset)
+ - Training samples: ~8,000 samples
+ - Validation samples: ~1,000 samples
+ - Test samples: ~1,000 samples
+
+ ### Training Procedure
+ - Framework: PyTorch + Transformers
+ - Optimizer: AdamW
+ - Learning Rate: 2e-5
+ - Batch Size: 32
+ - Epochs: Varies by model
+ - Max Sequence Length: 256
+
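As a rough sanity check on the figures listed above, the number of optimizer steps per epoch and the largest possible batch size in tokens follow directly from the sample count, batch size, and maximum sequence length. The sample count is approximate in the card, so the results are approximate too; the variable names are chosen here for illustration.

```python
import math

# Values quoted from the training lists; the sample count is approximate.
train_samples = 8_000
batch_size = 32
max_seq_len = 256

# One optimizer step per batch; the last batch may be smaller than the rest.
steps_per_epoch = math.ceil(train_samples / batch_size)

# Upper bound on token positions per batch, reached only if every example
# is padded or truncated to the full maximum length.
max_tokens_per_batch = batch_size * max_seq_len

print(steps_per_epoch, max_tokens_per_batch)
```

Because the card gives the epoch count as "Varies by model", the total number of training steps cannot be derived from these values alone.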
+ ### Label Definitions
+ - CLEAN (0): Normal content without offensive language
+ - OFFENSIVE (1): Mildly offensive content
+ - HATE (2): Hate speech and extremist language
+
+ ## Evaluation
+
+ The model was evaluated on the ViHSD test set with the following metrics:
+ - Accuracy: Overall classification accuracy
+ - F1 Macro: Macro-averaged F1 score across all labels
+ - F1 Weighted: Weighted F1 score based on label frequency
+
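The three metrics listed above differ only in how per-label results are combined. The sketch below computes them from scratch in plain Python so the difference is visible: macro F1 averages the per-label F1 scores equally, while weighted F1 weights each label by its number of true examples. The toy labels are hypothetical and are not ViHSD data; `f1_scores` is a helper written for this example, not the evaluation code used for the card.

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels):
    """Return per-label F1, macro F1, and weighted F1 (weights = true label counts)."""
    support = Counter(y_true)
    per_label = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A label with no true or predicted examples gets F1 = 0.
        per_label[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_label.values()) / len(labels)
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return per_label, macro, weighted


# Hypothetical toy labels (0 = CLEAN, 1 = OFFENSIVE, 2 = HATE).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_label, macro, weighted = f1_scores(y_true, y_pred, labels=[0, 1, 2])
print(accuracy, macro, weighted)
```

On imbalanced data, macro F1 is the more sensitive of the two F1 variants to poor performance on rare labels, since each label counts equally regardless of frequency.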
+ ## Limitations and Bias
+
+ - The model was trained only on Vietnamese social media data
+ - Performance may degrade on other domains (email, documents, etc.)
+ - The model may carry biases from the training data
+ - Further evaluation on real-world data is needed
+
+ ## Citation
+
+
+ ## Contact
+
+
+ ## License
+
+ This model is distributed under the MIT License.