AnnyNguyen committed
Commit 7b2e4c4 · verified · 1 Parent(s): aec1181

Upload README.md with huggingface_hub

Files changed (1):
  README.md (+49 −49)
README.md CHANGED
@@ -1,16 +1,14 @@
 ---
-language: vi
 tags:
-- hate-speech-detection
 - vietnamese
-- transformer
-license: apache-2.0
 datasets:
 - visolex/ViHOS
-metrics:
-- precision
-- recall
-- f1
 model-index:
 - name: visobert-hsd-span
   results:
@@ -18,64 +16,66 @@ model-index:
       type: token-classification
       name: Hate Speech Span Detection
     dataset:
-      name: ViHOS
-      type: custom
     metrics:
-    - name: Precision
-      type: precision
-      value: <INSERT_PRECISION>
-    - name: Recall
-      type: recall
-      value: <INSERT_RECALL>
-    - name: F1 Score
-      type: f1
-      value: <INSERT_F1>
-base_model:
-- uitnlp/visobert
-pipeline_tag: token-classification
 ---
 
-# ViSoBERT-HSD-Span
-
-This model is fine-tuned from [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert) on the **visolex/ViHOS** dataset for span-level hate/offensive detection in Vietnamese comments.
 
 ## Model Details
 
-* **Base Model**: [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert)
-* **Dataset**: [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS)
-* **Fine-tuning**: HuggingFace Transformers
 
 ### Hyperparameters
 
-* Batch size: `16`
-* Learning rate: `5e-5`
-* Epochs: `100`
-* Max sequence length: `128`
-* Early stopping: `5`
 
 ## Usage
 
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 
-tokenizer = AutoTokenizer.from_pretrained("visolex/visobert-hsd-span")
-model = AutoModelForTokenClassification.from_pretrained("visolex/visobert-hsd-span")
-
-text = "Nói cái lol . t thấy thô tục vl"
-inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
 with torch.no_grad():
-    outputs = model(**inputs)
-    logits = outputs.logits  # [batch, seq_len, num_labels]
-# For binary: use sigmoid, for multi-class: use softmax+argmax
-probs = torch.sigmoid(logits)
-preds = (probs > 0.5).long().squeeze().tolist()  # [seq_len]
-tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
-
-span_labels = [p[0] for p in preds]
-
-# Take tokens with span label = 1, dropping <s> and </s> if desired
-span_tokens = [token for token, label in zip(tokens, span_labels) if label == 1 and token not in ['<s>', '</s>']]
-
-print("Span tokens:", span_tokens)
-print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
-```
 
 ---
+license: apache-2.0
+base_model: visobert
 tags:
 - vietnamese
+- hate-speech
+- span-detection
+- token-classification
+- nlp
 datasets:
 - visolex/ViHOS
 model-index:
 - name: visobert-hsd-span
   results:
       type: token-classification
       name: Hate Speech Span Detection
     dataset:
+      name: visolex/ViHOS
+      type: visolex/ViHOS
     metrics:
+    - type: f1
+      value: N/A
+    - type: precision
+      value: N/A
+    - type: recall
+      value: N/A
+    - type: exact_match
+      value: 0.1230
 ---
 
+# visobert-hsd-span: Hate Speech Span Detection (Vietnamese)
 
+This model is a fine-tuned version of [visobert](https://huggingface.co/visobert) for Vietnamese **Hate Speech Span Detection**.
 
 ## Model Details
 
+- Base Model: `visobert`
+- Description: Vietnamese Hate Speech Span Detection
+- Framework: HuggingFace Transformers
+- Task: Hate Speech Span Detection (token/char-level spans)
 
 ### Hyperparameters
 
+- Max sequence length: `64`
+- Learning rate: `5e-6`
+- Batch size: `32`
+- Epochs: `100`
+- Early stopping patience: `5`
+
+## Results
+
+- F1: `N/A`
+- Precision: `N/A`
+- Recall: `N/A`
+- Exact Match: `0.1230`
 
 ## Usage
 
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
 
+model_name = "visobert-hsd-span"
+tok = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+text = "Ví dụ câu tiếng Việt nội dung thù ghét ..."
+enc = tok(text, return_tensors="pt", truncation=True, max_length=256, is_split_into_words=False)
 with torch.no_grad():
+    logits = model(**enc).logits
+pred_ids = logits.argmax(-1)[0].tolist()
+# TODO: convert pred_ids -> spans according to your label scheme (BIO/BILOU/char-offset)
+```
 
+## License
 
+Apache-2.0
 
+## Acknowledgments
+
+- Base model: [visobert](https://huggingface.co/visobert)
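
---

The updated usage snippet in this commit ends with a TODO for converting `pred_ids` into spans. A minimal sketch of that step, assuming the simple binary scheme from the earlier README (label `1` marks tokens inside a hateful span) and character offsets obtained by tokenizing with `return_offsets_mapping=True`; `merge_token_spans` is a hypothetical helper, not part of the model repo:

```python
def merge_token_spans(offsets, labels, positive_label=1):
    """Merge per-token character offsets whose predicted label equals
    `positive_label` into contiguous (start, end) character spans.

    `offsets` is the tokenizer's offset_mapping; (0, 0) entries
    (special tokens such as <s>/</s>) carry no surface text and are skipped.
    """
    spans, current = [], None
    for (start, end), label in zip(offsets, labels):
        if start == end:  # special token, skip
            continue
        if label == positive_label:
            # extend the currently open span, or open a new one
            current = (current[0], end) if current is not None else (start, end)
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans


# Example: the tokens covering chars 4-7 and 8-11 are labeled 1,
# so they merge into a single character span (4, 11).
offsets = [(0, 0), (0, 3), (4, 7), (8, 11), (12, 15), (0, 0)]
labels = [0, 0, 1, 1, 0, 0]
print(merge_token_spans(offsets, labels))  # [(4, 11)]
```

In practice `offsets` would come from `tok(text, return_offsets_mapping=True)["offset_mapping"]` and `labels` from `logits.argmax(-1)`; whether label `1` actually marks in-span tokens depends on the model's `id2label` config, which this commit does not show.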