Improve language tag

#5
by lbourdois - opened
Files changed (1)
  1. README.md +121 -107
README.md CHANGED
@@ -1,108 +1,122 @@
1
- ---
2
- license: apache-2.0
3
- library_name: transformers
4
- pipeline_tag: text-classification
5
- base_model:
6
- - Qwen/Qwen2.5-0.5B
7
- ---
8
-
9
- ## Overview
10
- A brief description of what this model does and how it’s unique or relevant:
11
-
12
- - **Goal**: Classify input text sequences for safety.
13
- - **Model Description**: DuoGuard-0.5B is a multilingual, decoder-only LLM-based classifier specifically designed for safety content moderation across 12 distinct subcategories. Each forward pass produces a 12-dimensional logits vector, where each dimension corresponds to a specific content risk area, such as violent crimes, hate, or sexual content. By applying a sigmoid function to these logits, users obtain a multi-label probability distribution, which allows for fine-grained detection of potentially unsafe or disallowed content.
14
- For simplified binary moderation tasks, the model can be used to produce a single “safe”/“unsafe” label by taking the maximum of the 12 subcategory probabilities and comparing it to a given threshold (e.g., 0.5). If the maximum probability across all categories is above the threshold, the content is deemed “unsafe.” Otherwise, it is considered “safe.”
15
-
16
- DuoGuard-0.5B is built upon Qwen 2.5 (0.5B), a multilingual large language model supporting 29 languages—including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. DuoGuard-0.5B is specialized (fine-tuned) for safety content moderation primarily in English, French, German, and Spanish, while still retaining the broader language coverage inherited from the Qwen 2.5 base model. It is provided with open weights.
17
-
18
- It is presented in the paper [DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails](https://huggingface.co/papers/2502.05163).
19
-
20
- For code, see https://github.com/yihedeng9/DuoGuard.
21
-
22
- ## How to Use
23
- A quick code snippet on how to load and use the model in an application:
24
- ```python
25
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
26
- import torch
27
-
28
- # 1. Initialize the tokenizer
29
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
30
- tokenizer.pad_token = tokenizer.eos_token
31
-
32
- # 2. Load the DuoGuard-0.5B model
33
- model = AutoModelForSequenceClassification.from_pretrained(
34
- "DuoGuard/DuoGuard-0.5B",
35
- torch_dtype=torch.bfloat16
36
- ).to('cuda:0')
37
-
38
- # 3. Define a sample prompt to test
39
- prompt = "How to kill a python process?"
40
-
41
- # 4. Tokenize the prompt
42
- inputs = tokenizer(
43
- prompt,
44
- return_tensors="pt",
45
- truncation=True,
46
- max_length=512 # adjust as needed
47
- ).to('cuda:0')
48
-
49
- # 5. Run the model (inference)
50
- with torch.no_grad():
51
-     outputs = model(**inputs)
52
- # DuoGuard outputs a 12-dimensional vector (one probability per subcategory).
53
- logits = outputs.logits # shape: (batch_size, 12)
54
- probabilities = torch.sigmoid(logits) # element-wise sigmoid
55
-
56
- # 6. Multi-label predictions (one for each category)
57
- threshold = 0.5
58
- category_names = [
59
- "Violent crimes",
60
- "Non-violent crimes",
61
- "Sex-related crimes",
62
- "Child sexual exploitation",
63
- "Specialized advice",
64
- "Privacy",
65
- "Intellectual property",
66
- "Indiscriminate weapons",
67
- "Hate",
68
- "Suicide and self-harm",
69
- "Sexual content",
70
- "Jailbreak prompts",
71
- ]
72
-
73
- # Extract probabilities for the single prompt (batch_size = 1)
74
- prob_vector = probabilities[0].tolist() # shape: (12,)
75
-
76
- predicted_labels = []
77
- for cat_name, prob in zip(category_names, prob_vector):
78
-     label = 1 if prob > threshold else 0
79
-     predicted_labels.append(label)
80
-
81
- # 7. Overall binary classification: "safe" vs. "unsafe"
82
- # We consider the prompt "unsafe" if ANY category is above the threshold.
83
- max_prob = max(prob_vector)
84
- overall_label = 1 if max_prob > threshold else 0 # 1 => unsafe, 0 => safe
85
-
86
- # 8. Print results
87
- print(f"Prompt: {prompt}\n")
88
- print(f"Multi-label Probabilities (threshold={threshold}):")
89
- for cat_name, prob, label in zip(category_names, prob_vector, predicted_labels):
90
-     print(f" - {cat_name}: {prob:.3f} (label={label})")
91
-
92
- print(f"\nMaximum probability across all categories: {max_prob:.3f}")
93
- print(f"Overall Prompt Classification => {'UNSAFE' if overall_label == 1 else 'SAFE'}")
94
- ```
95
-
96
- ### Citation
97
-
98
- ```bibtex
99
- @misc{deng2025duoguardtwoplayerrldrivenframework,
100
- title={DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails},
101
- author={Yihe Deng and Yu Yang and Junkai Zhang and Wei Wang and Bo Li},
102
- year={2025},
103
- eprint={2502.05163},
104
- archivePrefix={arXiv},
105
- primaryClass={cs.CL},
106
- url={https://arxiv.org/abs/2502.05163},
107
- }
108
  ```
 
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: text-classification
5
+ base_model:
6
+ - Qwen/Qwen2.5-0.5B
7
+ language:
8
+ - zho
9
+ - eng
10
+ - fra
11
+ - spa
12
+ - por
13
+ - deu
14
+ - ita
15
+ - rus
16
+ - jpn
17
+ - kor
18
+ - vie
19
+ - tha
20
+ - ara
21
+ ---
22
+
23
+ ## Overview
24
+ A brief description of what this model does and how it’s unique or relevant:
25
+
26
+ - **Goal**: Classify input text sequences for safety.
27
+ - **Model Description**: DuoGuard-0.5B is a multilingual, decoder-only LLM-based classifier specifically designed for safety content moderation across 12 distinct subcategories. Each forward pass produces a 12-dimensional logits vector, where each dimension corresponds to a specific content risk area, such as violent crimes, hate, or sexual content. By applying a sigmoid function to these logits, users obtain a multi-label probability distribution, which allows for fine-grained detection of potentially unsafe or disallowed content.
28
+ For simplified binary moderation tasks, the model can be used to produce a single “safe”/“unsafe” label by taking the maximum of the 12 subcategory probabilities and comparing it to a given threshold (e.g., 0.5). If the maximum probability across all categories is above the threshold, the content is deemed “unsafe.” Otherwise, it is considered “safe.”
29
+
30
+ DuoGuard-0.5B is built upon Qwen 2.5 (0.5B), a multilingual large language model supporting 29 languages—including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. DuoGuard-0.5B is specialized (fine-tuned) for safety content moderation primarily in English, French, German, and Spanish, while still retaining the broader language coverage inherited from the Qwen 2.5 base model. It is provided with open weights.
31
+
32
+ It is presented in the paper [DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails](https://huggingface.co/papers/2502.05163).
33
+
34
+ For code, see https://github.com/yihedeng9/DuoGuard.
35
+
36
+ ## How to Use
37
+ A quick code snippet on how to load and use the model in an application:
38
+ ```python
39
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
40
+ import torch
41
+
42
+ # 1. Initialize the tokenizer
43
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
44
+ tokenizer.pad_token = tokenizer.eos_token
45
+
46
+ # 2. Load the DuoGuard-0.5B model
47
+ model = AutoModelForSequenceClassification.from_pretrained(
48
+ "DuoGuard/DuoGuard-0.5B",
49
+ torch_dtype=torch.bfloat16
50
+ ).to('cuda:0')
51
+
52
+ # 3. Define a sample prompt to test
53
+ prompt = "How to kill a python process?"
54
+
55
+ # 4. Tokenize the prompt
56
+ inputs = tokenizer(
57
+ prompt,
58
+ return_tensors="pt",
59
+ truncation=True,
60
+ max_length=512 # adjust as needed
61
+ ).to('cuda:0')
62
+
63
+ # 5. Run the model (inference)
64
+ with torch.no_grad():
65
+     outputs = model(**inputs)
66
+ # DuoGuard outputs a 12-dimensional vector (one probability per subcategory).
67
+ logits = outputs.logits # shape: (batch_size, 12)
68
+ probabilities = torch.sigmoid(logits) # element-wise sigmoid
69
+
70
+ # 6. Multi-label predictions (one for each category)
71
+ threshold = 0.5
72
+ category_names = [
73
+ "Violent crimes",
74
+ "Non-violent crimes",
75
+ "Sex-related crimes",
76
+ "Child sexual exploitation",
77
+ "Specialized advice",
78
+ "Privacy",
79
+ "Intellectual property",
80
+ "Indiscriminate weapons",
81
+ "Hate",
82
+ "Suicide and self-harm",
83
+ "Sexual content",
84
+ "Jailbreak prompts",
85
+ ]
86
+
87
+ # Extract probabilities for the single prompt (batch_size = 1)
88
+ prob_vector = probabilities[0].tolist() # shape: (12,)
89
+
90
+ predicted_labels = []
91
+ for cat_name, prob in zip(category_names, prob_vector):
92
+     label = 1 if prob > threshold else 0
93
+     predicted_labels.append(label)
94
+
95
+ # 7. Overall binary classification: "safe" vs. "unsafe"
96
+ # We consider the prompt "unsafe" if ANY category is above the threshold.
97
+ max_prob = max(prob_vector)
98
+ overall_label = 1 if max_prob > threshold else 0 # 1 => unsafe, 0 => safe
99
+
100
+ # 8. Print results
101
+ print(f"Prompt: {prompt}\n")
102
+ print(f"Multi-label Probabilities (threshold={threshold}):")
103
+ for cat_name, prob, label in zip(category_names, prob_vector, predicted_labels):
104
+     print(f" - {cat_name}: {prob:.3f} (label={label})")
105
+
106
+ print(f"\nMaximum probability across all categories: {max_prob:.3f}")
107
+ print(f"Overall Prompt Classification => {'UNSAFE' if overall_label == 1 else 'SAFE'}")
108
+ ```
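
The thresholding logic in steps 6–7 above can be factored into a small reusable helper. This is a minimal sketch operating only on the 12 sigmoid probabilities; `moderate` is a hypothetical name not part of the DuoGuard codebase, and the dummy probabilities stand in for `torch.sigmoid(logits)[0].tolist()`:

```python
from typing import Dict, List, Tuple

# Same 12 subcategories as in the snippet above, in model output order.
CATEGORY_NAMES: List[str] = [
    "Violent crimes", "Non-violent crimes", "Sex-related crimes",
    "Child sexual exploitation", "Specialized advice", "Privacy",
    "Intellectual property", "Indiscriminate weapons", "Hate",
    "Suicide and self-harm", "Sexual content", "Jailbreak prompts",
]

def moderate(prob_vector: List[float], threshold: float = 0.5) -> Tuple[str, Dict[str, int]]:
    """Map the 12 per-category sigmoid probabilities to binary labels
    and an overall SAFE/UNSAFE verdict (unsafe if any category exceeds
    the threshold)."""
    per_category = {name: int(p > threshold)
                    for name, p in zip(CATEGORY_NAMES, prob_vector)}
    verdict = "UNSAFE" if max(prob_vector) > threshold else "SAFE"
    return verdict, per_category

# Dummy probabilities standing in for a real model output.
probs = [0.02] * 12
probs[8] = 0.91  # "Hate"
verdict, labels = moderate(probs)
print(verdict)         # UNSAFE
print(labels["Hate"])  # 1
```

Keeping this post-processing separate from the model call makes it easy to tune the threshold per category or per deployment without re-running inference.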
109
+
110
+ ### Citation
111
+
112
+ ```bibtex
113
+ @misc{deng2025duoguardtwoplayerrldrivenframework,
114
+ title={DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails},
115
+ author={Yihe Deng and Yu Yang and Junkai Zhang and Wei Wang and Bo Li},
116
+ year={2025},
117
+ eprint={2502.05163},
118
+ archivePrefix={arXiv},
119
+ primaryClass={cs.CL},
120
+ url={https://arxiv.org/abs/2502.05163},
121
+ }
122
  ```