Song Yi committed on
Commit dcdbb4d · verified · 1 Parent(s): 620e23c

Create SAFETY_POLICY.yaml

Files changed (1)
  1. SAFETY_POLICY.yaml +322 -0
SAFETY_POLICY.yaml ADDED
@@ -0,0 +1,322 @@
+ safety_policy:
+   version: 1.0.0
+   last_updated: 2025-12-22
+   enforcement_level: strict
+
+ overview:
+   description: Comprehensive safety policy for Kirim OSS Safeguard R1 10B
+   scope: All model interactions and deployments
+   compliance_required: true
+
+ prohibited_content:
+   hate_speech:
+     enabled: true
+     severity: critical
+     description: Content that promotes hatred or discrimination
+     examples:
+       - Racial or ethnic slurs
+       - Religious intolerance
+       - Gender-based discrimination
+       - LGBTQ+ hate speech
+       - Xenophobia or nationalism
+     action: refuse_and_report
+
+   violence:
+     enabled: true
+     severity: critical
+     description: Content promoting violence or harm
+     categories:
+       - Physical violence
+       - Self-harm instructions
+       - Terrorism or extremism
+       - Animal cruelty
+       - Weapons manufacturing
+     action: refuse_and_report
+
+   sexual_content:
+     enabled: true
+     severity: critical
+     description: Explicit sexual or inappropriate content
+     categories:
+       - Explicit sexual content
+       - Content involving minors
+       - Non-consensual activities
+       - Sexual exploitation
+       - Inappropriate requests
+     action: refuse_and_report
+
+   illegal_activity:
+     enabled: true
+     severity: critical
+     description: Content related to illegal activities
+     categories:
+       - Drug manufacturing or trafficking
+       - Theft or fraud instructions
+       - Hacking or unauthorized access
+       - Copyright infringement
+       - Money laundering
+       - Human trafficking
+     action: refuse_and_report
+
+   harassment:
+     enabled: true
+     severity: high
+     description: Content that harasses or bullies individuals
+     categories:
+       - Personal attacks
+       - Doxing or privacy violations
+       - Stalking behaviors
+       - Workplace harassment
+       - Cyberbullying
+     action: refuse_and_warn
+
+   misinformation:
+     enabled: true
+     severity: high
+     description: Deliberately false or misleading information
+     categories:
+       - Health misinformation
+       - Election interference
+       - Financial fraud
+       - Conspiracy theories
+       - Scientific denial
+     action: correct_and_inform
+
+   privacy_violations:
+     enabled: true
+     severity: high
+     description: Content violating privacy or data protection
+     categories:
+       - Personal information disclosure
+       - Identity theft
+       - Surveillance instructions
+       - Data breaches
+       - Unauthorized tracking
+     action: refuse_and_warn
+
+   spam_manipulation:
+     enabled: true
+     severity: medium
+     description: Spam, manipulation, or deceptive practices
+     categories:
+       - Mass spam generation
+       - SEO manipulation
+       - Review fraud
+       - Social media manipulation
+       - Phishing attempts
+     action: refuse
+
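To make the category-to-action mapping above concrete, here is a minimal Python sketch of how a consumer might look up the configured action for a category. The `severity` and `action` values are copied from the policy; the `PROHIBITED_CONTENT` dict, the `resolve_action` helper, and the choice to return `None` for disabled or unknown categories are assumptions made for illustration, not something the commit defines.

```python
from typing import Optional

# Values copied from the prohibited_content section of the policy.
# The dict layout itself is a hypothetical in-memory representation.
PROHIBITED_CONTENT = {
    "hate_speech": {"enabled": True, "severity": "critical", "action": "refuse_and_report"},
    "violence": {"enabled": True, "severity": "critical", "action": "refuse_and_report"},
    "sexual_content": {"enabled": True, "severity": "critical", "action": "refuse_and_report"},
    "illegal_activity": {"enabled": True, "severity": "critical", "action": "refuse_and_report"},
    "harassment": {"enabled": True, "severity": "high", "action": "refuse_and_warn"},
    "misinformation": {"enabled": True, "severity": "high", "action": "correct_and_inform"},
    "privacy_violations": {"enabled": True, "severity": "high", "action": "refuse_and_warn"},
    "spam_manipulation": {"enabled": True, "severity": "medium", "action": "refuse"},
}


def resolve_action(category: str) -> Optional[str]:
    """Return the configured action for a category.

    Assumption: unknown or disabled categories yield None, meaning
    "no policy action"; the policy file does not specify this case.
    """
    entry = PROHIBITED_CONTENT.get(category)
    if entry is None or not entry["enabled"]:
        return None
    return entry["action"]


def categories_with_severity(severity: str) -> list:
    """List category names that carry the given severity, in file order."""
    return [name for name, entry in PROHIBITED_CONTENT.items()
            if entry["severity"] == severity]
```

In this sketch every `critical` category resolves to `refuse_and_report`, which matches the pattern visible in the file.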
+ safety_modes:
+   strict:
+     description: Maximum safety with conservative filtering
+     false_positive_tolerance: low
+     refusal_threshold: 0.3
+     recommended_for:
+       - Public-facing applications
+       - Educational platforms
+       - Child-safe environments
+       - High-risk industries
+
+   moderate:
+     description: Balanced safety and functionality
+     false_positive_tolerance: medium
+     refusal_threshold: 0.5
+     recommended_for:
+       - General enterprise use
+       - Customer service
+       - Content creation
+       - Research applications
+
+   lenient:
+     description: Minimal restrictions for specialized use
+     false_positive_tolerance: high
+     refusal_threshold: 0.7
+     recommended_for:
+       - Research environments
+       - Content moderation testing
+       - Red team exercises
+     warnings:
+       - Requires additional monitoring
+       - Not recommended for public deployment
+       - Should include human oversight
+
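The policy does not say how `refusal_threshold` is evaluated. One plausible reading, sketched below, is that a classifier emits a risk score in [0, 1] and the request is refused once the score reaches the mode's threshold. Under that assumption, `strict` (0.3) refuses more readily than `lenient` (0.7). The `should_refuse` function, the `risk_score` input, and the `>=` comparison are all hypothetical.

```python
# Thresholds copied from the safety_modes section of the policy.
REFUSAL_THRESHOLDS = {"strict": 0.3, "moderate": 0.5, "lenient": 0.7}


def should_refuse(risk_score: float, mode: str = "strict") -> bool:
    """Decide whether to refuse, assuming a risk score in [0, 1].

    Assumption: refuse when the score is at or above the threshold.
    The policy file does not define the comparison or the score scale.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError(f"risk_score must be in [0, 1], got {risk_score}")
    if mode not in REFUSAL_THRESHOLDS:
        raise ValueError(f"unknown safety mode: {mode!r}")
    return risk_score >= REFUSAL_THRESHOLDS[mode]
```

With this reading, a score of 0.4 is refused in `strict` mode but allowed in `moderate` and `lenient`.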
+ response_protocols:
+   refuse_and_report:
+     actions:
+       - Decline request clearly
+       - Log incident with details
+       - Alert monitoring system
+       - Provide explanation to user
+     message_template: "I cannot help with this request as it violates our safety policies regarding {category}. This interaction has been logged."
+
+   refuse_and_warn:
+     actions:
+       - Decline request politely
+       - Log incident
+       - Provide policy explanation
+     message_template: "I'm unable to assist with this request as it may involve {category}. Please review our usage policies."
+
+   correct_and_inform:
+     actions:
+       - Provide accurate information
+       - Cite reliable sources
+       - Explain misconception
+     message_template: "I should clarify some information about this topic..."
+
+   refuse:
+     actions:
+       - Decline request
+       - Brief explanation
+     message_template: "I cannot assist with this type of request."
+
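The `{category}` placeholder in the templates above happens to match Python's `str.format` syntax, so one way to render a response is sketched here. The template strings are copied from the policy; the `render_message` helper and the choice to replace underscores with spaces in the category name are assumptions for illustration.

```python
# Templates copied verbatim from the response_protocols section.
MESSAGE_TEMPLATES = {
    "refuse_and_report": (
        "I cannot help with this request as it violates our safety policies "
        "regarding {category}. This interaction has been logged."
    ),
    "refuse_and_warn": (
        "I'm unable to assist with this request as it may involve {category}. "
        "Please review our usage policies."
    ),
    "correct_and_inform": "I should clarify some information about this topic...",
    "refuse": "I cannot assist with this type of request.",
}


def render_message(action: str, category: str) -> str:
    """Fill the template for an action with a human-readable category.

    Assumption: category keys such as "hate_speech" are shown to users
    with underscores replaced by spaces. Templates without a {category}
    placeholder are returned unchanged, because str.format ignores
    unused keyword arguments.
    """
    template = MESSAGE_TEMPLATES[action]
    return template.format(category=category.replace("_", " "))
```

For example, `render_message("refuse_and_warn", "privacy_violations")` yields the warning text with "privacy violations" substituted in.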
+ special_categories:
+   medical_advice:
+     policy: informational_only
+     restrictions:
+       - No diagnosis
+       - No treatment recommendations
+       - No prescription advice
+     allowed:
+       - General health information
+       - Encourage consulting professionals
+
+   legal_advice:
+     policy: informational_only
+     restrictions:
+       - No specific legal advice
+       - No case strategy
+       - No contract review
+     allowed:
+       - General legal concepts
+       - Encourage consulting attorneys
+
+   financial_advice:
+     policy: educational_only
+     restrictions:
+       - No investment recommendations
+       - No specific stock picks
+       - No tax advice
+     allowed:
+       - General financial concepts
+       - Educational information
+
+   minors:
+     policy: enhanced_protection
+     requirements:
+       - Age-appropriate content only
+       - Extra safety measures
+       - No personal information collection
+       - Educational focus
+
+ adversarial_robustness:
+   jailbreak_detection:
+     enabled: true
+     techniques:
+       - Prompt injection detection
+       - Role-play exploitation detection
+       - Encoding bypass detection
+       - Multi-turn manipulation detection
+
+   evasion_prevention:
+     enabled: true
+     methods:
+       - Context analysis
+       - Intent recognition
+       - Pattern matching
+       - Semantic understanding
+
+ monitoring_reporting:
+   logging:
+     level: comprehensive
+     retention_period_days: 90
+     pii_redaction: enabled
+
+   metrics:
+     tracked:
+       - Safety filter triggers
+       - Refusal rates by category
+       - False positive rates
+       - User feedback
+       - Adversarial attempts
+
+   alerting:
+     critical_threshold: 10_violations_per_hour
+     review_required: 100_violations_per_day
+     auto_shutdown: 1000_violations_per_hour
+
+   reporting:
+     frequency: daily
+     recipients:
+       - safety_team@kirim.ai
+       - compliance@kirim.ai
+     format: json
+
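The alerting values above are plain strings such as `10_violations_per_hour`, so a consumer has to parse them before comparing against observed counts. The sketch below shows one way to do that. The three threshold strings come from the policy; the regular expression, the `parse_threshold` and `triggered_alerts` helpers, and the "at or above the limit" comparison are assumptions.

```python
import re

# Threshold strings copied from the alerting section of the policy.
ALERTING = {
    "critical_threshold": "10_violations_per_hour",
    "review_required": "100_violations_per_day",
    "auto_shutdown": "1000_violations_per_hour",
}

# Hypothetical pattern: "<count>_violations_per_<hour|day>".
_THRESHOLD_PATTERN = re.compile(r"^(\d+)_violations_per_(hour|day)$")


def parse_threshold(value: str) -> tuple:
    """Split a threshold string into (limit, window)."""
    match = _THRESHOLD_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized threshold format: {value!r}")
    return int(match.group(1)), match.group(2)


def triggered_alerts(violations_last_hour: int, violations_last_day: int) -> list:
    """Return the names of alert rules whose limit has been reached.

    Assumption: a rule fires when the observed count for its window is
    at or above the limit; the policy does not state the comparison.
    """
    observed = {"hour": violations_last_hour, "day": violations_last_day}
    fired = []
    for name, raw in ALERTING.items():
        limit, window = parse_threshold(raw)
        if observed[window] >= limit:
            fired.append(name)
    return fired
```

Storing the limit and the window as separate numeric and string fields in the YAML would avoid the parsing step, at the cost of a slightly more verbose file.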
+ compliance:
+   regulations:
+     - GDPR
+     - CCPA
+     - COPPA
+     - EU AI Act
+     - Digital Services Act
+
+   certifications:
+     - ISO 27001
+     - SOC 2 Type II
+
+   auditing:
+     frequency: quarterly
+     external_review: annual
+
+ user_controls:
+   feedback:
+     enabled: true
+     channels:
+       - In-app reporting
+       - "Email: safety@kirim.ai"
+       - Web form
+
+   appeals:
+     process: available
+     response_time: 48_hours
+     review_team: human_moderators
+
+   transparency:
+     policy_access: public
+     explanation_provided: always
+     reasoning_available: on_request
+
+ continuous_improvement:
+   model_updates:
+     frequency: monthly
+     testing_required: true
+     rollback_plan: available
+
+   policy_review:
+     frequency: quarterly
+     stakeholder_input: required
+     public_comment: encouraged
+
+   research:
+     active: true
+     areas:
+       - Bias detection and mitigation
+       - Adversarial robustness
+       - Cross-cultural safety
+       - Emerging risks
+
+ exceptions:
+   research_exemptions:
+     available: true
+     requirements:
+       - IRB approval
+       - Clear research purpose
+       - Additional safeguards
+       - Regular reporting
+
+   educational_use:
+     available: true
+     requirements:
+       - Verified institution
+       - Supervised environment
+       - Clear educational purpose
+