---
language: tr
license: other
license_name: siriusai-premium-v1
license_link: LICENSE
tags:
- turkish
- content-moderation
- multi-label-classification
- text-classification
- safety
- moderation
- bert
- nlp
- transformers
base_model: dbmdz/bert-base-turkish-uncased
datasets:
- custom
metrics:
- f1
- precision
- recall
- accuracy
- mcc
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: turkish-safety
  results:
  - task:
      type: text-classification
      name: Multi-label Content Safety Classification
    metrics:
    - type: f1
      value: 0.9165
      name: Macro F1
    - type: mcc
      value: 0.9045
      name: Matthews Correlation Coefficient
---
# Turkish Safety - Content Moderation Classifier v5.0

Multi-label classification model for Turkish content moderation, developed by the SiriusAI Tech Brain Team.

## Mission

Empowering digital platforms with AI-driven content safety solutions.

Turkish Safety is an advanced NLP model that analyzes Turkish content in real time and detects harmful content across 7 categories. It provides comprehensive content moderation for social media platforms, messaging applications, in-game chats, and community forums.
## Why This Model Matters

- **7 Risk Categories**: Detects SAFE, GROOMING, SEXUAL, OFFENSIVE, BULLYING, SELF_HARM, and THREAT
- **Turkish-First Design**: Optimized for Turkish linguistics and cultural context using BERTurk
- **Production-Ready**: <50 ms inference, battle-tested architecture, enterprise-grade reliability
- **Multi-Label Intelligence**: Smart classification that understands content can belong to multiple categories
- **Expert Validation**: Curated training data with clear category boundaries and edge-case handling
## Model Overview

| Property | Value |
|---|---|
| Architecture | BERT (Bidirectional Encoder Representations from Transformers) |
| Base Model | dbmdz/bert-base-turkish-uncased (BERTurk) |
| Task | Multi-label Text Classification |
| Language | Turkish (tr) |
| Categories | 7 content safety labels |
| Model Size | 443 MB (FP32) |
| Inference Time | ~10-15 ms (GPU) / ~40-50 ms (CPU) |
## Performance Metrics

### Final Evaluation Results (Epoch 2)

| Metric | Score | Description |
|---|---|---|
| Macro F1 | 0.9165 | Unweighted mean of per-class F1 (each class's harmonic mean of precision and recall) |
| MCC | 0.9045 | Matthews Correlation Coefficient (robust multi-class metric) |
| Eval Loss | 0.0268 | Focal loss on the validation set |
### Training Progress

| Epoch | Train Loss | Eval Loss | Macro F1 | MCC |
|---|---|---|---|---|
| 1 | 0.038 | 0.0282 | 0.9085 | 0.8957 |
| 2 | 0.038 | 0.0268 | 0.9165 | 0.9045 |
### Validation Test Results (86.4% Accuracy)

Manual edge-case evaluation on 22 hand-picked samples (19/22 correct):

| Category | Test Cases | Correct | Notes |
|---|---|---|---|
| SAFE | 5 | 4 | One false positive (a genuine compliment was flagged as OFFENSIVE) |
| GROOMING | 4 | 2 | Misses on boundary cases overlapping SEXUAL/THREAT |
| SEXUAL | 3 | 3 | Perfect detection |
| OFFENSIVE | 3 | 3 | Perfect detection |
| THREAT | 3 | 3 | Perfect detection |
| SELF_HARM | 2 | 2 | Perfect detection |
| BULLYING | 2 | 2 | Perfect detection |
## Dataset

### Dataset Statistics

| Split | Samples | Purpose |
|---|---|---|
| Train | 68,128 | Model training |
| Test | 17,033 | Model evaluation |
| Total | 85,161 | Complete dataset |
### Category Distribution (Full Dataset)

| Category | Samples | Percentage | Description |
|---|---|---|---|
| SAFE | 25,488 | 29.9% | Benign, normal communication |
| SELF_HARM | 14,234 | 16.7% | Self-harm ideation, suicidal thoughts |
| BULLYING | 13,259 | 15.6% | Harassment, exclusion, cyberbullying |
| THREAT | 9,193 | 10.8% | Physical threats, violence, blackmail |
| SEXUAL | 8,642 | 10.1% | Sexual content, body comments |
| GROOMING | 7,517 | 8.8% | Manipulation, trust-building tactics |
| OFFENSIVE | 6,849 | 8.0% | Profanity, slurs, offensive language |
### Subcategory Breakdown

| Category | Subcategories |
|---|---|
| SAFE | greetings (1,958), farewells (1,485), wellbeing_questions (2,900), daily_conversation (2,435), weather_talk (1,445), food_drink (1,481), normal_questions (1,861), school_talk (1,961), family_talk (1,487), hobbies_games (1,455), sports_talk (1,000), tech_internet (994), genuine_compliments (1,000), encouragement (1,000), appreciation (1,000), apology_understanding (998), help_cooperation (1,000) |
| GROOMING | secrecy (953), isolation (729), trust_manipulation (792), meeting_private (701), gift_promise (565), age_questioning (688), private_communication (628), emotional_manipulation (654), normalization (655), excessive_flattery (559), testing_boundaries (583) |
| THREAT | physical_violence (1,307), weapon_threat (936), blackmail (1,168), family_threat (1,071), implicit_threat (906), revenge (947), death_threat (886), social_threat (930), stalking_threat (532), property_threat (500) |
| OFFENSIVE | insults (1,286), cursing_sik (1,535), cursing_am (1,398), cursing_ana_orospu (1,383), derogatory (849), mockery (398) |
| SEXUAL | explicit_content (1,085), sexual_body_focus (1,612), sexual_invitation (1,237), pornographic (1,060), sexual_questions (1,232), romantic_pressure (1,030), inappropriate_comments (856), sexual_fantasy (530) |
| BULLYING | exclusion (1,904), mockery_repeated (1,690), emotional_abuse (1,678), appearance_attack (1,490), public_humiliation (1,091), intimidation (979), cyberbullying (1,138), name_calling (1,178), spreading_rumors (1,000), academic_bullying (1,111) |
| SELF_HARM | hopelessness (1,923), giving_up (1,690), not_waking_up (1,435), suicide_ideation (1,413), self_harm_plan (1,532), burden_feeling (1,018), worthlessness (1,037), isolation_feeling (1,025), goodbye_messages (807), self_blame (894), depression_signs (1,452) |
### Data Generation Methodology

- **Synthetic Generation**: LLM-based generation with expert-defined category boundaries
- **Hard Negative Mining**: Difficult edge cases for boundary discrimination
- **Quality Filtering**: Duplicate detection, minimum word count, forbidden-token filtering (see the sketch below)
- **Parallel Processing**: 20 concurrent workers with a batch size of 50
- **Pass Rate**: 97.5% average acceptance rate across all categories
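The filtering code itself is not published with this card; the sketch below merely illustrates the three checks named above. `MIN_WORDS` and `FORBIDDEN_TOKENS` are hypothetical placeholders, not the team's actual parameters.

```python
# Hypothetical sketch of the quality filters described above; the real
# pipeline, thresholds, and token lists are not published with this card.
MIN_WORDS = 3                               # assumed minimum word count
FORBIDDEN_TOKENS = {"<think>", "[INST]"}    # assumed LLM-generation artifacts

def filter_samples(samples):
    seen, kept = set(), []
    for text in samples:
        normalized = " ".join(text.lower().split())
        if normalized in seen:                            # duplicate detection
            continue
        if len(normalized.split()) < MIN_WORDS:           # minimum word count
            continue
        if any(tok in text for tok in FORBIDDEN_TOKENS):  # forbidden tokens
            continue
        seen.add(normalized)
        kept.append(text)
    return kept
```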
## Label Definitions

The model classifies text into 7 mutually non-exclusive categories:

| Label | ID | Description | Turkish Examples |
|---|---|---|---|
| SAFE | 0 | Benign, normal communication | "Bugün hava güzel" ("The weather is nice today"), "Oyun oynayalım mı?" ("Shall we play a game?") |
| OFFENSIVE | 1 | Profanity, slurs, offensive language | "Aptal mısın" ("Are you stupid?"), "Salak herif" ("Stupid jerk") |
| SELF_HARM | 2 | Self-harm ideation, suicidal thoughts | "Ölmek istiyorum" ("I want to die"), "Kendimi kesmek istiyorum" ("I want to cut myself") |
| GROOMING | 3 | Manipulation, trust-building, isolation tactics | "Kimseye söyleme" ("Don't tell anyone"), "Sen özelsin" ("You're special"), "Evime gel" ("Come to my home") |
| BULLYING | 4 | Harassment, exclusion, cyberbullying | "Kimse seninle oynamak istemiyor" ("Nobody wants to play with you"), "Çirkinsin" ("You're ugly") |
| SEXUAL | 5 | Sexual content, body comments, inappropriate questions | "Vücudun güzel" ("Your body is beautiful"), "Hiç öpüştün mü?" ("Have you ever kissed?"), "Ne giyiyorsun?" ("What are you wearing?") |
| THREAT | 6 | Physical threats, violence, blackmail | "Seni döverim" ("I'll beat you up"), "Fotoğrafını yayarım" ("I'll spread your photo") |
### Important: Category Boundaries

**GROOMING vs SEXUAL distinction:**

- **GROOMING**: Non-sexual manipulation tactics (trust-building, secrecy, gift promises, meeting requests)
- **SEXUAL**: Any body-related comments, physical compliments, sexual questions, explicit content

Examples:

- "Kimseye söyleme tamam mı?" ("Don't tell anyone, okay?") → GROOMING (secrecy/isolation)
- "Vücudun çok güzel" ("Your body is very beautiful") → SEXUAL (body comment)
- "Telefon alırım sana" ("I'll buy you a phone") → GROOMING (gift promise)
- "Dudakların çok güzel" ("Your lips are very beautiful") → SEXUAL (body-focused compliment)
- "Gel evime yalnızım" ("Come to my place, I'm alone") → GROOMING (meeting request/isolation)
- "Hiç öpüştün mü?" ("Have you ever kissed?") → SEXUAL (sexual-experience question)
## Training Procedure

### Hyperparameters

| Parameter | Value |
|---|---|
| Base Model | dbmdz/bert-base-turkish-uncased |
| Max Sequence Length | 64 tokens |
| Batch Size | 16 (effective 32 with gradient accumulation) |
| Gradient Accumulation | 2 steps |
| Learning Rate | 2e-5 (with cosine restarts) |
| Epochs | 2 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.1 |
| Loss Function | Focal Loss (gamma=1.2) |
| Label Smoothing | 0.05 |
| Problem Type | Multi-label Classification |
| Evaluation Strategy | Per epoch |
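The training code is not released. The sketch below shows one common BCE-based multi-label focal loss with the reported gamma=1.2 and 0.05 label smoothing; it is an assumption about the formulation used, not the team's implementation.

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 1.2, smoothing: float = 0.05) -> torch.Tensor:
    # Label smoothing: pull hard 0/1 targets toward the interior.
    targets = targets * (1 - smoothing) + 0.5 * smoothing
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability the model assigns to the target
    return ((1 - p_t) ** gamma * bce).mean()
```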
### Training Environment

| Resource | Specification |
|---|---|
| Hardware | Apple M1 Pro (MPS) |
| Framework | PyTorch 2.x + Transformers 4.37+ |
| Training Time | ~14 minutes (864 seconds) |
| Throughput | 157.8 samples/second |
| Steps | 4,258 total |
### Learning Rate Schedule

- Peak LR: 2e-5 (after warmup)
- Schedule: Cosine with restarts
- Final LR: ~1.1e-8
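This schedule can be approximated with the stock `get_cosine_with_hard_restarts_schedule_with_warmup` helper from transformers, using the warmup ratio and step count stated above. The number of restart cycles is an assumption; the card does not state it.

```python
from torch.optim import AdamW
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

total_steps = 4_258  # from the Training Environment table
# `model` is the BertForSequenceClassification being fine-tuned.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup ratio 0.1
    num_training_steps=total_steps,
    num_cycles=2,  # assumed; the restart count is not documented
)
```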
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hayatiali/turkish-safety"
# The tokenizer is the stock BERTurk tokenizer from the base model.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)[0].numpy()
    scores = {label: float(prob) for label, prob in zip(LABELS, probs)}
    primary = max(scores, key=scores.get)
    return {"category": primary, "confidence": scores[primary], "all_scores": scores}

print(predict("Vücudun çok güzel"))         # "Your body is very beautiful" (SEXUAL)
print(predict("Kimseye söyleme tamam mı"))  # "Don't tell anyone, okay" (GROOMING)
print(predict("Ölmek istiyorum"))           # "I want to die" (SELF_HARM)
print(predict("Bugün hava güzel"))          # "The weather is nice today" (SAFE)
```
### Production Class

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TurkishSafetyClassifier:
    LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

    def __init__(self, model_path="hayatiali/turkish-safety"):
        self.tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.device = (
            "cuda" if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available()
            else "cpu"
        )
        self.model.to(self.device).eval()

    def predict(self, text: str) -> dict:
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.sigmoid(logits)[0].cpu().numpy()
        scores = {label: float(p) for label, p in zip(self.LABELS, probs)}
        primary = max(scores, key=scores.get)
        return {
            "category": primary,
            "confidence": scores[primary],
            "scores": scores,
            "action": self._get_action(scores[primary], primary),
        }

    def _get_action(self, score: float, category: str) -> str:
        # High-risk categories block at stricter (lower) thresholds.
        if category in ["GROOMING", "SEXUAL", "SELF_HARM", "THREAT"]:
            if score > 0.5:
                return "hard_block"
            if score > 0.3:
                return "soft_block"
        if score > 0.75:
            return "hard_block"
        if score > 0.60:
            return "soft_block"
        if score > 0.45:
            return "flag"
        if score > 0.30:
            return "allow_log"
        return "allow"
```
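Example usage:

```python
clf = TurkishSafetyClassifier()
result = clf.predict("Kimseye söyleme tamam mı")  # "Don't tell anyone, okay"
print(result["category"], round(result["confidence"], 3), result["action"])
```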
### Batch Inference

```python
# Assumes tokenizer, model, and LABELS from the Quick Start snippet above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def predict_batch(texts: list, batch_size: int = 32) -> list:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, max_length=128, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.sigmoid(model(**inputs).logits).cpu().numpy()
        for prob in probs:
            scores = {label: float(p) for label, p in zip(LABELS, prob)}
            results.append(scores)
    return results
```
## Limitations & Known Issues

### ⚠️ Evaluation Limitations

Note: two separate evaluation sets exist:

1. **Automated Test Set**: 17,033 samples from test.csv → Macro F1: 0.9165, MCC: 0.9045
2. **Manual Edge Case Test**: 22 hand-picked samples → 86.4% accuracy (19/22 correct)

| Limitation | Details | Impact |
|---|---|---|
| Small Manual Test Set | Edge-case validation on only 22 samples (86.4%) | The manual test is not statistically significant; the automated metrics (17K samples) are more reliable |
| No Per-Class Metrics | Only Macro F1 and MCC are reported for the 17K test set | Individual category performance (e.g., SELF_HARM precision/recall vs. SAFE) cannot be assessed; a recipe for computing these locally follows this table |
| No Confusion Matrix | Category confusion patterns are not documented | It is unclear which categories are confused most often beyond the GROOMING/SEXUAL boundary |
| No PR/ROC Curves | Precision-recall and ROC analysis were not performed | The methodology behind the chosen thresholds is not documented |
| No Calibration Analysis | Model confidence calibration was not tested | It is unknown whether a confidence of 0.7 truly corresponds to a 70% probability |
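Anyone with access to the test split can recover per-class numbers locally. A minimal sketch with scikit-learn, assuming `y_true` is the (N, 7) binary label matrix and `y_prob` the matching sigmoid outputs (e.g., from `predict_batch` in the Usage section):

```python
import numpy as np
from sklearn.metrics import classification_report

LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

def per_class_report(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> str:
    """y_true: (N, 7) binary labels; y_prob: (N, 7) sigmoid outputs."""
    y_pred = (y_prob >= threshold).astype(int)
    return classification_report(y_true, y_pred, target_names=LABELS, zero_division=0)
```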
### ⚠️ Architectural Limitations

| Limitation | Details | Impact |
|---|---|---|
| Short Context Window | Max sequence length: 64 tokens | Long messages may lose critical information; truncation can remove key context |
| Single-Turn Only | No conversation-history analysis | GROOMING patterns often emerge across multiple messages: "Kaç yaşındasın?" ("How old are you?"), "Nerelisin?" ("Where are you from?"), "Fotoğraf atar mısın?" ("Will you send a photo?") may each look SAFE individually. A mitigation sketch follows this table |
| No Temporal Patterns | No escalation-detection capability | Behavior changes over time cannot be detected; user history is not considered |
| Static Analysis | Each message is analyzed independently | Contextual red flags from message sequences are not captured |
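A lightweight mitigation is to score a rolling window of recent messages alongside each individual message. The sketch below is one possible aggregation layer, reusing the `TurkishSafetyClassifier` from the Usage section; the window size and the risk-comparison rule are assumptions, not a documented design.

```python
from collections import deque

class ConversationMonitor:
    """Score each message alone and the last few messages joined together,
    so patterns spread across turns get a chance to surface."""

    def __init__(self, classifier, window: int = 5):  # window size is an assumption
        self.classifier = classifier                  # a TurkishSafetyClassifier
        self.history = deque(maxlen=window)

    def observe(self, message: str) -> dict:
        self.history.append(message)
        single = self.classifier.predict(message)
        windowed = self.classifier.predict(" ".join(self.history))

        # Keep whichever view puts more probability on a risk label. Note the
        # 64-token training limit: very long windows will be truncated.
        def risk(result):
            return max(v for k, v in result["scores"].items() if k != "SAFE")

        return windowed if risk(windowed) > risk(single) else single
```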
### ⚠️ Data & Coverage Limitations

| Limitation | Details | Impact |
|---|---|---|
| Dialect/Slang Gaps | Regional dialects and internet slang are underrepresented | Performance may degrade on informal contractions such as "napıon" ("whatcha doing"), "nbr" ("what's up"), "slm" ("hi"), "mrb" ("hello"), and regional variations |
| No Adversarial Testing | Evasion techniques were not systematically tested | Robustness is unknown against "S 3 x" instead of "sex", character substitution, and Unicode tricks |
| Synthetic Data Bias | 97.5% of the training data is LLM-generated | Real-world linguistic patterns may not be fully captured; potential distribution shift |
| Spelling Error Tolerance | Not explicitly tested | Common typos and intentional misspellings may bypass detection |
### ⚠️ Production Deployment Considerations

| Consideration | Details | Recommendation |
|---|---|---|
| Threshold Selection | The current thresholds (0.3, 0.5, 0.75) are heuristic | Perform PR-curve analysis for your specific use case and adjust to your FP/FN tolerance (see the sketch below) |
| Confidence Calibration | The model may be over- or under-confident | Consider temperature scaling or Platt calibration before production |
| Category Boundaries | The GROOMING ↔ SEXUAL boundary is a known issue | Review flagged content in these categories; implement human review for edge cases |
| Real-Time Context | No session-level analysis | Consider a sliding-window or conversation-aggregation layer (see the sketch under Architectural Limitations) |
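A sketch of the recommended PR-curve analysis: for each class, pick the lowest decision threshold that still reaches a target precision. The 0.95 target is an illustrative assumption; `y_true` and `y_prob` are as in the per-class metrics sketch above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

def pick_thresholds(y_true, y_prob, min_precision: float = 0.95) -> dict:
    """Per class: the lowest threshold whose precision meets the target."""
    chosen = {}
    for i, label in enumerate(LABELS):
        precision, _, thresholds = precision_recall_curve(y_true[:, i], y_prob[:, i])
        # precision[:-1] aligns element-wise with thresholds.
        ok = np.where(precision[:-1] >= min_precision)[0]
        chosen[label] = float(thresholds[ok[0]]) if ok.size else 0.5  # fallback
    return chosen
```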
### Not Suitable For

- Languages other than Turkish
- Adult content moderation (requires different domain expertise)
- Sole decision-making without human review in high-stakes situations
- Legal evidence or court proceedings
- Detection of sophisticated, multi-turn grooming attempts without an additional context layer
- Highly informal, slang-heavy communications without additional preprocessing
## Ethical Considerations

### Intended Use

- Social media content moderation
- Messaging platform safety filters
- Gaming chat moderation
- Community forum monitoring
- Parental control applications
- Research and educational purposes

### Risks

- **False Negatives**: May miss sophisticated grooming attempts
- **False Positives**: May flag benign content incorrectly
- **Automation Bias**: Over-reliance on model predictions

### Recommendations

- **Human Oversight**: Always combine with human review for critical decisions
- **Threshold Calibration**: Adjust thresholds based on your risk tolerance
- **Monitoring**: Track performance metrics in production
- **Regular Updates**: Retrain with new data periodically
- **Transparency**: Inform users about automated moderation
## Technical Specifications

### Model Architecture

```
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings
    (encoder): BertEncoder (12 layers)
    (pooler): BertPooler
  )
  (dropout): Dropout(p=0.1)
  (classifier): Linear(in_features=768, out_features=7)
)
```

- Total Parameters: ~110M
- Trainable Parameters: ~110M
### Input/Output

- **Input**: Turkish text (the model was trained with a max sequence length of 64 tokens; the usage examples truncate at 128, but content beyond the trained length may be handled less reliably)
- **Output**: 7-dimensional probability vector (sigmoid-activated)
- **Tokenizer**: BERTurk WordPiece (32k vocab)
## Citation

```bibtex
@misc{turkish-safety-2025,
  title={Turkish Safety - Content Moderation Classifier},
  author={SiriusAI Tech Brain Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hayatiali/turkish-safety}},
  note={Fine-tuned from dbmdz/bert-base-turkish-uncased, Macro F1: 0.9165}
}
```
## Model Card Authors

SiriusAI Tech Brain Team

## Contact

info@siriusaitech.com
## Changelog

### v5.0 (Current)

- Major dataset expansion: 85,161 samples (68,128 train / 17,033 test)
- Improved metrics: Macro F1 **0.9165**, MCC **0.9045**
- Hyperparameters optimized for the larger dataset (focal loss, cosine restarts)
- 67 subcategories across 7 main categories
- 86.4% validation accuracy on edge cases

### v4.0

- Initial production release
- 7-category multi-label content safety classification
- Macro F1: 0.9076, MCC: 0.8931
- Training on 30,596 samples
- Clear category boundary definitions (GROOMING vs SEXUAL)
- Optimized for real-time inference (<50 ms)
## License

**License**: SiriusAI Tech Premium License v1.0

**Commercial Use**: Requires a Premium License. Contact: info@siriusaitech.com

**Free use is allowed for:**

- Academic research and education
- Non-profit organizations (with approval)
- Evaluation (30 days)
**Disclaimer**: This model is designed for content moderation and safety applications. Always deploy it with appropriate safeguards and human oversight. Model predictions should inform decisions, not replace human judgment.