metadata
language:
- tr
tags:
- text-classification
- bert
- offensive-language-detection
- turkish
- boun-tabilab
datasets:
- offenseval-tr
metrics:
- accuracy
- f1
model-index:
- name: atahanuz/bert-offensive-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: OffensEval-2020-TR
      type: offenseval-tr
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.936
    - name: F1
      type: f1
      value: 0.912
base_model: boun-tabilab/TabiBERT
Turkish Offensive Language Classifier (BERT)
This model is a fine-tuned version of boun-tabilab/TabiBERT, trained on the OffensEval-2020-TR dataset. It performs binary classification of Turkish text as offensive or not offensive.
📊 Model Details
| Feature | Description |
|---|---|
| Model Architecture | TabiBERT (ModernBERT-based Turkish foundation model) |
| Task | Binary Text Classification (Offensive vs. Not Offensive) |
| Language | Turkish (tr) |
| Dataset | OffensEval 2020 (Turkish Subtask) |
| Trained By | atahanuz |
🚀 Usage
The easiest way to use this model is via the Hugging Face pipeline.
Method 1: Using the Pipeline (Recommended)
from transformers import pipeline
classifier = pipeline("text-classification", model="atahanuz/bert-offensive-classifier")
text = "Bu harika bir filmdi, çok beğendim."
result = classifier(text)[0]
label = "Offensive" if result['label'] == "LABEL_1" else "Not Offensive"
print(f"Prediction: {label} (Score: {result['score']:.4f})")
# Prediction: Not Offensive (Score: 1.0000)
Method 2: Manual PyTorch Implementation
If you need more control over the tokens or logits, use the standard AutoModel approach:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# 1. Load model and tokenizer
model_name = "atahanuz/bert-offensive-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# 2. Define label mapping
id2label = {0: "NOT", 1: "OFF"}
# 3. Tokenize and predict
text = "Bu harika bir filmdi, çok beğendim." # Example text
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# 4. Get results
predicted_class_id = logits.argmax(dim=-1).item()  # argmax over the class dimension
predicted_label = id2label[predicted_class_id]
confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()
print(f"Text: {text}")
print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
🏷️ Label Mapping
The model outputs the following labels:
| Label ID | Label Name | Description |
|---|---|---|
| 0 | NOT | Not Offensive - Normal, non-hateful speech. |
| 1 | OFF | Offensive - Contains insults, threats, or inappropriate language. |
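The two usage examples name the classes differently: the pipeline example checks for `LABEL_1`, while the manual example maps IDs to `NOT`/`OFF`. A minimal sketch of a helper that normalizes both forms to the names in the table follows. `to_label_name`, `ID2LABEL`, and `DESCRIPTIONS` are hypothetical names introduced for illustration, and the `LABEL_<id>` format is assumed from the pipeline example; if the hosted config defines its own label strings, the pipeline may return those instead.

```python
# Mapping taken from the label table; the helper itself is illustrative.
ID2LABEL = {0: "NOT", 1: "OFF"}
DESCRIPTIONS = {"NOT": "Not Offensive", "OFF": "Offensive"}


def to_label_name(raw) -> str:
    """Normalize 1, "1", "LABEL_1" or "OFF" to the label name "NOT" or "OFF"."""
    if isinstance(raw, str):
        text = raw.strip().upper()
        if text in DESCRIPTIONS:
            return text  # already a label name
        if text.startswith("LABEL_"):
            text = text[len("LABEL_"):]  # assumed pipeline format: LABEL_<id>
        raw = int(text)  # raises ValueError for unrecognized strings
    return ID2LABEL[raw]  # raises KeyError for IDs other than 0 and 1


print(to_label_name("LABEL_1"))                # OFF
print(DESCRIPTIONS[to_label_name("LABEL_0")])  # Not Offensive
```

With a helper like this, `to_label_name(result["label"])` could replace the string comparison in the pipeline example.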
📝 Example Predictions
| Text | Label | Prediction |
|---|---|---|
| "Bu filmi çok beğendim, oyunculuklar harikaydı." | NOT | Not Offensive |
| "Beynini kullanmayı denesen belki anlarsın." | OFF | Offensive |
| "Maalesef bu konuda sana katılamıyorum." | NOT | Not Offensive |
| "Senin gibi aptal insanlar yüzünden bu haldeyiz." | OFF | Offensive |
📈 Performance
The model was evaluated on the test split of the OffensEval-2020-TR dataset (3,529 samples).
- Accuracy: 93.6%
- F1 Score: 91.2%
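For reference, the two reported metrics can be computed from gold and predicted label IDs as sketched here. The card does not state which F1 variant was reported, so the sketch shows both the F1 of the OFF class and the macro average over both classes; treat the choice as an assumption. The function names and the toy labels are illustrative only and are not taken from the evaluation.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty evaluation set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy labels for illustration only (0 = NOT, 1 = OFF).
gold = [0, 0, 0, 1, 1, 0]
pred = [0, 0, 1, 1, 0, 0]
print(f"accuracy={accuracy(gold, pred):.3f}")         # accuracy=0.667
print(f"f1_off={f1_for_class(gold, pred, 1):.3f}")    # f1_off=0.500
print(f"macro_f1={macro_f1(gold, pred):.3f}")         # macro_f1=0.625
```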
Dataset Statistics
- Training Samples: 31,277
- Test Samples: 3,529
⚠️ Limitations and Bias
- Context Sensitivity: Like many BERT models, this classifier may struggle with sarcasm or offensive language that depends heavily on context not present in the input sentence.
- Dataset Bias: The model is trained on social media data (OffensEval). It may reflect biases present in that specific dataset or struggle with formal/archaic Turkish.
- False Positives: Certain colloquialisms or "tough love" expressions might be misclassified as offensive.
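One generic way to reduce false positives is to predict OFF only when its softmax probability clears a threshold, instead of taking the argmax. The sketch below is not part of the model card's method; `classify_with_threshold` is a hypothetical helper, and the default threshold of 0.8 is an arbitrary placeholder that would need tuning on held-out data. It assumes index 1 of the logits corresponds to OFF, as in the label mapping.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_threshold(
    logits: Sequence[float], off_threshold: float = 0.8
) -> Tuple[str, float]:
    """Return ("OFF" or "NOT", P(OFF)); OFF only if P(OFF) >= off_threshold."""
    p_off = softmax(logits)[1]  # assumption: index 1 is the OFF class
    label = "OFF" if p_off >= off_threshold else "NOT"
    return label, p_off


# Made-up logits for illustration; real values would come from the model.
print(classify_with_threshold([0.0, 1.0])[0])        # NOT (P(OFF) is about 0.73)
print(classify_with_threshold([0.0, 1.0], 0.5)[0])   # OFF
print(classify_with_threshold([-2.0, 3.0])[0])       # OFF
```

Raising the threshold lowers false positives at the cost of missing more offensive text, so the value should be chosen against the error type that matters more for the application.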
📚 Citation
If you use this model, please cite the TabiBERT paper and the OffensEval paper:
@misc{Türker2025Tabibert,
title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish},
author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
year={2025},
eprint={2512.23065},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.23065},
}
@inproceedings{zampieri-etal-2020-semeval,
title = "{SemEval}-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({OffensEval} 2020)",
author = "Zampieri, Marcos and
Nakov, Preslav and
Rosenthal, Sara and
Atanasova, Pepa and
Karadzhov, Georgi and
Mubarak, Hamdy and
Derczynski, Leon and
Pitenis, Zeses and
Çöltekin, Çağrı",
booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
year = "2020",
publisher = "International Committee for Computational Linguistics",
}