---
language:
- tr
tags:
- text-classification
- bert
- offensive-language-detection
- turkish
- boun-tabilab
datasets:
- offenseval-tr
metrics:
- accuracy
- f1
model-index:
- name: atahanuz/bert-offensive-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: OffensEval-2020-TR
      type: offenseval-tr
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.936
    - name: F1
      type: f1
      value: 0.912
base_model: boun-tabilab/TabiBERT
---
# Turkish Offensive Language Classifier (BERT)
This model is [**boun-tabilab/TabiBERT**](https://huggingface.co/boun-tabilab/TabiBERT) fine-tuned on the **OffensEval-2020-TR** dataset. It performs binary classification to detect offensive language in Turkish text.
## 📊 Model Details
| Feature | Description |
| :--- | :--- |
| **Model Architecture** | TabiBERT (a ModernBERT-based encoder for Turkish, per the TabiBERT paper cited below) |
| **Task** | Binary Text Classification (Offensive vs. Not Offensive) |
| **Language** | Turkish (tr) |
| **Dataset** | OffensEval 2020 (Turkish Subtask) |
| **Trained By** | atahanuz |
## 🚀 Usage
The easiest way to use this model is via the Hugging Face `pipeline`.
### Method 1: Using the Pipeline (Recommended)
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="atahanuz/bert-offensive-classifier")
text = "Bu harika bir filmdi, çok beğendim."
result = classifier(text)[0]
label = "Offensive" if result['label'] == "LABEL_1" else "Not Offensive"
print(f"Prediction: {label} (Score: {result['score']:.4f})")
# Prediction: Not Offensive (Score: 1.0000)
```
### Method 2: Manual PyTorch Implementation
If you need more control over the tokens or logits, load the tokenizer and model directly with `AutoTokenizer` and `AutoModelForSequenceClassification`:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# 1. Load model and tokenizer
model_name = "atahanuz/bert-offensive-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# 2. Define label mapping
id2label = {0: "NOT", 1: "OFF"}
# 3. Tokenize and predict
text = "Bu harika bir filmdi, çok beğendim." # Example text
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# 4. Get results
predicted_class_id = logits.argmax(dim=-1).item()
predicted_label = id2label[predicted_class_id]
confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()
print(f"Text: {text}")
print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
```
## 🏷️ Label Mapping
The model's class IDs map to the following labels (the pipeline reports them as `LABEL_0` and `LABEL_1`, as in Method 1 above):
| Label ID | Label Name | Description |
| :--- | :--- | :--- |
| `0` | **NOT** | **Not Offensive** - Normal, non-hateful speech. |
| `1` | **OFF** | **Offensive** - Contains insults, threats, or inappropriate language. |
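
Because the pipeline example compares against the raw string `LABEL_1`, it can help to convert raw labels to the names in this table in one place. The following is a minimal sketch, not part of the model repository: `normalize_label` and `ID2LABEL` are hypothetical names introduced for illustration, and the code assumes raw labels follow the `LABEL_<id>` pattern.

```python
# Hypothetical helper (not part of the model repo): maps raw pipeline labels
# such as "LABEL_1" to the label names listed in the table above.
ID2LABEL = {0: "NOT", 1: "OFF"}


def normalize_label(raw_label: str) -> str:
    """Return "NOT" or "OFF" for a raw label like "LABEL_0" or "LABEL_1"."""
    # Labels that are already normalized pass through unchanged.
    if raw_label in ID2LABEL.values():
        return raw_label
    prefix = "LABEL_"
    if not raw_label.startswith(prefix):
        raise ValueError(f"Unexpected label format: {raw_label!r}")
    class_id = int(raw_label[len(prefix):])  # raises ValueError on non-numeric ids
    if class_id not in ID2LABEL:
        raise ValueError(f"Unknown class id: {class_id}")
    return ID2LABEL[class_id]


print(normalize_label("LABEL_1"))  # OFF
print(normalize_label("LABEL_0"))  # NOT
```

Passing each `result['label']` through such a helper keeps the `NOT`/`OFF` convention consistent between the pipeline and the manual PyTorch path.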
## 📝 Example Predictions
| Text | Predicted Label | Meaning |
| :--- | :--- | :--- |
| "Bu filmi çok beğendim, oyunculuklar harikaydı." | **NOT** | Not Offensive |
| "Beynini kullanmayı denesen belki anlarsın." | **OFF** | Offensive |
| "Maalesef bu konuda sana katılamıyorum." | **NOT** | Not Offensive |
| "Senin gibi aptal insanlar yüzünden bu haldeyiz." | **OFF** | Offensive |
## 📈 Performance
The model was evaluated on the test split of the OffensEval-2020-TR dataset (3,529 samples).
- **Accuracy:** `93.6%`
- **F1 Score:** `91.2%`
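
This card does not state how the F1 score was averaged, so the following is only a sketch of how such metrics are typically computed from gold and predicted class IDs. The function names are illustrative, and both choices shown here (treating `1`/OFF as the positive class, or averaging over both classes as macro-F1) are assumptions rather than a description of the evaluation script actually used.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 score treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy data for illustration only (0 = NOT, 1 = OFF); not real model output.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
print(f"accuracy: {accuracy(gold, pred):.3f}")            # accuracy: 0.667
print(f"F1 (OFF): {f1_for_class(gold, pred, 1):.3f}")     # F1 (OFF): 0.667
print(f"macro-F1: {macro_f1(gold, pred):.3f}")            # macro-F1: 0.667
```

On an imbalanced test set, binary F1 for the OFF class and macro-F1 can differ noticeably, which is why the averaging mode matters when comparing against the reported score.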
### Dataset Statistics
- **Training Samples:** 31,277
- **Test Samples:** 3,529
## ⚠️ Limitations and Bias
* **Context Sensitivity:** Like many BERT models, this classifier may struggle with sarcasm or offensive language that depends heavily on context not present in the input sentence.
* **Dataset Bias:** The model is trained on social media data (OffensEval). It may reflect biases present in that specific dataset or struggle with formal/archaic Turkish.
* **False Positives:** Certain colloquialisms or "tough love" expressions might be misclassified as offensive.
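
One common way to trade some recall for fewer false positives is to flag text as offensive only when the OFF probability clears a threshold, instead of taking the argmax. The sketch below operates on a pair of raw logits ordered `[NOT, OFF]`, such as one row of the logits produced in Method 2. `flag_offensive` and the default threshold of `0.9` are illustrative assumptions; a suitable threshold would need to be tuned on held-out data.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_offensive(logits: Sequence[float], threshold: float = 0.9) -> Tuple[str, float]:
    """Return ("OFF", p_off) only if P(OFF) >= threshold, else ("NOT", p_off).

    Assumes `logits` is ordered [NOT, OFF], matching the label mapping
    in this card. The threshold value is an illustrative assumption.
    """
    if len(logits) != 2:
        raise ValueError("Expected exactly two logits: [NOT, OFF]")
    p_off = softmax(logits)[1]
    label = "OFF" if p_off >= threshold else "NOT"
    return label, p_off


# Made-up logits for illustration; not real model output.
print(flag_offensive([-3.0, 3.0])[0])                 # OFF
print(flag_offensive([0.0, 1.0])[0])                  # NOT (P(OFF) is about 0.73)
print(flag_offensive([0.0, 1.0], threshold=0.5)[0])   # OFF
```

With a tensor from Method 2, the same function could be called as `flag_offensive(logits[0].tolist())`. Raising the threshold reduces false positives at the cost of missing some genuinely offensive text.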
## 📚 Citation
If you use this model, please cite the TabiBERT paper and the OffensEval 2020 paper:
```bibtex
@misc{Türker2025Tabibert,
title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish},
author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
year={2025},
eprint={2512.23065},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.23065},
}
```
```bibtex
@inproceedings{zampieri-etal-2020-semeval,
title = "{SemEval}-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({OffensEval} 2020)",
author = {Zampieri, Marcos and
  Nakov, Preslav and
  Rosenthal, Sara and
  Atanasova, Pepa and
  Karadzhov, Georgi and
  Mubarak, Hamdy and
  Derczynski, Leon and
  Pitenis, Zeses and
  {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
year = "2020",
publisher = "International Committee for Computational Linguistics",
}
```