---
language:
- tr
tags:
- text-classification
- bert
- offensive-language-detection
- turkish
- boun-tabilab
datasets:
- offenseval-tr
metrics:
- accuracy
- f1
model-index:
- name: atahanuz/bert-offensive-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: OffensEval-2020-TR
      type: offenseval-tr
    metrics:
      - name: Accuracy
        type: accuracy
        value: 0.936
      - name: F1
        type: f1
        value: 0.912
base_model: boun-tabilab/TabiBERT
---

# Turkish Offensive Language Classifier (BERT)

This model is a fine-tuned version of [**boun-tabilab/TabiBERT**](https://huggingface.co/boun-tabilab/TabiBERT) trained on the **OffensEval-2020-TR** dataset. It is designed to perform binary classification to detect offensive language in Turkish text.

## 📊 Model Details

| Feature | Description |
| :--- | :--- |
| **Model Architecture** | TabiBERT (a Turkish ModernBERT model) |
| **Task** | Binary Text Classification (Offensive vs. Not Offensive) |
| **Language** | Turkish (tr) |
| **Dataset** | OffensEval 2020 (Turkish Subtask) |
| **Trained By** | atahanuz |

## 🚀 Usage

The easiest way to use this model is via the Hugging Face `pipeline`.

### Method 1: Using the Pipeline (Recommended)

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="atahanuz/bert-offensive-classifier")

text = "Bu harika bir filmdi, çok beğendim."
result = classifier(text)[0]

label = "Offensive" if result['label'] == "LABEL_1" else "Not Offensive"

print(f"Prediction: {label} (Score: {result['score']:.4f})")

# Prediction: Not Offensive (Score: 1.0000)
```

### Method 2: Manual PyTorch Implementation

If you need more control over the tokens or logits, use the standard `AutoModelForSequenceClassification` approach:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# 1. Load model and tokenizer
model_name = "atahanuz/bert-offensive-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Define label mapping
id2label = {0: "NOT", 1: "OFF"}

# 3. Tokenize and predict
text = "Bu harika bir filmdi, çok beğendim." # Example text
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# 4. Get results
predicted_class_id = logits.argmax().item()
predicted_label = id2label[predicted_class_id]
confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()

print(f"Text: {text}")
print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
```

## 🏷️ Label Mapping

The model outputs the following labels:

| Label ID | Label Name | Description |
| :--- | :--- | :--- |
| `0` | **NOT** | **Not Offensive** - Normal, non-hateful speech. |
| `1` | **OFF** | **Offensive** - Contains insults, threats, or inappropriate language. |
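
Unless `id2label` is set in the model config, the `pipeline` returns the generic names `LABEL_0`/`LABEL_1` rather than `NOT`/`OFF`. A minimal sketch (assuming the generic names are returned, as in the pipeline example above) that maps them to the readable labels in this table:

```python
# Map the pipeline's generic label names to the readable names above.
# Assumes the model config does not already define id2label.
LABEL_NAMES = {"LABEL_0": "NOT", "LABEL_1": "OFF"}

def readable(result: dict) -> dict:
    """Translate a single pipeline result to a human-readable label."""
    return {
        "label": LABEL_NAMES.get(result["label"], result["label"]),
        "score": result["score"],
    }

print(readable({"label": "LABEL_1", "score": 0.98}))
# {'label': 'OFF', 'score': 0.98}
```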

## 📝 Example Predictions

| Text | Label | Meaning |
| :--- | :--- | :--- |
| "Bu filmi çok beğendim, oyunculuklar harikaydı." | **NOT** | Not Offensive |
| "Beynini kullanmayı denesen belki anlarsın." | **OFF** | Offensive |
| "Maalesef bu konuda sana katılamıyorum." | **NOT** | Not Offensive |
| "Senin gibi aptal insanlar yüzünden bu haldeyiz." | **OFF** | Offensive |

## 📈 Performance

The model was evaluated on the test split of the OffensEval-2020-TR dataset (approx. 3,500 samples).

- **Accuracy:** `93.6%`
- **F1 Score:** `91.2%`
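
For reference, both metrics can be recomputed from saved predictions without extra dependencies. A plain-Python sketch (the averaging behind the reported F1 is not stated here, so this assumes binary F1 on the `OFF = 1` class):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_f1(y_true, y_pred, positive=1):
    """F1 on the positive class (here OFF = 1)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = [1, 0, 1, 0, 1]  # toy gold labels, not the real test split
pred = [1, 0, 0, 0, 1]
print(accuracy(gold, pred), binary_f1(gold, pred))
# 0.8 0.8
```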

### Dataset Statistics
- **Training Samples:** 31,277
- **Test Samples:** 3,529

## ⚠️ Limitations and Bias

* **Context Sensitivity:** Like many BERT models, this classifier may struggle with sarcasm or offensive language that depends heavily on context not present in the input sentence.
* **Dataset Bias:** The model is trained on social media data (OffensEval). It may reflect biases present in that specific dataset or struggle with formal/archaic Turkish.
* **False Positives:** Certain colloquialisms or "tough love" expressions might be misclassified as offensive.

## 📚 Citation

If you use this model, please cite the TabiLAB model and OffensEval paper:

```bibtex
@misc{Türker2025Tabibert,
      title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish}, 
      author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
      year={2025},
      eprint={2512.23065},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.23065}, 
}
```

```bibtex
@inproceedings{zampieri-etal-2020-semeval,
    title = "{SemEval}-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({OffensEval} 2020)",
    author = "Zampieri, Marcos  and
      Nakov, Preslav  and
      Rosenthal, Sara  and
      Atanasova, Pepa  and
      Karadzhov, Georgi  and
      Mubarak, Hamdy  and
      Derczynski, Leon  and
      Pitenis, Zeses  and
      {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}",
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    year = "2020",
    publisher = "International Committee for Computational Linguistics",
}
```