---
language:
- tr
tags:
- text-classification
- bert
- offensive-language-detection
- turkish
- boun-tabilab
datasets:
- offenseval-tr
metrics:
- accuracy
- f1
model-index:
- name: atahanuz/bert-offensive-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: OffensEval-2020-TR
      type: offenseval-tr
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.936
    - name: F1
      type: f1
      value: 0.912
base_model: boun-tabilab/TabiBERT
---

# Turkish Offensive Language Classifier (BERT)

This model is a fine-tuned version of [**boun-tabilab/TabiBERT**](https://huggingface.co/boun-tabilab/TabiBERT), trained on the **OffensEval-2020-TR** dataset. It performs binary classification to detect offensive language in Turkish text.

## 📊 Model Details

| Feature | Description |
| :--- | :--- |
| **Model Architecture** | TabiBERT (a ModernBERT-based Turkish foundation model) |
| **Task** | Binary Text Classification (Offensive vs. Not Offensive) |
| **Language** | Turkish (tr) |
| **Dataset** | OffensEval 2020 (Turkish Subtask) |
| **Trained By** | atahanuz |

## 🚀 Usage

The easiest way to use this model is via the Hugging Face `pipeline`.

### Method 1: Using the Pipeline (Recommended)

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="atahanuz/bert-offensive-classifier")

# "This was a great movie, I liked it a lot."
text = "Bu harika bir filmdi, çok beğendim."
result = classifier(text)[0]

# The pipeline returns generic label names; LABEL_1 is the offensive class
label = "Offensive" if result['label'] == "LABEL_1" else "Not Offensive"
print(f"Prediction: {label} (Score: {result['score']:.4f})")
# Prediction: Not Offensive (Score: 1.0000)
```

### Method 2: Manual PyTorch Implementation

If you need more control over the tokens or logits, load the tokenizer and model directly with `AutoTokenizer` and `AutoModelForSequenceClassification`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# 1. Load model and tokenizer
model_name = "atahanuz/bert-offensive-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Define label mapping
id2label = {0: "NOT", 1: "OFF"}

# 3. Tokenize and predict
text = "Bu harika bir filmdi, çok beğendim."  # Example text
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 2)

# 4. Get results
predicted_class_id = logits.argmax(dim=-1).item()
predicted_label = id2label[predicted_class_id]
confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()

print(f"Text: {text}")
print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
```

## 🏷️ Label Mapping

The model outputs the following labels:

| Label ID | Label Name | Description |
| :--- | :--- | :--- |
| `0` | **NOT** | **Not Offensive** - Normal, non-hateful speech. |
| `1` | **OFF** | **Offensive** - Contains insults, threats, or inappropriate language. |

## 📝 Example Predictions

| Text | Label | Prediction |
| :--- | :--- | :--- |
| "Bu filmi çok beğendim, oyunculuklar harikaydı." | **NOT** | Not Offensive |
| "Beynini kullanmayı denesen belki anlarsın." | **OFF** | Offensive |
| "Maalesef bu konuda sana katılamıyorum." | **NOT** | Not Offensive |
| "Senin gibi aptal insanlar yüzünden bu haldeyiz." | **OFF** | Offensive |

## 📈 Performance

The model was evaluated on the test split of the OffensEval-2020-TR dataset (3,529 samples).

- **Accuracy:** `93.6%`
- **F1 Score:** `91.2%`

### Dataset Statistics

- **Training Samples:** 31,277
- **Test Samples:** 3,529

## ⚠️ Limitations and Bias

* **Context Sensitivity:** Like many BERT models, this classifier may struggle with sarcasm or with offensive language that depends heavily on context not present in the input sentence.
* **Dataset Bias:** The model is trained on social media data (OffensEval). It may reflect biases present in that dataset and may struggle with formal or archaic Turkish.
* **False Positives:** Certain colloquialisms or "tough love" expressions might be misclassified as offensive.

## 📚 Citation

If you use this model, please cite the TabiBERT paper and the OffensEval paper:

```bibtex
@misc{Türker2025Tabibert,
  title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish},
  author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
  year={2025},
  eprint={2512.23065},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.23065},
}
```

```bibtex
@inproceedings{zampieri-etal-2020-semeval,
  title = "{SemEval}-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({OffensEval} 2020)",
  author = "Zampieri, Marcos and Nakov, Preslav and Rosenthal, Sara and Atanasova, Pepa and Karadzhov, Georgi and Mubarak, Hamdy and Derczynski, Leon and Pitenis, Zeses and Çöltekin, Çağrı",
  booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
  year = "2020",
  publisher = "International Committee for Computational Linguistics",
}
```
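
## 🧮 Appendix: Recomputing the Metrics (Sketch)

The accuracy and F1 figures in the Performance section can be recomputed from gold labels and model predictions. This card does not state which F1 averaging was used, so the sketch below assumes macro-averaged F1 over the two classes (the official OffensEval metric). The function names and the toy label lists are illustrative only and are not part of the model's training or evaluation code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # By convention here, a class that never appears scores 0.0
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels for illustration only: 0 = NOT, 1 = OFF
gold = [0, 0, 1, 1, 0, 1]
pred = [0, 1, 1, 1, 0, 0]

print(f"Accuracy: {accuracy(gold, pred):.4f}")
print(f"Macro F1: {macro_f1(gold, pred):.4f}")
```

To score the real model, `pred` would be replaced by the class IDs predicted on the OffensEval-2020-TR test split and `gold` by that split's reference labels.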