---
language: tr
tags:
  - toxicity
  - text-classification
  - turkish
  - transformers
  - bert
license: mit
datasets:
  - Overfit-GM/turkish-toxic-language
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: Turkish Toxic Language Detection Model
    results:
      - task:
          type: text-classification
          name: Text Classification
        dataset:
          name: Turkish Toxic Language Dataset
          type: Overfit-GM/turkish-toxic-language
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.96
          - name: F1
            type: f1
            value: 0.96
          - name: Precision
            type: precision
            value: 0.96
          - name: Recall
            type: recall
            value: 0.96
---

# 🇹🇷 Turkish Toxic Language Detection Model 🧠🔥

This model is a fine-tuned version of [`dbmdz/bert-base-turkish-cased`](https://huggingface.co/dbmdz/bert-base-turkish-cased) for binary toxicity classification in **Turkish** text. It was trained using a cleaned and preprocessed version of the [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language) dataset. 

## 📊 Performance

| Metric       | Non-Toxic | Toxic | Macro Avg |
|--------------|-----------|-------|-----------|
| Precision    | 0.96      | 0.95  | 0.96      |
| Recall       | 0.95      | 0.96  | 0.96      |
| F1-score     | 0.96      | 0.96  | 0.96      |
| Accuracy     |           |       | **0.96**  |
| Test Samples | 5400      | 5414  | 10814     |

### Confusion Matrix

|               | Pred: Non-Toxic | Pred: Toxic |
|---------------|-----------------|-------------|
| **True: Non-Toxic** | 5154            | 246         |
| **True: Toxic**     | 200             | 5214        |
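
The headline metrics can be reproduced from the confusion-matrix counts; a quick arithmetic check in plain Python:

```python
# Counts from the confusion matrix above
tn, fp = 5154, 246   # true Non-Toxic row
fn, tp = 200, 5214   # true Toxic row

accuracy = (tn + tp) / (tn + fp + fn + tp)   # 0.96
precision_toxic = tp / (tp + fp)             # 0.95
recall_toxic = tp / (tp + fn)                # 0.96
```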

## 🧪 Preprocessing Details (`cleaned_corrected_text`)

The model is trained on the `cleaned_corrected_text` column, which is derived from `corrected_text` using basic regex-based cleaning steps and manual slang filtering. Here's how:

### 🔧 Cleaning Function

```python
import re

def clean_corrected_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # remove URLs
    text = re.sub(r"@\w+", '', text)  # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)  # remove special characters (e.g., emojis)
    text = re.sub(r"\s+", ' ', text).strip()  # normalize whitespace
    return text
```

### 🧹 Manual Slang Filtering

```python
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def remove_slang(text):
    # Note: plain substring replacement, so a slang term is also stripped
    # when it appears inside a longer word (e.g. "lan" inside "planlı").
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
```

### ✅ Applied Steps Summary

| Step                  | Description |
|------------------------|-------------|
| Lowercasing            | All text is converted to lowercase |
| URL removal            | Removes links containing http, www, https |
| Mention removal        | Removes @username style mentions |
| Special character removal | Removes emojis and symbols (😊, *, %, $, ^, etc.) |
| Whitespace normalization | Collapses multiple spaces into one |
| Slang word removal     | Removes common informal words like "kanka", "lan", etc. |

📌 **Conclusion**: `cleaned_corrected_text` is a lightly cleaned text column with no further linguistic processing; the model is trained directly on it.
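
Putting the two steps together, the full preprocessing pass is simply the composition of the functions above (repeated here so the snippet runs standalone; the sample sentence is hypothetical):

```python
import re

slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def preprocess(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # URLs
    text = re.sub(r"@\w+", '', text)                     # @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)             # emojis / symbols
    text = re.sub(r"\s+", ' ', text).strip()             # whitespace
    for word in slang_words:                             # slang (substring match)
        text = text.replace(word, "")
    return text.strip()

print(preprocess("Kanka NASILSIN?? 😊 https://example.com @ali"))
# → nasilsin??
```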

## 💡 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")
model.eval()  # disable dropout for inference

def predict_toxicity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():  # no gradients needed at inference time
        outputs = model(**inputs)
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return "Toxic" if predicted == 1 else "Non-Toxic"
```
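
If a confidence score is useful alongside the label, the logits can be passed through a softmax. A small sketch (the function name `predict_with_confidence` is my own; it assumes a tokenizer and model loaded as above):

```python
import torch
import torch.nn.functional as F

def predict_with_confidence(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits, dim=1).squeeze()  # class probabilities
    label = "Toxic" if probs[1] > probs[0] else "Non-Toxic"
    return label, probs.max().item()
```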

## 🛠 Training Details

- **Trainer**: Hugging Face `Trainer` API
- **Epochs**: 3
- **Batch size**: 16
- **Learning Rate**: 2e-5
- **Eval Strategy**: Epoch-based
- **Undersampling**: Applied to balance class distribution
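
The undersampling step itself is not shown in the card; a minimal stdlib-only sketch of the idea (random downsampling of the majority class; the helper name `undersample` is hypothetical):

```python
import random

def undersample(texts, labels, seed=42):
    """Randomly drop majority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    n = min(len(group) for group in by_class.values())  # minority-class size
    pairs = [(t, label) for label, group in by_class.items()
             for t in rng.sample(group, n)]
    rng.shuffle(pairs)
    balanced_texts, balanced_labels = map(list, zip(*pairs))
    return balanced_texts, balanced_labels
```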

## πŸ“ Dataset

Dataset used: [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language)  
Final dataset size after preprocessing and balancing: 54068 samples