---
language: tr
tags:
- toxicity
- text-classification
- turkish
- transformers
- bert
license: mit
datasets:
- Overfit-GM/turkish-toxic-language
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: Turkish Toxic Language Detection Model
results:
- task:
type: text-classification
name: Text Classification
dataset:
name: Turkish Toxic Language Dataset
type: Overfit-GM/turkish-toxic-language
metrics:
- name: Accuracy
type: accuracy
value: 0.96
- name: F1
type: f1
value: 0.96
- name: Precision
type: precision
value: 0.96
- name: Recall
type: recall
value: 0.96
---
# Turkish Toxic Language Detection Model
This model is a fine-tuned version of [`dbmdz/bert-base-turkish-cased`](https://huggingface.co/dbmdz/bert-base-turkish-cased) for binary toxicity classification in **Turkish** text. It was trained using a cleaned and preprocessed version of the [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language) dataset.
## Performance
| Metric | Non-Toxic | Toxic | Macro Avg |
|--------------|-----------|-------|-----------|
| Precision | 0.96 | 0.95 | 0.96 |
| Recall | 0.95 | 0.96 | 0.96 |
| F1-score | 0.96 | 0.96 | 0.96 |
| Accuracy | | | **0.96** |
| Test Samples | 5400 | 5414 | 10814 |
### Confusion Matrix
| | Pred: Non-Toxic | Pred: Toxic |
|---------------|-----------------|-------------|
| **True: Non-Toxic** | 5154 | 246 |
| **True: Toxic** | 200 | 5214 |
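The per-class scores in the table above follow directly from these confusion-matrix counts. A quick sanity check in plain Python:

```python
# Confusion-matrix counts from the table above.
tn, fp = 5154, 246   # true non-toxic: predicted non-toxic / predicted toxic
fn, tp = 200, 5214   # true toxic:     predicted non-toxic / predicted toxic

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_toxic = tp / (tp + fp)
recall_toxic = tp / (tp + fn)
f1_toxic = 2 * precision_toxic * recall_toxic / (precision_toxic + recall_toxic)

print(round(accuracy, 2), round(precision_toxic, 2),
      round(recall_toxic, 2), round(f1_toxic, 2))
# → 0.96 0.95 0.96 0.96
```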
## Preprocessing Details (cleaned_corrected_text)
The model is trained on the `cleaned_corrected_text` column, which is derived from `corrected_text` using basic regex-based cleaning steps and manual slang filtering. Here's how:
### Cleaning Function
```python
import re

def clean_corrected_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # URL removal
    text = re.sub(r"@\w+", '', text)  # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)  # remove special characters (e.g., emojis)
    text = re.sub(r"\s+", ' ', text).strip()  # normalize whitespace
    return text
```
### Manual Slang Filtering
```python
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def remove_slang(text):
    # Note: str.replace matches substrings, so these words are removed
    # even when they occur inside longer words.
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
```
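Because `str.replace` matches substrings, a short entry like `"la"` is also stripped out of longer words. If that behavior is not desired, a word-boundary-safe variant (illustrative only; not what was used for training) could look like:

```python
import re

slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]
slang_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, slang_words)) + r")\b")

def remove_slang_safe(text):
    # Only removes whole words, so "la" inside e.g. "plan" is kept.
    text = slang_pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_slang_safe("plan yap lan kanka"))  # → plan yap
```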
### Applied Steps Summary
| Step | Description |
|------------------------|-------------|
| Lowercasing | All text is converted to lowercase |
| URL removal | Removes links containing http, www, https |
| Mention removal | Removes @username style mentions |
| Special character removal | Removes emojis and symbols (*, %, $, ^, etc.) |
| Whitespace normalization | Collapses multiple spaces into one |
| Slang word removal | Removes common informal words like "kanka", "lan", etc. |
**Conclusion**: `cleaned_corrected_text` is a lightly cleaned text column with no further linguistic processing. The model is trained directly on it.
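Putting the steps above together, the full preprocessing pass can be sketched as a single function (a reconstruction from the table and snippets above, not the exact training script):

```python
import re

slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def preprocess(text):
    # Mirrors the steps in the table above: lowercase, strip URLs and
    # mentions, drop special characters, remove slang, normalize whitespace.
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"[^\w\s.,!?-]", "", text)
    for word in slang_words:
        text = text.replace(word, "")
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Kanka şu linke bak http://example.com @ali"))  # → şu linke bak
```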
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")
model.eval()

def predict_toxicity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return "Toxic" if predicted == 1 else "Non-Toxic"

print(predict_toxicity("Bu harika bir gün!"))
```
## Training Details
- **Trainer**: Hugging Face `Trainer` API
- **Epochs**: 3
- **Batch size**: 16
- **Learning Rate**: 2e-5
- **Eval Strategy**: Epoch-based
- **Undersampling**: Applied to balance class distribution
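The undersampling step can be illustrated with a minimal sketch (illustrative only, not the exact training code): every class is randomly trimmed to the size of the smallest one.

```python
import random

def undersample(samples, labels, seed=42):
    # Group indices by label, then trim every class to the size of
    # the smallest one so the class distribution is balanced.
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_label.values())
    rng = random.Random(seed)
    kept = []
    for idx in by_label.values():
        kept.extend(rng.sample(idx, n_min))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]

texts = ["a", "b", "c", "d", "e"]
ys = [0, 0, 0, 1, 1]
X, y = undersample(texts, ys)
print(sorted(y))  # → [0, 0, 1, 1]
```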
## Dataset
Dataset used: [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language)
Final dataset size after preprocessing and balancing: 54068 samples