---
language: tr
tags:
- toxicity
- text-classification
- turkish
- transformers
- bert
license: mit
datasets:
- Overfit-GM/turkish-toxic-language
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: Turkish Toxic Language Detection Model
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Turkish Toxic Language Dataset
      type: Overfit-GM/turkish-toxic-language
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.96
    - name: F1
      type: f1
      value: 0.96
    - name: Precision
      type: precision
      value: 0.96
    - name: Recall
      type: recall
      value: 0.96
---

# Turkish Toxic Language Detection Model

This model is a fine-tuned version of [`dbmdz/bert-base-turkish-cased`](https://huggingface.co/dbmdz/bert-base-turkish-cased) for binary toxicity classification in **Turkish** text. It was trained using a cleaned and preprocessed version of the [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language) dataset.

## Performance

| Metric       | Non-Toxic | Toxic | Macro Avg |
|--------------|-----------|-------|-----------|
| Precision    | 0.96      | 0.95  | 0.96      |
| Recall       | 0.95      | 0.96  | 0.96      |
| F1-score     | 0.96      | 0.96  | 0.96      |
| Accuracy     |           |       | **0.96**  |
| Test Samples | 5400      | 5414  | 10814     |

### Confusion Matrix

|                     | Pred: Non-Toxic | Pred: Toxic |
|---------------------|-----------------|-------------|
| **True: Non-Toxic** | 5154            | 246         |
| **True: Toxic**     | 200             | 5214        |

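The per-class scores in the Performance table can be recomputed from the confusion matrix. The following is a minimal sketch in plain Python; the helper name `precision_recall_f1` is introduced here for illustration only and is not part of the model's code.

```python
# Counts copied from the confusion matrix above.
tn, fp = 5154, 246   # true Non-Toxic: predicted Non-Toxic / predicted Toxic
fn, tp = 200, 5214   # true Toxic:     predicted Non-Toxic / predicted Toxic


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# For the Non-Toxic class the roles of the two error types are swapped.
toxic = precision_recall_f1(tp, fp, fn)
non_toxic = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print([round(x, 2) for x in non_toxic])  # [0.96, 0.95, 0.96]
print([round(x, 2) for x in toxic])      # [0.95, 0.96, 0.96]
print(round(accuracy, 2))                # 0.96
```

The rounded values agree with the reported table, and the row sums (5400 and 5414) match the listed test sample counts.
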
## Preprocessing Details (`cleaned_corrected_text`)

The model is trained on the `cleaned_corrected_text` column, which is derived from `corrected_text` using basic regex-based cleaning steps and manual slang filtering. Here's how:

### Cleaning Function

```python
import re


def clean_corrected_text(text):
    """Lowercase the text and strip URLs, mentions, and special characters."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # URL removal
    text = re.sub(r"@\w+", '', text)  # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)  # remove special characters (e.g., emojis)
    text = re.sub(r"\s+", ' ', text).strip()  # normalize whitespace
    return text
```

### Manual Slang Filtering

```python
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]


def remove_slang(text):
    """Remove every occurrence of each slang entry from the text."""
    # Note: str.replace matches substrings, so short entries such as "la"
    # are also removed from inside longer words.
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
```

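Since the model was trained on preprocessed text, it is reasonable to apply the same steps to new inputs before classification. The sketch below chains the two functions above in one self-contained block. The wrapper name `preprocess` and the order (regex cleaning first, then slang removal) are assumptions; the source does not state the exact order used.

```python
import re

SLANG_WORDS = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]


def clean_corrected_text(text):
    # Same regex steps as the cleaning function documented above.
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"[^\w\s.,!?-]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_slang(text):
    # Same substring-based removal as documented above.
    for word in SLANG_WORDS:
        text = text.replace(word, "")
    return text.strip()


def preprocess(text):
    # Assumed order: regex cleaning first, then slang removal.
    return remove_slang(clean_corrected_text(text))


print(preprocess("@ahmet Bugün hava çok güzel! https://example.com 😀"))
# bugün hava çok güzel!

# Substring matching also affects ordinary words: "la" is removed from "selam".
print(preprocess("Selam kanka"))
# sem
```

The second call shows a side effect worth knowing when comparing raw inputs with the training column: slang entries are removed wherever they occur, including inside other words.
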
### Applied Steps Summary

| Step                      | Description |
|---------------------------|-------------|
| Lowercasing               | All text is converted to lowercase |
| URL removal               | Removes links containing http, www, https |
| Mention removal           | Removes @username style mentions |
| Special character removal | Removes emojis and symbols (*, %, $, ^, etc.) |
| Whitespace normalization  | Collapses runs of whitespace into a single space and trims the ends |
| Slang word removal        | Removes common informal words like "kanka", "lan", etc. |

**Conclusion**: `cleaned_corrected_text` is a lightly cleaned text column with no linguistic processing applied. The model is trained directly on it.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")
model.eval()  # inference mode (disables dropout)


def predict_toxicity(text):
    # Tokenize to a fixed length of 128 tokens, truncating longer inputs
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Class index 1 is Toxic, index 0 is Non-Toxic
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return "Toxic" if predicted == 1 else "Non-Toxic"
```

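The usage example returns only a hard label. When a confidence score or a stricter decision threshold is wanted, the two logits can be turned into a probability with a softmax. The sketch below uses plain Python and assumes, as in the example above, that index 1 is the Toxic class. The function names and the `threshold` parameter are hypothetical and not part of the model's API.

```python
import math


def toxicity_probability(logits):
    """Softmax over the two class logits; returns P(Toxic), i.e. index 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    return exps[1] / sum(exps)


def label_with_threshold(logits, threshold=0.5):
    """Return (label, probability); label is Toxic only above the threshold."""
    prob = toxicity_probability(logits)
    label = "Toxic" if prob > threshold else "Non-Toxic"
    return label, prob


# Illustrative logits, not real model output.
label, prob = label_with_threshold([-1.0, 2.0], threshold=0.9)
print(label, round(prob, 3))
```

With the model loaded as in the usage example, the input list could be obtained with `outputs.logits[0].tolist()`. At the default threshold of 0.5 the label agrees with the `argmax` decision whenever the two logits differ; raising the threshold trades recall on toxic text for fewer false positives.
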
## Training Details

- **Trainer**: Hugging Face `Trainer` API
- **Epochs**: 3
- **Batch size**: 16
- **Learning Rate**: 2e-5
- **Eval Strategy**: Epoch-based
- **Undersampling**: Applied to balance class distribution

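The list above does not say how undersampling was carried out. One common approach is random undersampling, which drops examples from the larger class until every class has as many as the smallest one. The sketch below illustrates that approach on toy data; the function name, the `label` key, and the seed are assumptions for illustration and may differ from what was actually used.

```python
import random
from collections import defaultdict


def undersample(samples, label_key="label", seed=42):
    """Randomly keep the same number of examples for every class."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[label_key]].append(sample)

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid class-ordered output
    return balanced


# Toy data: 10 non-toxic (0) and 4 toxic (1) examples.
data = [{"text": f"ok {i}", "label": 0} for i in range(10)]
data += [{"text": f"bad {i}", "label": 1} for i in range(4)]

balanced = undersample(data)
print(len(balanced))  # 8
```
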
## Dataset

Dataset used: [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language)

Final dataset size after preprocessing and balancing: 54068 samples