---
language: tr
tags:
- toxicity
- text-classification
- turkish
- transformers
- bert
license: mit
datasets:
- Overfit-GM/turkish-toxic-language
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: Turkish Toxic Language Detection Model
results:
- task:
type: text-classification
name: Text Classification
dataset:
name: Turkish Toxic Language Dataset
type: Overfit-GM/turkish-toxic-language
metrics:
- name: Accuracy
type: accuracy
value: 0.96
- name: F1
type: f1
value: 0.96
- name: Precision
type: precision
value: 0.96
- name: Recall
type: recall
value: 0.96
---
# πŸ‡ΉπŸ‡· Turkish Toxic Language Detection Model 🧠πŸ”₯
This model is a fine-tuned version of [`dbmdz/bert-base-turkish-cased`](https://huggingface.co/dbmdz/bert-base-turkish-cased) for binary toxicity classification in **Turkish** text. It was trained using a cleaned and preprocessed version of the [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language) dataset.
## πŸ“Š Performance
| Metric       | Non-Toxic | Toxic | Macro Avg / Total |
|--------------|-----------|-------|-------------------|
| Precision | 0.96 | 0.95 | 0.96 |
| Recall | 0.95 | 0.96 | 0.96 |
| F1-score | 0.96 | 0.96 | 0.96 |
| Accuracy | | | **0.96** |
| Test Samples | 5400 | 5414 | 10814 |
### Confusion Matrix
| | Pred: Non-Toxic | Pred: Toxic |
|---------------|-----------------|-------------|
| **True: Non-Toxic** | 5154 | 246 |
| **True: Toxic** | 200 | 5214 |
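The reported scores can be sanity-checked directly from the confusion matrix above:

```python
# Confusion-matrix counts from the table above
tn, fp = 5154, 246   # row "True: Non-Toxic"
fn, tp = 200, 5214   # row "True: Toxic"

total = tn + fp + fn + tp             # 10814 test samples
accuracy = (tn + tp) / total
precision_toxic = tp / (tp + fp)      # column "Pred: Toxic"
recall_toxic = tp / (tp + fn)         # row "True: Toxic"

print(round(accuracy, 2), round(precision_toxic, 2), round(recall_toxic, 2))
# prints: 0.96 0.95 0.96
```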
## πŸ§ͺ Preprocessing Details (cleaned_corrected_text)
The model is trained on the `cleaned_corrected_text` column, which is derived from `corrected_text` using basic regex-based cleaning steps and manual slang filtering. Here's how:
### πŸ”§ Cleaning Function
```python
import re

def clean_corrected_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # URL removal
    text = re.sub(r"@\w+", '', text)          # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)  # remove special characters (e.g., emojis)
    text = re.sub(r"\s+", ' ', text).strip()  # normalize whitespace
    return text
```
### 🧹 Manual Slang Filtering
```python
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def remove_slang(text):
    # Note: str.replace matches raw substrings, so these tokens are also
    # removed when they occur inside longer words.
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
```
### βœ… Applied Steps Summary
| Step | Description |
|------------------------|-------------|
| Lowercasing | All text is converted to lowercase |
| URL removal | Removes links containing http, www, https |
| Mention removal | Removes @username style mentions |
| Special character removal | Removes emojis and symbols (😊, *, %, $, ^, etc.) |
| Whitespace normalization | Collapses multiple spaces into one |
| Slang word removal | Removes common informal words like "kanka", "lan", etc. |
πŸ“Œ **Conclusion**: `cleaned_corrected_text` is a lightly cleaned, non-linguistically processed text column. The model is trained directly on this.
## πŸ’‘ Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")
model.eval()  # inference mode

def predict_toxicity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():  # no gradient tracking needed for inference
        outputs = model(**inputs)
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return "Toxic" if predicted == 1 else "Non-Toxic"
```
## πŸ›  Training Details
- **Trainer**: Hugging Face `Trainer` API
- **Epochs**: 3
- **Batch size**: 16
- **Learning Rate**: 2e-5
- **Eval Strategy**: Epoch-based
- **Undersampling**: Applied to balance class distribution
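The card does not include the undersampling code itself; a minimal sketch of random undersampling to the minority-class size could look like the following (the `(text, label)` pair format and the `undersample` helper are illustrative, not the actual training script):

```python
import random

def undersample(samples, seed=42):
    """Randomly downsample each class to the size of the smallest class.

    `samples` is a list of (text, label) pairs; the real column names
    in the dataset may differ.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    n_min = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n_min))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced data: 66 non-toxic (0) vs. 34 toxic (1) samples
data = [("ornek metin %d" % i, int(i % 3 == 0)) for i in range(100)]
balanced = undersample(data)
# Both classes now have equal counts (34 each)
```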
## πŸ“ Dataset
Dataset used: [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language)
Final dataset size after preprocessing and balancing: 54068 samples