
	---
	language: tr
	tags:
	- toxicity
	- text-classification
	- turkish
	- transformers
	- bert
	license: mit
	datasets:
	- Overfit-GM/turkish-toxic-language
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: Turkish Toxic Language Detection Model
	  results:
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      name: Turkish Toxic Language Dataset
	      type: Overfit-GM/turkish-toxic-language
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.96
	    - name: F1
	      type: f1
	      value: 0.96
	    - name: Precision
	      type: precision
	      value: 0.96
	    - name: Recall
	      type: recall
	      value: 0.96
	---

	# 🇹🇷 Turkish Toxic Language Detection Model 🧠🔥

	This model is a fine-tuned version of [`dbmdz/bert-base-turkish-cased`](https://huggingface.co/dbmdz/bert-base-turkish-cased) for binary toxicity classification in Turkish text. It was trained using a cleaned and preprocessed version of the [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language) dataset.

	## 📊 Performance

	| Metric       | Non-Toxic | Toxic | Macro Avg |
	|--------------|-----------|-------|-----------|
	| Precision    | 0.96      | 0.95  | 0.96      |
	| Recall       | 0.95      | 0.96  | 0.96      |
	| F1-score     | 0.96      | 0.96  | 0.96      |
	| Accuracy     |           |       | 0.96      |
	| Test Samples | 5400      | 5414  | 10814     |

	### Confusion Matrix

	|                 | Pred: Non-Toxic | Pred: Toxic |
	|-----------------|-----------------|-------------|
	| True: Non-Toxic | 5154            | 246         |
	| True: Toxic     | 200             | 5214        |
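The per-class figures in the Performance table can be re-derived from the confusion matrix. The snippet below is a minimal sketch of that arithmetic; the helper name `prf` is only illustrative, and the counts are copied from the matrix above.

```python
# Counts copied from the confusion matrix (rows = true label, columns = prediction).
tn, fp = 5154, 246   # true Non-Toxic: predicted Non-Toxic / predicted Toxic
fn, tp = 200, 5214   # true Toxic:     predicted Non-Toxic / predicted Toxic


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Treat each class in turn as the "positive" class.
toxic = prf(tp, fp, fn)
non_toxic = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
macro = tuple((a + b) / 2 for a, b in zip(toxic, non_toxic))

for name, values in [("Non-Toxic", non_toxic), ("Toxic", toxic), ("Macro Avg", macro)]:
    print(name, [round(v, 2) for v in values])
print("Accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values agree with the table, and the row sums (5400 and 5414) match the reported test sample counts.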

	## 🧪 Preprocessing Details (cleaned_corrected_text)

	The model is trained on the `cleaned_corrected_text` column, which is derived from `corrected_text` using basic regex-based cleaning steps and manual slang filtering. Here's how:

	### 🔧 Cleaning Function

	```python
	import re

	def clean_corrected_text(text):
	    text = text.lower()
	    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # URL removal
	    text = re.sub(r"@\w+", '', text)  # remove @mentions
	    text = re.sub(r"[^\w\s.,!?-]", '', text)  # remove special characters (e.g., emojis)
	    text = re.sub(r"\s+", ' ', text).strip()  # normalize whitespace
	    return text
	```

	### 🧹 Manual Slang Filtering

	```python
	slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

	def remove_slang(text):
	    # str.replace removes every occurrence, including inside longer words.
	    for word in slang_words:
	        text = text.replace(word, "")
	    return text.strip()
	```
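Because `remove_slang` uses plain substring replacement, short entries such as "la" and "lan" are also removed from inside ordinary words. The sketch below contrasts that behavior with a hypothetical whole-word variant built on regex word boundaries. `remove_slang_whole_words` is not part of the original pipeline, and since the published model was trained on text cleaned with the substring version, swapping it in at inference time may change the inputs the model sees.

```python
import re

slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

# Hypothetical alternative: match slang only as whole words.
# For str patterns in Python 3, \b is a Unicode-aware word boundary.
_slang_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, slang_words)) + r")\b")


def remove_slang_whole_words(text):
    """Remove slang tokens only when they stand alone as words."""
    text = _slang_pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_slang_substring(text):
    """Same logic as the card's remove_slang, repeated here for comparison."""
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()


sample = "lan bu plan olacak mı kanka"
print(remove_slang_substring(sample))    # → bu p ocak mı
print(remove_slang_whole_words(sample))  # → bu plan olacak mı
```

The substring version turns "plan" into "p" and "olacak" into "ocak", while the whole-word version leaves both intact.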

	### ✅ Applied Steps Summary

	| Step                      | Description                                              |
	|---------------------------|----------------------------------------------------------|
	| Lowercasing               | All text is converted to lowercase                       |
	| URL removal               | Removes links containing http, www, https                |
	| Mention removal           | Removes @username style mentions                         |
	| Special character removal | Removes emojis and symbols (😊, *, %, $, ^, etc.)        |
	| Whitespace normalization  | Collapses multiple spaces into one                       |
	| Slang word removal        | Removes common informal words like "kanka", "lan", etc.  |

	📌 Conclusion: `cleaned_corrected_text` is a lightly cleaned text column; no linguistic processing is applied beyond the steps above. The model is trained directly on this column.
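Since the model was trained on cleaned text, it is reasonable to pass new inputs through the same steps before prediction. The sketch below chains the two functions from this section into one helper. The name `preprocess` and the order (regex cleaning first, slang removal last) are assumptions based on the order of the summary table, not something the card states explicitly.

```python
import re

slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]


def clean_corrected_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # URL removal
    text = re.sub(r"@\w+", "", text)                      # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", "", text)              # remove emojis and symbols
    return re.sub(r"\s+", " ", text).strip()              # normalize whitespace


def remove_slang(text):
    # Substring replacement, as in the card's original function.
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()


def preprocess(text):
    """Hypothetical wrapper. Assumed order: regex cleaning, then slang removal."""
    return remove_slang(clean_corrected_text(text))


print(preprocess("Bu Film ÇOK kötü @ali https://t.co/abc 😡"))  # → bu film çok kötü
```

The cleaned string can then be handed to the tokenizer in place of the raw text.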

	## 💡 Example Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
	model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")

	def predict_toxicity(text):
	    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
	    # Inference only, so skip gradient tracking.
	    with torch.no_grad():
	        outputs = model(**inputs)
	    # Class index 1 is Toxic, 0 is Non-Toxic.
	    predicted = torch.argmax(outputs.logits, dim=1).item()
	    return "Toxic" if predicted == 1 else "Non-Toxic"
	```
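`predict_toxicity` returns only a hard label. If a confidence score or a stricter decision threshold is useful, the two logits can be converted to a probability with a softmax. The sketch below does this in plain Python so it runs without the model. `toxic_probability`, `label_from_logits`, and the `threshold` parameter are illustrative names rather than part of the published code; with PyTorch, `torch.softmax(outputs.logits, dim=1)` computes the same quantity.

```python
import math


def toxic_probability(logits):
    """Softmax probability of class 1 (Toxic) from a pair of raw logits."""
    non_toxic_logit, toxic_logit = logits
    # Subtract the max before exponentiating for numerical stability.
    m = max(non_toxic_logit, toxic_logit)
    e0 = math.exp(non_toxic_logit - m)
    e1 = math.exp(toxic_logit - m)
    return e1 / (e0 + e1)


def label_from_logits(logits, threshold=0.5):
    """Hypothetical helper: label as Toxic only when the probability exceeds `threshold`."""
    return "Toxic" if toxic_probability(logits) > threshold else "Non-Toxic"


print(label_from_logits([-1.0, 2.0]))                 # → Toxic
print(label_from_logits([0.0, 1.0], threshold=0.8))   # → Non-Toxic
```

With the default threshold of 0.5 this gives the same label as `argmax`. In practice the logits would come from something like `outputs.logits[0].tolist()`.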

	## 🛠 Training Details

	- Trainer: Hugging Face `Trainer` API
	- Epochs: 3
	- Batch size: 16
	- Learning Rate: 2e-5
	- Eval Strategy: Epoch-based
	- Undersampling: Applied to balance class distribution
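The card does not say how the undersampling was carried out. One common approach is random undersampling, where larger classes are randomly reduced to the size of the smallest one. The sketch below illustrates that idea on toy data; the function name `undersample`, the `label_of` callback, and the seed are assumptions for illustration only.

```python
import random
from collections import defaultdict


def undersample(samples, label_of, seed=42):
    """Randomly drop samples so every class has as many as the rarest one."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for group in by_label.values():
        # Sample without replacement down to the minority-class size.
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 6 non-toxic rows (label 0) and 2 toxic rows (label 1).
rows = [("text %d" % i, 0) for i in range(6)] + [("toxic %d" % i, 1) for i in range(2)]
balanced = undersample(rows, label_of=lambda row: row[1])
print(len(balanced))  # → 4
```

An even final size, such as the 54068 samples reported for this model, is consistent with two equally sized classes after balancing.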

	## 📁 Dataset

	- Dataset used: [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language)
	- Final dataset size after preprocessing and balancing: 54068 samples