---
language: tr
tags:
- toxicity
- text-classification
- turkish
- transformers
- bert
license: mit
datasets:
- Overfit-GM/turkish-toxic-language
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: Turkish Toxic Language Detection Model
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Turkish Toxic Language Dataset
      type: Overfit-GM/turkish-toxic-language
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.96
    - name: F1
      type: f1
      value: 0.96
    - name: Precision
      type: precision
      value: 0.96
    - name: Recall
      type: recall
      value: 0.96
---

# 🇹🇷 Turkish Toxic Language Detection Model 🧠🔥

This model is a fine-tuned version of [`dbmdz/bert-base-turkish-cased`](https://huggingface.co/dbmdz/bert-base-turkish-cased) for binary toxicity classification of **Turkish** text.
It was trained on a cleaned and preprocessed version of the [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language) dataset.

## 📊 Performance

| Metric       | Non-Toxic | Toxic | Macro Avg |
|--------------|-----------|-------|-----------|
| Precision    | 0.96      | 0.95  | 0.96      |
| Recall       | 0.95      | 0.96  | 0.96      |
| F1-score     | 0.96      | 0.96  | 0.96      |
| Accuracy     |           |       | **0.96**  |
| Test Samples | 5400      | 5414  | 10814     |

### Confusion Matrix

|                     | Pred: Non-Toxic | Pred: Toxic |
|---------------------|-----------------|-------------|
| **True: Non-Toxic** | 5154            | 246         |
| **True: Toxic**     | 200             | 5214        |

## 🧪 Preprocessing Details (cleaned_corrected_text)

The model is trained on the `cleaned_corrected_text` column, which is derived from `corrected_text` using basic regex-based cleaning steps and manual slang filtering. Here's how:

### 🔧 Cleaning Function

```python
import re

def clean_corrected_text(text):
    text = text.lower()                                                      # lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # remove URLs
    text = re.sub(r"@\w+", '', text)                                         # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)                                 # remove special characters (e.g., emojis)
    text = re.sub(r"\s+", ' ', text).strip()                                 # normalize whitespace
    return text
```

### 🧹 Manual Slang Filtering

```python
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def remove_slang(text):
    # Naive substring replacement: each slang word is stripped wherever it appears
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
```

### ✅ Applied Steps Summary

| Step                       | Description |
|----------------------------|-------------|
| Lowercasing                | All text is converted to lowercase |
| URL removal                | Removes links starting with http, https, or www |
| Mention removal            | Removes @username-style mentions |
| Special character removal  | Removes emojis and symbols (😊, *, %, $, ^, etc.) |
| Whitespace normalization   | Collapses multiple spaces into one |
| Slang word removal         | Removes common informal words such as "kanka", "lan", etc. |

📌 **Conclusion**: `cleaned_corrected_text` is a lightly cleaned text column with no further linguistic processing (no stemming or lemmatization). The model is trained directly on this column.
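For illustration, the two helpers above can be chained to reproduce the full preprocessing step. The snippet below is a sketch that assumes `clean_corrected_text` and `remove_slang` from the code above are in scope; the `preprocess` wrapper and the sample sentence are illustrative only, not part of the original pipeline:

```python
def preprocess(text):
    # Apply the regex-based cleaning first, then strip slang words
    return remove_slang(clean_corrected_text(text))

sample = "Kanka şu linke bak http://example.com @ahmet çok saçma 😊!!"
print(preprocess(sample))
# -> "şu linke bak çok saçma !!"
```

Because the cleaning step lowercases the text first, the lowercase entries in `slang_words` match regardless of the original casing.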
## 💡 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")
model.eval()

def predict_toxicity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return "Toxic" if predicted == 1 else "Non-Toxic"
```

## 🛠 Training Details

- **Trainer**: Hugging Face `Trainer` API (see the configuration sketch at the bottom of this card)
- **Epochs**: 3
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Eval strategy**: epoch-based
- **Undersampling**: applied to balance the class distribution

## 📁 Dataset

Dataset used: [`Overfit-GM/turkish-toxic-language`](https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language)

Final dataset size after preprocessing and balancing: 54,068 samples
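The exact training script is not part of this card, but the hyperparameters listed under Training Details map onto the `Trainer` API roughly as follows. This is a minimal sketch, not the original code; the output directory and the dataset variables `train_ds` / `eval_ds` (pre-tokenized, undersampled splits) are assumptions:

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Base model named in this card; 2 labels for the binary toxic / non-toxic task
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="toxicity-model",     # assumed output path
    num_train_epochs=3,              # Epochs: 3
    per_device_train_batch_size=16,  # Batch size: 16
    per_device_eval_batch_size=16,
    learning_rate=2e-5,              # Learning rate: 2e-5
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # assumed name: tokenized, undersampled training split
    eval_dataset=eval_ds,    # assumed name: tokenized evaluation split
)
trainer.train()
```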