atahanuz/bert-offensive-classifier

@@ -6,72 +6,89 @@ tags:
 - bert
 - offensive-language-detection
 - turkish
 datasets:
 - offenseval-tr
 metrics:
 - accuracy
 - f1
-model_name: atahanuz/bert-classifier
 base_model: boun-tabilab/TabiBERT
 ---
-# atahanuz/bert-classifier
-This model is a fine-tuned version of [boun-tabilab/TabiBERT](https://huggingface.co/boun-tabilab/TabiBERT) on the **OffensEval-2020-TR** dataset . It is designed to detect offensive language in Turkish text.
-## Model Details
--   **Model:** BERT (TabiBERT)
--   **Language:** Turkish
--   **Task:** Binary Classification (Offensive vs Not Offensive)
--   **Trained by:** atahanuz
--   **Dataset Size:**
-    -   Training: 31,277 samples
-    -   Test: 3,529 samples
-## Performance
-The model achieved the following results on the evaluation set:
--   **Accuracy:** 0.936
--   **F1 Score:** 0.912
-## Label Mapping
-| Label ID | Label Name | Meaning |
-| :--- | :--- | :--- |
-| 0 | **NOT** | Not Offensive |
-| 1 | **OFF** | Offensive |
-## Usage
-You can use this model directly with the Hugging Face `transformers` library.
-### Single Input Prediction
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch
-# Load model and tokenizer
-model_name = "atahanuz/bert-offensive-classifier"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForSequenceClassification.from_pretrained(model_name)
-# Define label mapping (0: NOT, 1: OFF)
 id2label = {0: "NOT", 1: "OFF"}
-# Input text
-text = "Bu harika bir filmdi, çok beğendim." # Example: "This was a great movie, I liked it a lot."
-# text = "Allah belanı versin." # Example of offensive text
-# Tokenize and predict
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
 with torch.no_grad():
     logits = model(**inputs).logits
-# Get predicted class
 predicted_class_id = logits.argmax().item()
 predicted_label = id2label[predicted_class_id]
 confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()
@@ -80,8 +97,49 @@ print(f"Text: {text}")
 print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
 ```
-## Reference
-If you use this model or dataset, please cite the OffensEval-2020 paper:
-[SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)](https://arxiv.org/pdf/2006.07235)

 - bert
 - offensive-language-detection
 - turkish
+- boun-tabilab
 datasets:
 - offenseval-tr
 metrics:
 - accuracy
 - f1
+model-index:
+- name: atahanuz/bert-classifier
+  results:
+  - task:
+      type: text-classification
+      name: Text Classification
+    dataset:
+      name: OffensEval-2020-TR
+      type: offenseval-tr
+    metrics:
+      - name: Accuracy
+        type: accuracy
+        value: 0.936
+      - name: F1
+        type: f1
+        value: 0.912
 base_model: boun-tabilab/TabiBERT
 ---
+# Turkish Offensive Language Classifier (BERT)
+This model is a fine-tuned version of [**boun-tabilab/TabiBERT**](https://huggingface.co/boun-tabilab/TabiBERT) trained on the **OffensEval-2020-TR** dataset. It is designed to perform binary classification to detect offensive language in Turkish text.
+## 📊 Model Details
+| Feature | Description |
+| :--- | :--- |
+| **Model Architecture** | BERT (Base Uncased Turkish - TabiBERT) |
+| **Task** | Binary Text Classification (Offensive vs. Not Offensive) |
+| **Language** | Turkish (tr) |
+| **Dataset** | OffensEval 2020 (Turkish Subtask) |
+| **Trained By** | atahanuz |
+## 🚀 Usage
+The easiest way to use this model is via the Hugging Face `pipeline`.
+### Method 1: Using the Pipeline (Recommended)
+```python
+from transformers import pipeline
+# Initialize the pipeline
+classifier = pipeline("text-classification", model="atahanuz/bert-classifier")
+# Predict
+text = "Bu harika bir filmdi, çok beğendim."
+result = classifier(text)
+print(result)
+# Output: [{'label': 'NOT', 'score': 0.99...}]
+```
+### Method 2: Manual PyTorch Implementation
+If you need direct access to the tokenized inputs or the raw logits, load the tokenizer and model explicitly with `AutoTokenizer` and `AutoModelForSequenceClassification`:
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch
+# 1. Load model and tokenizer
+model_name = "atahanuz/bert-classifier"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# 2. Define label mapping
 id2label = {0: "NOT", 1: "OFF"}
+# 3. Tokenize and predict
+text = "Bu harika bir filmdi, çok beğendim." # Example text
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
 with torch.no_grad():
     logits = model(**inputs).logits
+# 4. Get results
 predicted_class_id = logits.argmax().item()
 predicted_label = id2label[predicted_class_id]
 confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()
 print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
 ```
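The confidence printed by the snippet above is the softmax probability of the predicted class. A minimal sketch of that decoding step in plain Python follows. The helper names `softmax` and `decode` and the logit values are hypothetical, chosen only for illustration; the label mapping is the one given in this card.

```python
import math

# Label mapping as given in this model card.
id2label = {0: "NOT", 1: "OFF"}

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return (label, confidence) for one row of logits."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return id2label[class_id], probs[class_id]

# Hypothetical logits for a single input; not real model output.
label, confidence = decode([2.0, -1.0])
print(f"Prediction: {label} (Confidence: {confidence:.4f})")
```

With two classes, the confidence is always at least 0.5, so a value near 0.5 signals a borderline prediction rather than a reliable one.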
+## 🏷️ Label Mapping
+The model outputs the following labels:
+| Label ID | Label Name | Description |
+| :--- | :--- | :--- |
+| `0` | **NOT** | **Not Offensive** - Normal, non-hateful speech. |
+| `1` | **OFF** | **Offensive** - Contains insults, threats, or inappropriate language. |
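For moderation-style use, the label and score can be combined into a single yes/no decision. The sketch below assumes results in the list-of-dicts format shown in the pipeline example; the function name `flag_offensive`, the threshold value, and the sample scores are hypothetical and not part of the model card.

```python
def flag_offensive(results, threshold=0.5):
    """Return True for each result labeled OFF with a score at or above threshold."""
    return [r["label"] == "OFF" and r["score"] >= threshold for r in results]

# Hand-written results in the pipeline's output format, for illustration only.
sample = [
    {"label": "NOT", "score": 0.99},
    {"label": "OFF", "score": 0.97},
    {"label": "OFF", "score": 0.55},
]
print(flag_offensive(sample, threshold=0.9))
```

Raising the threshold trades recall for precision: fewer texts are flagged, and those that are flagged carry higher model confidence.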
+## 📈 Performance
+The model was evaluated on the test split of the OffensEval-2020-TR dataset (3,529 samples).
+- **Accuracy:** `93.6%`
+- **F1 Score:** `91.2%`
+### Dataset Statistics
+- **Training Samples:** 31,277
+- **Test Samples:** 3,529
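The card does not state how the F1 score was averaged. As a rough illustration, the sketch below computes accuracy and macro-averaged F1 from gold and predicted label IDs. Macro averaging is the official OffensEval metric, but its use for the figure reported here is an assumption. The helper names and the toy labels are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for_class(gold, pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)

# Toy labels (0 = NOT, 1 = OFF), for illustration only.
gold = [0, 0, 0, 1, 1, 0, 1, 0]
pred = [0, 0, 1, 1, 0, 0, 1, 0]
print(f"Accuracy: {accuracy(gold, pred):.3f}")
print(f"Macro F1: {macro_f1(gold, pred):.3f}")
```

Because offensive examples are usually the minority class, accuracy alone can look strong even when the `OFF` class is detected poorly, which is why F1 is reported alongside it.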
+## ⚠️ Limitations and Bias
+* **Context Sensitivity:** Like many BERT models, this classifier may struggle with sarcasm or offensive language that depends heavily on context not present in the input sentence.
+* **Dataset Bias:** The model is trained on social media data (OffensEval). It may reflect biases present in that specific dataset or struggle with formal/archaic Turkish.
+* **False Positives:** Certain colloquialisms or "tough love" expressions might be misclassified as offensive.
+## 📚 Citation
+If you use this model or the dataset, please cite the original OffensEval paper:
+```bibtex
+@inproceedings{zampieri-etal-2020-semeval,
+    title = "{SemEval}-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({OffensEval} 2020)",
+    author = "Zampieri, Marcos  and
+      Nakov, Preslav  and
+      Rosenthal, Sara  and
+      Atanasova, Pepa  and
+      Karadzhov, Georgi  and
+      Mubarak, Hamdy  and
+      Derczynski, Leon  and
+      Pitenis, Zeses  and
+      Çöltekin, Çağrı",
+    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
+    year = "2020",
+    publisher = "International Committee for Computational Linguistics",
+}
+```