atahanuz committed on
Commit d405036 · verified · 1 Parent(s): 064ec18

Update README.md

Files changed (1): README.md (+93, −35)
README.md CHANGED
@@ -6,72 +6,89 @@ tags:
  - bert
  - offensive-language-detection
  - turkish
  datasets:
  - offenseval-tr
  metrics:
  - accuracy
  - f1
- model_name: atahanuz/bert-classifier
  base_model: boun-tabilab/TabiBERT
  ---

- # atahanuz/bert-classifier

- This model is a fine-tuned version of [boun-tabilab/TabiBERT](https://huggingface.co/boun-tabilab/TabiBERT) on the **OffensEval-2020-TR** dataset. It is designed to detect offensive language in Turkish text.

- ## Model Details

- - **Model:** BERT (TabiBERT)
- - **Language:** Turkish
- - **Task:** Binary Classification (Offensive vs Not Offensive)
- - **Trained by:** atahanuz
- - **Dataset Size:**
- - Training: 31,277 samples
- - Test: 3,529 samples

- ## Performance

- The model achieved the following results on the evaluation set:

- - **Accuracy:** 0.936
- - **F1 Score:** 0.912

- ## Label Mapping

- | Label ID | Label Name | Meaning |
- | :--- | :--- | :--- |
- | 0 | **NOT** | Not Offensive |
- | 1 | **OFF** | Offensive |

- ## Usage

- You can use this model directly with the Hugging Face `transformers` library.

- ### Single Input Prediction

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

- # Load model and tokenizer
- model_name = "atahanuz/bert-offensive-classifier"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSequenceClassification.from_pretrained(model_name)

- # Define label mapping (0: NOT, 1: OFF)
  id2label = {0: "NOT", 1: "OFF"}

- # Input text
- text = "Bu harika bir filmdi, çok beğendim." # Example: "This was a great movie, I liked it a lot."
- # text = "Allah belanı versin." # Example of offensive text
-
- # Tokenize and predict
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

  with torch.no_grad():
      logits = model(**inputs).logits

- # Get predicted class
  predicted_class_id = logits.argmax().item()
  predicted_label = id2label[predicted_class_id]
  confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()
@@ -80,8 +97,49 @@ print(f"Text: {text}")
  print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
  ```

- ## Reference

- If you use this model or dataset, please cite the OffensEval-2020 paper:

- [SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)](https://arxiv.org/pdf/2006.07235)
  - bert
  - offensive-language-detection
  - turkish
+ - boun-tabilab
  datasets:
  - offenseval-tr
  metrics:
  - accuracy
  - f1
+ model-index:
+ - name: atahanuz/bert-classifier
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: OffensEval-2020-TR
+       type: offenseval-tr
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.936
+     - name: F1
+       type: f1
+       value: 0.912
  base_model: boun-tabilab/TabiBERT
  ---

+ # Turkish Offensive Language Classifier (BERT)

+ This model is a fine-tuned version of [**boun-tabilab/TabiBERT**](https://huggingface.co/boun-tabilab/TabiBERT) trained on the **OffensEval-2020-TR** dataset. It performs binary classification to detect offensive language in Turkish text.

+ ## 📊 Model Details

+ | Feature | Description |
+ | :--- | :--- |
+ | **Model Architecture** | BERT (Base Uncased Turkish - TabiBERT) |
+ | **Task** | Binary Text Classification (Offensive vs. Not Offensive) |
+ | **Language** | Turkish (tr) |
+ | **Dataset** | OffensEval 2020 (Turkish Subtask) |
+ | **Trained By** | atahanuz |

+ ## 🚀 Usage

+ The easiest way to use this model is via the Hugging Face `pipeline`.

+ ### Method 1: Using the Pipeline (Recommended)

+ ```python
+ from transformers import pipeline
+
+ # Initialize the pipeline
+ classifier = pipeline("text-classification", model="atahanuz/bert-classifier")
+
+ # Predict
+ text = "Bu harika bir filmdi, çok beğendim."
+ result = classifier(text)
+
+ print(result)
+ # Output: [{'label': 'NOT', 'score': 0.99...}]
+ ```

+ ### Method 2: Manual PyTorch Implementation

+ If you need more control over the tokens or logits, use the standard `AutoModel` approach:

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

+ # 1. Load model and tokenizer
+ model_name = "atahanuz/bert-classifier"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSequenceClassification.from_pretrained(model_name)

+ # 2. Define label mapping
  id2label = {0: "NOT", 1: "OFF"}

+ # 3. Tokenize and predict
+ text = "Bu harika bir filmdi, çok beğendim." # Example text
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

  with torch.no_grad():
      logits = model(**inputs).logits

+ # 4. Get results
  predicted_class_id = logits.argmax().item()
  predicted_label = id2label[predicted_class_id]
  confidence = torch.softmax(logits, dim=1)[0][predicted_class_id].item()

  print(f"Text: {text}")
  print(f"Prediction: {predicted_label} (Confidence: {confidence:.4f})")
  ```

+ ## 🏷️ Label Mapping

+ The model outputs the following labels:

+ | Label ID | Label Name | Description |
+ | :--- | :--- | :--- |
+ | `0` | **NOT** | **Not Offensive** - Normal, non-hateful speech. |
+ | `1` | **OFF** | **Offensive** - Contains insults, threats, or inappropriate language. |
+
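Outside of `transformers`, the mapping from raw class logits to one of these labels is just a softmax followed by an argmax. A minimal pure-Python sketch (the example logits below are made up for illustration, not real model outputs):

```python
import math

id2label = {0: "NOT", 1: "OFF"}

def classify(logits):
    """Map two class logits to (label, confidence) via softmax + argmax."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]

print(classify([3.1, -2.4]))   # confidently NOT
print(classify([-1.2, 3.4]))   # confidently OFF
```

This mirrors what `torch.softmax(...)` plus `logits.argmax()` do in the snippets above.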
+ ## 📈 Performance

+ The model was evaluated on the test split of the OffensEval-2020-TR dataset (3,529 samples).

+ - **Accuracy:** `93.6%`
+ - **F1 Score:** `91.2%`

+ ### Dataset Statistics
+ - **Training Samples:** 31,277
+ - **Test Samples:** 3,529
+
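For reference, both metrics follow from the standard binary confusion-matrix formulas. The counts below are illustrative placeholders only (the card reports just the final scores, not a confusion matrix):

```python
def metrics(tp, fp, fn, tn):
    """Binary-classification accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, f1

# Hypothetical counts summing to 3,529 test samples -- NOT the model's actual results.
acc, f1 = metrics(tp=640, fp=70, fn=90, tn=2729)
print(f"accuracy={acc:.3f}  f1={f1:.3f}")
```

Note that accuracy can exceed F1 on an imbalanced test set like this one, where the majority NOT class dominates.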
+ ## ⚠️ Limitations and Bias

+ * **Context Sensitivity:** Like many BERT models, this classifier may struggle with sarcasm or offensive language that depends heavily on context not present in the input sentence.
+ * **Dataset Bias:** The model is trained on social media data (OffensEval). It may reflect biases present in that dataset or struggle with formal or archaic Turkish.
+ * **False Positives:** Certain colloquialisms or "tough love" expressions might be misclassified as offensive.

+ ## 📚 Citation

+ If you use this model or the dataset, please cite the original OffensEval paper:

+ ```bibtex
+ @inproceedings{zampieri-etal-2020-semeval,
+     title = "{SemEval}-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({OffensEval} 2020)",
+     author = "Zampieri, Marcos and
+       Nakov, Preslav and
+       Rosenthal, Sara and
+       Atanasova, Pepa and
+       Karadzhov, Georgi and
+       Mubarak, Hamdy and
+       Derczynski, Leon and
+       Pitenis, Zeses and
+       {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}",
+     booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
+     year = "2020",
+     publisher = "International Committee for Computational Linguistics",
+ }
+ ```