---
language: "tr"
tags:
  - "bert"
  - "turkish"
  - "text-classification"
license: "apache-2.0"
datasets:
  - "custom"
metrics:
  - "precision"
  - "recall"
  - "f1"
  - "accuracy"
---


# BERT-based Organization Detection Model for Turkish Texts

## Model Description

This model is fine-tuned from `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's effort to analyze organizational presence in social media data.

## Model Architecture

- **Base Model:** BERT (dbmdz/bert-base-turkish-uncased)
- **Training Data:** Twitter data from 3,922 accounts flagged as likely organizations by m3inference (organization scores above 0.7). A human annotator then labeled each account based on its user name, screen name, and profile description. A sketch of the filtering step follows below.
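
The organization scores come from m3inference (the M3 demographic-inference library). The snippet below is a minimal sketch of what that pre-filtering step might have looked like; the account records are illustrative placeholders, while `M3Inference` and its `org` output field follow the library's documented API, and the 0.7 threshold comes from the description above.

```python
from m3inference import M3Inference

# Text-only variant of the M3 model (no profile images required).
m3 = M3Inference(use_full_model=False)

# Hypothetical account records; the real input came from the Twitter API.
accounts = [
    {"id": "1", "name": "Örnek Dernek", "screen_name": "ornek_dernek",
     "description": "Resmî dernek hesabı.", "lang": "tr"},
]

preds = m3.infer(accounts)  # maps account id -> demographic estimates

# Keep accounts whose organization probability exceeds 0.7,
# the threshold described above.
org_candidates = [
    acc for acc in accounts
    if preds[acc["id"]]["org"]["is-org"] > 0.7
]
```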

## Training Setup

- **Tokenization:** Hugging Face's AutoTokenizer, with sequences padded and truncated to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters** (mirrored in the sketch after this list):
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01
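
A minimal training setup under these settings might look as follows. The dataset objects (`train_ds`, `val_ds`) and the `text` column name are assumptions; the tokenizer, padding length, and training arguments mirror the list above.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")

def tokenize(batch):
    # Pad/truncate every sequence to 128 tokens, as described above.
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="turkish-org-classifier",  # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased", num_labels=2)

# train_ds / val_ds stand in for the tokenized 80/20 split,
# e.g. produced via raw_ds.map(tokenize, batched=True).
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
```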

## Hyperparameter Tuning

Hyperparameters were tuned with Optuna; the best trial used the following settings (a search sketch follows the list):
- **Learning rate:** 1.2323083424093641e-05
- **Batch size:** 32
- **Epochs:** 2
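
One way to run such a search is through `Trainer.hyperparameter_search` with the Optuna backend. The sketch below reuses `args`, `train_ds`, and `val_ds` from the training sketch; the search ranges and the macro-F1 objective are assumptions, and only the best values listed above come from the actual tuning run.

```python
from sklearn.metrics import f1_score

def model_init():
    # A fresh model per trial, as hyperparameter_search requires.
    return AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-uncased", num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    return {"f1": f1_score(labels, preds, average="macro")}

def hp_space(trial):
    # Illustrative search ranges around the reported best values.
    return {
        "learning_rate": trial.suggest_float(
            "learning_rate", 1e-6, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

trainer = Trainer(model_init=model_init, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  compute_metrics=compute_metrics)
best_run = trainer.hyperparameter_search(
    direction="maximize", backend="optuna", hp_space=hp_space, n_trials=20)
print(best_run.hyperparameters)
```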

## Evaluation Metrics

- **Precision on Validation Set:** 0.94 (organization class)
- **Recall on Validation Set:** 0.95 (organization class)
- **F1-Score (Macro Average):** 0.95
- **Accuracy:** 0.95
- **Confusion Matrix on Validation Set:**

  ```
  [[369,  22],
   [ 19, 375]]
  ```

- **Hand-coded Sample of 1,000 Accounts:**
  - **Precision:** 0.91 (organization class)
  - **F1-Score (Macro Average):** 0.947
  - **Confusion Matrix:**

    ```
    [[936,   3],
     [  4,  31]]
    ```
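
The reported figures can be reproduced from the confusion matrices, assuming rows are true labels, columns are predictions, and class 1 is the organization class. A small scikit-learn sketch, using the hand-coded matrix above:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hand-coded sample above: rows = true class, columns = predicted class.
cm = np.array([[936, 3],
               [4, 31]])

# Expand the matrix into per-example label arrays.
y_true, y_pred = [], []
for i in range(2):
    for j in range(2):
        y_true += [i] * cm[i, j]
        y_pred += [j] * cm[i, j]

# Reports precision ≈ 0.91 for class 1 and macro F1 ≈ 0.947,
# matching the numbers above.
print(classification_report(y_true, y_pred, digits=3))
```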

## How to Use

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "atsizelti/turkish_org_classifier_hand_coded"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Örnek metin buraya girilir."  # "Example text goes here."
inputs = tokenizer(text, return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
```
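
For quick experiments, the same checkpoint can also be loaded through the `pipeline` API, which handles tokenization and returns human-readable labels. Note that the label names depend on the `id2label` mapping stored in the model config:

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="atsizelti/turkish_org_classifier_hand_coded")
print(classifier("Örnek metin buraya girilir."))
```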