atsizelti/turkish_org_classifier_hand_coded

@@ -1,49 +1,80 @@
-### Model Description
-This model is a fine-tuned version of the dbmdz/bert-base-turkish-uncased architecture, specifically designed for the binary classification task of identifying organizational accounts on Turkish Twitter. It leverages the pre-trained BERT model's understanding of Turkish language and context to effectively distinguish between organizational and non-organizational user accounts.
-### Model Training and Optimization
-Base Model: dbmdz/bert-base-turkish-uncased
-Training Data:  The model was trained and validated using a dataset of Twitter accounts (descriptions, names, screen names) with meticulously annotated labels indicating whether each account belongs to an organization or not.
-### Fine-Tuning Process:
-Data Preprocessing:
-Combined user descriptions, names, and screen names into a single text field for input.
-Data Splitting:
-Split the dataset into 80% for training and 20% for validation.
-Tokenization:
-Utilized the AutoTokenizer from Hugging Face to prepare text inputs for the BERT model.
-Hyperparameter Optimization:
-Employed Optuna to find the best combination of learning rate, batch size, and training epochs, resulting in optimal performance and minimizing validation loss.
-Optimal Hyperparameters:
-Learning Rate: 1.23e-5
-Batch Size: 32
-Epochs: 2
-## Evaluation Results
-The fine-tuned model demonstrates excellent performance on the validation set, achieving the following metrics:
-Precision: 0.945
-Recall: 0.95
-F1-Score (Macro): 0.948
-Accuracy: 0.95
-Confusion Matrix:
-[[369  22]
- [ 19 375]]

+---
+language: "tr"
+tags:
+  - "bert"
+  - "turkish"
+  - "text-classification"
+license: "apache-2.0"
+datasets:
+  - "custom"
+metrics:
+  - "precision"
+  - "recall"
+  - "f1"
+  - "accuracy"
+---
+# BERT-based Organization Detection Model for Turkish Texts
+## Model Description
+This model is fine-tuned from `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It is part of the Politus Project's effort to analyze organizational presence in social media data.
+## Model Architecture
+- **Base Model:** BERT (dbmdz/bert-base-turkish-uncased)
+- **Training Data:** Twitter data from 3,922 accounts that m3inference scored above 0.7 for organization-related activity. A human annotator labeled each account based on its user name, screen name, and description.
+## Training Setup
+- **Tokenization:** Used Hugging Face's AutoTokenizer, padding sequences to a maximum length of 128 tokens.
+- **Dataset Split:** 80% training, 20% validation.
+- **Training Parameters:**
+  - Epochs: 3
+  - Training batch size: 8
+  - Evaluation batch size: 16
+  - Warmup steps: 500
+  - Weight decay: 0.01
+## Hyperparameter Tuning
+Hyperparameters were tuned with Optuna. The best trial used:
+- **Learning rate:** 1.2323083424093641e-05
+- **Batch size:** 32
+- **Epochs:** 2
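The two configurations above imply quite different training lengths. As a rough check, the sketch below counts optimizer steps for each one. It assumes a single device, no gradient accumulation, and a validation set of 785 accounts (the sum of the validation confusion matrix entries reported in this card), which leaves 3,137 of the 3,922 accounts for training. The helper name `total_steps` is hypothetical.

```python
import math

TOTAL_ACCOUNTS = 3922
# Assumption: validation size equals the sum of the validation confusion matrix.
VAL_ACCOUNTS = 369 + 22 + 19 + 375
TRAIN_ACCOUNTS = TOTAL_ACCOUNTS - VAL_ACCOUNTS
WARMUP_STEPS = 500


def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    return math.ceil(n_examples / batch_size) * epochs


# "Training Parameters" setup: batch size 8, 3 epochs.
listed = total_steps(TRAIN_ACCOUNTS, batch_size=8, epochs=3)
# Optuna best trial: batch size 32, 2 epochs.
tuned = total_steps(TRAIN_ACCOUNTS, batch_size=32, epochs=2)

for name, steps in [("listed", listed), ("tuned", tuned)]:
    covered = min(WARMUP_STEPS, steps) / steps
    print(f"{name}: {steps} steps, warmup covers {covered:.0%} of training")
```

Under these assumptions the listed setup runs 1,179 steps and the Optuna setup 198, so a 500-step warmup would not finish in the latter. The card does not state whether the warmup setting was carried over to the tuned run.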
+## Evaluation Metrics
+- **Precision on Validation Set:** 0.94 (organization class)
+- **Recall on Validation Set:** 0.95 (organization class)
+- **F1-Score (Macro Average):** 0.95
+- **Accuracy:** 0.95
+- **Confusion Matrix on Validation Set:**
+  ```
+  [[369, 22],
+   [ 19, 375]]
+  ```
+- **Hand-coded Sample of 1000 Accounts:**
+  - **Precision:** 0.91
+  - **F1-Score (Macro Average):** 0.947
+  - **Confusion Matrix:**
+    ```
+    [[936, 3],
+     [  4, 31]]
+    ```
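The reported figures can be checked against the confusion matrices. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes) with the organization class in the second row and column. The card does not state the layout, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(cm):
    """Metrics from [[tn, fp], [fn, tp]] (rows = true class, columns = predicted)."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)            # positive (organization) class
    recall = tp / (tp + fn)               # positive (organization) class
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1_macro": (f1_pos + f1_neg) / 2,  # unweighted mean over both classes
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


validation = binary_metrics([[369, 22], [19, 375]])
hand_coded = binary_metrics([[936, 3], [4, 31]])

for name, metrics in [("validation", validation), ("hand-coded", hand_coded)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under that assumption the output rounds to the precision, recall, macro F1, and accuracy values listed above.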
+## How to Use
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
+model_id = "atsizelti/turkish_org_classifier_hand_coded"
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+text = "Örnek metin buraya girilir."  # "Sample text goes here."
+# Truncating to 128 tokens mirrors the maximum length used in training.
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+outputs = model(**inputs)
+predictions = outputs.logits.argmax(-1)  # index of the highest-scoring class
+```
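To turn the raw logits into a readable label and confidence, something like the following may help. It is a minimal sketch in plain Python with several assumptions: the label order in `ID2LABEL` is hypothetical and should be checked against `model.config.id2label`, and `build_input_text` joins the profile fields with spaces in an arbitrary order. An earlier revision of this card states that descriptions, names, and screen names were combined into a single text field, but not how.

```python
import math

# Hypothetical label order; confirm with model.config.id2label.
ID2LABEL = {0: "non-organization", 1: "organization"}


def build_input_text(name, screen_name, description):
    """Join the profile fields with single spaces, skipping empty ones."""
    parts = [name, screen_name, description]
    return " ".join(p.strip() for p in parts if p and p.strip())


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


text = build_input_text("Örnek Derneği", "ornekdernegi", "Resmi hesap")
# Illustrative logits; in practice pass outputs.logits[0].tolist().
label, prob = decode([-1.2, 2.3])
print(text, "->", label, round(prob, 3))
```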