ff112 committed
Commit b4f2e51 · verified · 1 Parent(s): 195c2ff

Upload FinTurkBERT model
README.md ADDED
@@ -0,0 +1,139 @@
+ ---
+ language:
+ - tr
+ license: apache-2.0
+ tags:
+ - sentiment-analysis
+ - finance
+ - turkish
+ - bert
+ - financial-nlp
+ pipeline_tag: text-classification
+ library_name: transformers
+ ---
+
+ # FinTurkBERT Sentiment
+
+ FinTurkBERT Sentiment is a Turkish financial sentiment model for sentence-level classification from an investor-oriented perspective.
+
+ The model is built on a Turkish BERT backbone that underwent continued pretraining on approximately 1 GB of cleaned Turkish financial text. After this domain-adaptive pretraining, the model received task-adaptive pretraining (TAPT) and was then fine-tuned for 3-class sentiment classification.
+
+ The released checkpoint was selected from our agreement-level experiments on Turkish Financial PhraseBank-style data. Among the tested variants, the strongest held-out test performance came from training on the `66%+` agreement subset, which offered the best trade-off between label quality and data coverage.
+
+ ## Labels
+
+ - `0`: negative
+ - `1`: neutral
+ - `2`: positive
+
+ ## Annotation Philosophy
+
+ The sentiment definition follows the Financial PhraseBank viewpoint:
+
+ - sentiment is judged from the perspective of an investor
+ - the question is whether a sentence implies a negative, neutral, or positive value-relevant impact
+ - vague corporate optimism and procedural statements are often treated as `neutral`
+
+ Because of this, the model is relatively conservative and may classify weak or indirect business optimism as neutral unless the positive financial implication is clear.
+
+ ## Model Behavior
+
+ This model is intentionally conservative. In practice, that means:
+
+ - clearly favorable or clearly adverse financial news is usually classified correctly
+ - routine disclosures, procedural updates, and weakly stated corporate optimism often remain `neutral`
+ - the model prefers missing some borderline positive signals over producing overly aggressive positive or negative predictions
+
+ This behavior is deliberate and matches the investor-oriented annotation style of Financial PhraseBank, where the threshold for assigning positive or negative sentiment is higher than in generic sentiment analysis.
+
+ ## Training Overview
+
+ Training pipeline:
+
+ 1. Start from a Turkish BERT base model
+ 2. Continue pretraining with masked language modeling on approximately 1 GB of Turkish financial text
+ 3. Apply task-adaptive pretraining on financial task text
+ 4. Fine-tune for 3-class sentiment classification
+
+ For supervised fine-tuning, we evaluated multiple agreement-level subsets derived from Financial PhraseBank-style annotations:
+
+ - `100%` agreement
+ - `75%+` agreement
+ - `66%+` agreement
+ - `50%+` agreement
+
+ The final model released here is the `66%+` version.
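The agreement subsets above can be derived with a simple majority-vote filter over per-annotator labels. A minimal sketch, assuming each sentence carries a list of annotator labels (the data layout and field names are illustrative, not the released preprocessing code):

```python
from collections import Counter

def agreement_subset(examples, min_agreement):
    """Keep sentences whose majority label reaches the agreement threshold.

    `examples` is a list of (sentence, [annotator_labels]) pairs -- a
    hypothetical representation of PhraseBank-style annotations.
    """
    kept = []
    for sentence, labels in examples:
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            # Keep the sentence with its majority label.
            kept.append((sentence, label))
    return kept

data = [
    ("s1", ["positive", "positive", "positive"]),  # 100% agreement
    ("s2", ["neutral", "neutral", "positive"]),    # 66% agreement
    ("s3", ["negative", "neutral", "positive"]),   # 33% agreement
]

print(len(agreement_subset(data, 1.00)))  # → 1 (only the unanimous sentence)
print(len(agreement_subset(data, 0.66)))  # → 2 (unanimous + 2-of-3 majority)
```

Lower thresholds trade label quality for coverage, which is exactly the axis the `100%` through `50%+` subsets explore.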
+
+ ## Evaluation Summary
+
+ Held-out PhraseBank-style test results for the final `66%+` model:
+
+ - Accuracy: `80.41%`
+ - Macro-F1: `0.8028`
+
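Macro-F1 is the unweighted mean of the per-class F1 scores over the three labels, so each class counts equally regardless of its frequency. A minimal sketch of the computation (the gold/predicted labels below are illustrative toy values, not the reported test set):

```python
def macro_f1(gold, pred, num_classes=3):
    """Unweighted mean of per-class F1 over all classes."""
    f1s = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall (0 when both are 0).
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / num_classes

gold = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 1]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy, macro_f1(gold, pred))
```

Because the mean is unweighted, a model that ignores a minority class is penalized in macro-F1 even if its accuracy stays high.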
+ ## Intended Use
+
+ This model is intended for:
+
+ - Turkish financial news
+ - company announcements
+ - investor-facing business reporting
+ - market commentary
+ - short sentence-level financial sentiment classification
+
+ The primary purpose of the model is to serve as a conservative financial sentiment classifier for Turkish text. It is especially suitable as a first-stage component in a larger NLP system where reliability and controlled sentiment signaling matter more than aggressive polarity detection.
+
+ ## Limitations
+
+ - The model is optimized for short financial text, not long-document reasoning.
+ - It is conservative by design and may under-predict positive sentiment in softly phrased corporate news.
+ - It follows investor-impact sentiment, not generic emotional tone.
+ - Performance may drop on very informal, highly speculative, or non-news financial text.
+
+ ## Example Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "ff112/FinTurkBERT"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+ # Human-readable names for the three classes (the shipped config uses
+ # generic LABEL_0/1/2 names, so we map them explicitly here).
+ id2label = {
+     0: "negative",
+     1: "neutral",
+     2: "positive",
+ }
+
+ # "The company signed a new investment agreement and aims to
+ # strengthen its market position."
+ text = "Sirket yeni bir yatirim anlasmasi imzaladi ve pazardaki konumunu guclendirmeyi hedefliyor."
+
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=64)
+
+ with torch.no_grad():
+     logits = model(**inputs).logits
+     pred_id = int(torch.argmax(logits, dim=-1))
+
+ print(id2label[pred_id])
+ ```
+
+ ## Recommended Interpretation
+
+ For production use, this model works best as a conservative first-stage classifier inside a larger financial NLP pipeline. It is especially suitable when false strong signals are more harmful than missing some borderline positive or negative cases.
+
+ If your application prefers cautious investor-style sentiment labeling, this model is a good fit. If your application instead needs to treat soft growth language, strategic expansion, or optimistic corporate messaging as strongly positive more often, this model may feel too conservative and should be complemented with a second-stage reviewer or a more permissive model.
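One way to make the conservative first-stage behavior explicit downstream is to accept a polar label only when its softmax probability clears a confidence threshold, and otherwise fall back to neutral. A sketch on raw logits; the threshold value and the `conservative_label` helper are illustrative assumptions, not part of the released model:

```python
import math

def conservative_label(logits, id2label, neutral_id=1, threshold=0.7):
    """Softmax the logits; keep a polar label only if its probability
    clears the threshold, otherwise fall back to neutral."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    if pred != neutral_id and probs[pred] < threshold:
        return id2label[neutral_id]
    return id2label[pred]

id2label = {0: "negative", 1: "neutral", 2: "positive"}
print(conservative_label([0.2, 0.1, 2.5], id2label))  # → "positive" (confident)
print(conservative_label([0.9, 0.6, 1.1], id2label))  # → "neutral" (weak signal)
```

Raising the threshold pushes more borderline cases to neutral, mirroring the model's own investor-oriented caution.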
+
+ ## Citation
+
+ If you use this model, please cite the accompanying FinTurkBERT work.
+
+ ```bibtex
+ @misc{finturkbert,
+   title={FinTurkBERT: Domain-Adaptive Pretraining and Sentiment Analysis for Turkish Financial Texts},
+   author={Deniz Topal and Furkan Yasir Goksu and Faruk Akyol},
+   year={2026}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": 0.1,
+   "dtype": "float32",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "transformers_version": "4.57.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c9bf8de86c3fc9c2661c5ffe450b7c7cb2337804a8de0fc2901e9f1b741fa33
+ size 442502140
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_len": 512,
+   "max_length": 128,
+   "model_max_length": 1000000000,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43814a09433a0c0d5ebc9fe2046d73a12a10454adf2535bfb500c66e5137477d
+ size 5496
vocab.txt ADDED
The diff for this file is too large to render. See raw diff