ff112 committed
Commit b4f2e51 · verified · 1 Parent(s): 195c2ff

Upload FinTurkBERT model
README.md ADDED
@@ -0,0 +1,139 @@
+ ---
+ language:
+ - tr
+ license: apache-2.0
+ tags:
+ - sentiment-analysis
+ - finance
+ - turkish
+ - bert
+ - financial-nlp
+ pipeline_tag: text-classification
+ library_name: transformers
+ ---
+
+ # FinTurkBERT Sentiment
+
+ FinTurkBERT Sentiment is a Turkish financial sentiment model for sentence-level classification from an investor-oriented perspective.
+
+ The model is built on a Turkish BERT backbone that underwent continued pretraining on approximately 1 GB of cleaned Turkish financial text. After this domain-adaptive pretraining, the model received task-adaptive pretraining (TAPT) and was then fine-tuned for 3-class sentiment classification.
+
+ The released checkpoint was selected from our agreement-level experiments on Turkish Financial PhraseBank-style data. Among the tested variants, the strongest held-out test performance came from training on the `66%+` agreement subset, which offered the best trade-off between label quality and data coverage.
+
+ ## Labels
+
+ - `0`: negative
+ - `1`: neutral
+ - `2`: positive
+
+ ## Annotation Philosophy
+
+ The sentiment definition follows the Financial PhraseBank viewpoint:
+
+ - sentiment is judged from the perspective of an investor
+ - the question is whether a sentence implies a negative, neutral, or positive value-relevant impact
+ - vague corporate optimism and procedural statements are often treated as `neutral`
+
+ Because of this, the model is relatively conservative and may classify weak or indirect business optimism as neutral unless the positive financial implication is clear.
+
+ ## Model Behavior
+
+ This model is intentionally conservative. In practice, that means:
+
+ - clearly favorable or clearly adverse financial news is usually classified correctly
+ - routine disclosures, procedural updates, and weakly stated corporate optimism often remain `neutral`
+ - the model prefers missing some borderline positive signals over producing overly aggressive positive or negative predictions
+
+ This behavior is deliberate and matches the investor-oriented annotation style of Financial PhraseBank, where the threshold for assigning positive or negative sentiment is higher than in generic sentiment analysis.
+
+ ## Training Overview
+
+ Training pipeline:
+
+ 1. Start from a Turkish BERT base model
+ 2. Continue pretraining with masked language modeling on approximately 1 GB of Turkish financial text
+ 3. Apply task-adaptive pretraining on financial task text
+ 4. Fine-tune for 3-class sentiment classification
+
+ For supervised fine-tuning, we evaluated multiple agreement-level subsets derived from Financial PhraseBank-style annotations:
+
+ - `100%` agreement
+ - `75%+` agreement
+ - `66%+` agreement
+ - `50%+` agreement
+
+ The final model released here is the `66%+` version.
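The agreement subsets above can be derived with a simple majority-vote filter over per-annotator labels. A minimal sketch, assuming each sentence carries a list of annotator labels (the data layout and field names are illustrative, not the released preprocessing code):

```python
from collections import Counter

def agreement_subset(examples, min_agreement):
    """Keep sentences whose majority label reaches the agreement threshold.

    `examples` is a list of (sentence, [annotator_labels]) pairs -- a
    hypothetical representation of PhraseBank-style annotations.
    """
    kept = []
    for sentence, labels in examples:
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            # Keep the sentence with its majority label.
            kept.append((sentence, label))
    return kept

data = [
    ("s1", ["positive", "positive", "positive"]),  # 100% agreement
    ("s2", ["neutral", "neutral", "positive"]),    # 66% agreement
    ("s3", ["negative", "neutral", "positive"]),   # 33% agreement
]

print(len(agreement_subset(data, 1.00)))  # → 1 (only the unanimous sentence)
print(len(agreement_subset(data, 0.66)))  # → 2 (unanimous + 2-of-3 majority)
```

Lower thresholds trade label quality for coverage, which is exactly the axis the `100%` through `50%+` subsets explore.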
+
+ ## Evaluation Summary
+
+ Held-out PhraseBank-style test results for the final `66%+` model:
+
+ - Accuracy: `80.41%`
+ - Macro-F1: `0.8028`
+
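Macro-F1 is the unweighted mean of the per-class F1 scores over the three labels, so each class counts equally regardless of its frequency. A minimal sketch of the computation (the gold/predicted labels below are illustrative toy values, not the reported test set):

```python
def macro_f1(gold, pred, num_classes=3):
    """Unweighted mean of per-class F1 over all classes."""
    f1s = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall (0 when both are 0).
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / num_classes

gold = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 1]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy, macro_f1(gold, pred))
```

Because the mean is unweighted, a model that ignores a minority class is penalized in macro-F1 even if its accuracy stays high.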
+ ## Intended Use
+
+ This model is intended for:
+
+ - Turkish financial news
+ - company announcements
+ - investor-facing business reporting
+ - market commentary
+ - short sentence-level financial sentiment classification
+
+ The primary purpose of the model is to serve as a conservative financial sentiment classifier for Turkish text. It is especially suitable as a first-stage component in a larger NLP system where reliability and controlled sentiment signaling matter more than aggressive polarity detection.
+
+ ## Limitations
+
+ - The model is optimized for short financial text, not long-document reasoning.
+ - It is conservative by design and may under-predict positive sentiment in softly phrased corporate news.
+ - It follows investor-impact sentiment, not generic emotional tone.
+ - Performance may drop on very informal, highly speculative, or non-news financial text.
+
+ ## Example Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "ff112/FinTurkBERT"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+ # Human-readable names for the three classes (the shipped config uses
+ # generic LABEL_0/1/2 names, so we map them explicitly here).
+ id2label = {
+     0: "negative",
+     1: "neutral",
+     2: "positive",
+ }
+
+ # "The company signed a new investment agreement and aims to
+ # strengthen its market position."
+ text = "Sirket yeni bir yatirim anlasmasi imzaladi ve pazardaki konumunu guclendirmeyi hedefliyor."
+
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=64)
+
+ with torch.no_grad():
+     logits = model(**inputs).logits
+     pred_id = int(torch.argmax(logits, dim=-1))
+
+ print(id2label[pred_id])
+ ```
+
+ ## Recommended Interpretation
+
+ For production use, this model works best as a conservative first-stage classifier inside a larger financial NLP pipeline. It is especially suitable when false strong signals are more harmful than missing some borderline positive or negative cases.
+
+ If your application prefers cautious investor-style sentiment labeling, this model is a good fit. If your application instead needs to treat soft growth language, strategic expansion, or optimistic corporate messaging as strongly positive more often, this model may feel too conservative and should be complemented with a second-stage reviewer or a more permissive model.
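One way to make the conservative first-stage behavior explicit downstream is to accept a polar label only when its softmax probability clears a confidence threshold, and otherwise fall back to neutral. A sketch on raw logits; the threshold value and the `conservative_label` helper are illustrative assumptions, not part of the released model:

```python
import math

def conservative_label(logits, id2label, neutral_id=1, threshold=0.7):
    """Softmax the logits; keep a polar label only if its probability
    clears the threshold, otherwise fall back to neutral."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    if pred != neutral_id and probs[pred] < threshold:
        return id2label[neutral_id]
    return id2label[pred]

id2label = {0: "negative", 1: "neutral", 2: "positive"}
print(conservative_label([0.2, 0.1, 2.5], id2label))  # → "positive" (confident)
print(conservative_label([0.9, 0.6, 1.1], id2label))  # → "neutral" (weak signal)
```

Raising the threshold pushes more borderline cases to neutral, mirroring the model's own investor-oriented caution.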
+
+ ## Citation
+
+ If you use this model, please cite the accompanying FinTurkBERT work.
+
+ ```bibtex
+ @misc{finturkbert,
+   title={FinTurkBERT: Domain-Adaptive Pretraining and Sentiment Analysis for Turkish Financial Texts},
+   author={Deniz Topal and Furkan Yasir Goksu and Faruk Akyol},
+   year={2026}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": 0.1,
+   "dtype": "float32",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "transformers_version": "4.57.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c9bf8de86c3fc9c2661c5ffe450b7c7cb2337804a8de0fc2901e9f1b741fa33
+ size 442502140
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_len": 512,
+   "max_length": 128,
+   "model_max_length": 1000000000,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43814a09433a0c0d5ebc9fe2046d73a12a10454adf2535bfb500c66e5137477d
+ size 5496
vocab.txt ADDED
The diff for this file is too large to render. See raw diff