---
license: apache-2.0
language:
- de
---
# German FinBERT (Pre-trained From Scratch Version)

7
+ German FinBERT is a BERT language model focusing on the financial domain within the German language. In my [paper](https://arxiv.org/pdf/2311.08793.pdf), I describe in more detail the steps taken to train the model and show that it outperforms its generic benchmarks for finance specific downstream tasks.
8
+ This version of German FinBERT is pre-trained from scratch on German finance specific textual data, starting with the Bert-base architecture and the vocabulary of the [bert-base-german-cased model](https://huggingface.co/bert-base-german-cased) of Deepset.
9
+
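The model can be used directly for masked-token prediction via the `transformers` library. A minimal sketch, assuming this repository's checkpoint id is `scherrmann/GermanFinBERT_SC` (adjust to the actual model id if it differs):

```python
from transformers import pipeline

# Load German FinBERT as a fill-mask pipeline (checkpoint id assumed, see above).
fill_mask = pipeline("fill-mask", model="scherrmann/GermanFinBERT_SC")

# Predict the masked token in a German financial sentence.
for prediction in fill_mask("Der Vorstand schlägt eine [MASK] von 2,00 Euro je Aktie vor."):
    print(f"{prediction['token_str']:>15} (score: {prediction['score']:.3f})")
```
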
## Overview
- **Author:** Moritz Scherrmann
- **Paper:** [here](https://arxiv.org/pdf/2311.08793.pdf)
- **Architecture:** BERT base
- **Language:** German
- **Specialization:** Financial textual data
- **Framework:** [MosaicML](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)

## Pre-training
German FinBERT's pre-training corpus includes a diverse range of financial documents, such as Bundesanzeiger reports, Handelsblatt articles, MarketScreener data, and additional sources including FAZ, ad-hoc announcements, LexisNexis & Event Registry content, Zeit Online articles, Wikipedia entries, and Gabler Wirtschaftslexikon. In total, the corpus spans the years 1996 to 2023 and consists of 12.15 million documents with 10.12 billion tokens over 53.19 GB.

With a batch size of 4,096, I train the German FinBERT model for 174,000 steps, amounting to more than 17 epochs. I use an Adam optimizer with decoupled weight decay regularization, with Adam parameters β₁ = 0.9, β₂ = 0.98 and ε = 1e-6, a weight decay of 1e-5, and a maximal learning rate of 5e-4. I train the model on an Nvidia DGX A100 node consisting of 8 A100 GPUs with 80 GB of memory each.

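The training itself uses the MosaicML BERT benchmark stack linked above; purely as an illustration of the reported hyperparameters, the equivalent PyTorch AdamW configuration would look roughly like this (a sketch, not the original training code):

```python
import torch
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

# Reuse the vocabulary of deepset's German BERT, as described above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# BERT-base architecture sized to that vocabulary, with a masked-LM head.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Adam with decoupled weight decay, using the hyperparameters reported above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,            # maximal learning rate; a schedule decays it during training
    betas=(0.9, 0.98),  # Adam beta parameters
    eps=1e-6,
    weight_decay=1e-5,
)
```
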
## Performance
### Fine-tune Datasets
To fine-tune the model, I use several datasets, including:
- A manually labeled [multi-label database of German ad-hoc announcements](https://arxiv.org/pdf/2311.07598.pdf) containing 31,771 sentences, each associated with up to 20 possible topics.
- An extractive question-answering dataset in the SQuAD format, created by having OpenAI's ChatGPT generate and answer questions for 3,044 ad-hoc announcements (see [here](https://huggingface.co/datasets/scherrmann/adhoc_quad), and the loading sketch after this list).
- The [financial phrase bank](https://arxiv.org/abs/1307.5336) of Malo et al. (2013) for sentiment classification, translated to German using [DeepL](https://www.deepl.com/translator).

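The ad-hoc QA dataset is hosted on the Hugging Face Hub and can be pulled with the `datasets` library. A minimal loading sketch (the split and field names are assumed to follow the usual SQuAD-style layout):

```python
from datasets import load_dataset

# Load the Ad-Hoc QuAD dataset from the Hugging Face Hub.
dataset = load_dataset("scherrmann/adhoc_quad")

# Inspect the splits and a sample record (field names assumed SQuAD-like).
print(dataset)
print(dataset["train"][0])
```
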
### Benchmark Results
The from-scratch pre-trained German FinBERT model achieves the following performance on finance-specific downstream tasks:

Ad-Hoc Multi-Label Database:
- Macro F1: 85.67%
- Micro F1: 85.17%

Ad-Hoc QuAD (Question Answering):
- Exact Match (EM): 50.23%
- F1 Score: 72.80%

Translated Financial Phrase Bank:
- Accuracy: 95.95%
- Macro F1: 92.70%

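The fine-tuned checkpoints listed under "See also" below can be used directly. For example, a text-classification sketch for the sentiment model (the exact label names are an assumption; check that model's card):

```python
from transformers import pipeline

# Sentiment checkpoint fine-tuned on the translated financial phrase bank.
classifier = pipeline("text-classification", model="scherrmann/GermanFinBERT_SC_Sentiment")

print(classifier("Der Umsatz des Konzerns stieg im dritten Quartal deutlich an."))
```
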
## Authors
Moritz Scherrmann: `scherrmann [at] lmu.de`

For additional details on the fine-tuning datasets and benchmark results, please refer to the [paper](https://arxiv.org/pdf/2311.08793.pdf).

See also:
- scherrmann/GermanFinBERT_FP
- scherrmann/GermanFinBERT_FP_Topic
- scherrmann/GermanFinBERT_FP_QuAD
- scherrmann/GermanFinBERT_SC_Sentiment