HYdsl committed on
Commit
2a53bd3
·
1 Parent(s): 526d08a

Update README.md

Files changed (1)
  1. README.md +30 -0
README.md CHANGED
@@ -1,3 +1,33 @@
  ---
  license: mit
+ language:
+ - en
  ---
+ ---
+
+ Update README.md
+ ## Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models
+ (EMNLP 2023 findings)
+
+ Paper: https://arxiv.org/abs/2310.13312
+
+ Github: https://github.com/deep-over/FiLM
+
+ ### **FiLM**(**Fi**nancial **L**anguage **M**odel) Models 🌟
+
+ FiLM is a Pre-trained Language Model (PLM) optimized for the Financial domain, built upon a diverse range of Financial domain corpora. Initialized with the RoBERTa-base model, FiLM undergoes further training to achieve performance that surpasses RoBERTa-base in financial domain for the first time.
+
+ To train FiLM, we have categorized our Financial Corpus into specific groups and gathered a diverse range of corpora to ensure optimal performance.
+
+ We offer two versions of the FiLM model, each tailored for specific use-cases in the Financial domain:
+
+ [**FiLM (2.4B): Our Base Model**](https://huggingface.co/HYdsl/FiLM)
+
+ This is our foundational model, trained on the entire range of corpora as outlined in the above Corpus table. Ideal for a wide array of financial applications. 📊
+
+ **FiLM (5.5B): Optimized for SEC Filings**
+
+ This model is specialized for handling SEC filings. We expanded the training set by adding 3.1 billion tokens from the SEC filings corpus dataset. The dataset is sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021) and can be downloaded from Zenodo. 📑
+
+ **Types of Training Corpora 📚**
+ ![image.png](https://cdn-uploads.huggingface.co/production/uploads/65254614785092cd47b1110b/-cT_wOabHugsct1mogOpa.png)