---
license: mit
language:
- en
---

## Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models
(Findings of EMNLP 2023)

Paper: https://aclanthology.org/2023.findings-emnlp.138/

GitHub: https://github.com/deep-over/FiLM

### **FiLM** (**Fi**nancial **L**anguage **M**odel) Models 🌟

FiLM is a pre-trained language model (PLM) optimized for the financial domain and built on a diverse range of financial corpora. Initialized from the RoBERTa-base model, FiLM is further pre-trained and, for the first time, achieves performance that surpasses RoBERTa-base in the financial domain.

To train FiLM, we categorized our financial corpus into specific groups and gathered a diverse range of corpora across those groups.

The model can also be referred to as Fin-RoBERTa (Financial RoBERTa).

We offer two versions of FiLM, each tailored to specific use cases in the financial domain:

**FiLM (2.4B): Our Base Model**

This is our foundational model, trained on the entire range of corpora outlined in the corpus table below. It is suited to a wide array of financial applications. 📊

[**FiLM (5.5B): Optimized for SEC Filings**](https://huggingface.co/HYdsl/FiLM-SEC)

This model is specialized for SEC filings. We expanded the training set by adding 3.1 billion tokens from an SEC filings corpus, bringing the total from 2.4B to 5.5B tokens. The added data comes from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021), which can be downloaded from Zenodo. 📑

To load the tokenizer and the model, use the `transformers` Auto classes.
Because FiLM is initialized from RoBERTa-base, the tokenizer can be loaded from `roberta-base`.
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModel.from_pretrained('HYdsl/FiLM')
```
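The two versions described above can be selected by repository ID. The following is a minimal sketch, not an official API: the helper names (`resolve_repo`, `load_film`, `encode`) and the variant keys (`"base"`, `"sec"`) are hypothetical, and it assumes that the SEC variant also works with the `roberta-base` tokenizer, which this README states only for the base FiLM model.

```python
# Hub repository IDs as they appear in this README; the keys are illustrative.
FILM_REPOS = {
    "base": "HYdsl/FiLM",     # FiLM (2.4B)
    "sec": "HYdsl/FiLM-SEC",  # FiLM (5.5B), SEC filings
}


def resolve_repo(variant: str = "base") -> str:
    """Return the Hub repository ID for a FiLM variant (hypothetical helper)."""
    if variant not in FILM_REPOS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(FILM_REPOS)}"
        )
    return FILM_REPOS[variant]


def load_film(variant: str = "base"):
    """Load the tokenizer and encoder for the chosen FiLM variant."""
    # Imported lazily so resolve_repo can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    # Assumption: both variants share the roberta-base tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained(resolve_repo(variant))
    return tokenizer, model


def encode(text: str, variant: str = "base"):
    """Return token-level hidden states for one sentence."""
    import torch

    tokenizer, model = load_film(variant)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Tensor of shape (batch, sequence_length, hidden_size).
    return outputs.last_hidden_state
```

Calling `encode("The company reported higher quarterly revenue.")` downloads the weights on first use and returns one hidden-state vector per token. For fine-tuning on a downstream task, a task-specific Auto class such as `AutoModelForSequenceClassification` can be used in place of `AutoModel`.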

**Types of Training Corpora 📚**
![image.png](https://cdn-uploads.huggingface.co/production/uploads/65254614785092cd47b1110b/-cT_wOabHugsct1mogOpa.png)

#Finance #Financial #RoBERTa