	---
	license: mit
	language:
	- en
	---
	## Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models
	(Findings of EMNLP 2023)

	Paper: https://aclanthology.org/2023.findings-emnlp.138/

	GitHub: https://github.com/deep-over/FiLM

	### FiLM (Financial Language Model) Models 🌟

	FiLM is a pre-trained language model (PLM) optimized for the financial domain and built on a diverse range of financial corpora. Initialized from RoBERTa-base, FiLM is further pre-trained to achieve performance that, for the first time, surpasses RoBERTa-base in the financial domain.

	To train FiLM, we categorized financial corpora into specific groups and gathered a diverse range of corpora across those groups.

	Our model can be called Fin-RoBERTa (Financial RoBERTa).

	We offer two versions of the FiLM model, each tailored for specific use-cases in the Financial domain:

	[FiLM (2.4B): Our Base Model](https://huggingface.co/HYdsl/FiLM)

	This is our foundational model, trained on the entire range of corpora outlined in the corpus table below. It is suited to a wide range of financial applications. 📊

	FiLM (5.5B): Optimized for SEC Filings (this model, `HYdsl/FiLM-SEC`)

	This model is specialized for handling SEC filings. We expanded the training set by adding 3.1 billion tokens from the SEC filings corpus dataset. The dataset is sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021) and can be downloaded from Zenodo. 📑

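	To load the tokenizer and the model, use the `transformers` library.
	FiLM uses the RoBERTa-base tokenizer, so load the tokenizer from 'roberta-base'.
	```python
	from transformers import AutoModel, AutoTokenizer
	# The tokenizer comes from 'roberta-base'; the weights come from this repository.
	tokenizer = AutoTokenizer.from_pretrained('roberta-base')
	model = AutoModel.from_pretrained('HYdsl/FiLM-SEC')
	```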
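
SEC filings are often much longer than the 512-token input limit of a RoBERTa-base model, so a long filing has to be split before it is passed to the model. The following is a minimal sketch of one way to do that, not the method used in the paper: `chunk_token_ids` and `embed_filing` are hypothetical helpers, the 64-token overlap is an arbitrary choice, and averaging the `<s>` vectors of all windows is only one of several possible pooling strategies.

```python
from typing import List


def chunk_token_ids(ids: List[int], max_len: int = 512, overlap: int = 64,
                    bos_id: int = 0, eos_id: int = 2) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens.

    Each window is wrapped in <s> ... </s> (ids 0 and 2 in the roberta-base vocabulary).
    """
    body = max_len - 2       # room left after adding <s> and </s>
    step = body - overlap    # how far the window start advances each time
    if step <= 0:
        raise ValueError("overlap must be smaller than max_len - 2")
    chunks = []
    start = 0
    while True:
        chunks.append([bos_id] + ids[start:start + body] + [eos_id])
        if start + body >= len(ids):
            break
        start += step
    return chunks


def embed_filing(text: str, model_name: str = "HYdsl/FiLM-SEC"):
    """Hypothetical helper: one vector per filing, the mean of per-window <s> vectors."""
    # Imported lazily so chunk_token_ids can be used without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained(model_name).eval()

    # Tokenize without special tokens; chunk_token_ids adds <s> and </s> per window.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = chunk_token_ids(ids, bos_id=tokenizer.bos_token_id,
                             eos_id=tokenizer.eos_token_id)

    vectors = []
    with torch.no_grad():
        for chunk in chunks:
            out = model(input_ids=torch.tensor([chunk]))
            vectors.append(out.last_hidden_state[0, 0])  # vector at the <s> position
    return torch.stack(vectors).mean(dim=0)
```

Calling `embed_filing(filing_text)` downloads the tokenizer and the model weights on first use and returns a single tensor for the whole filing. For a downstream task, the same windows could instead be fed to a fine-tuned classification head and the per-window predictions aggregated.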

	### Types of Training Corpora 📚

	![image.png](https://cdn-uploads.huggingface.co/production/uploads/65254614785092cd47b1110b/-cT_wOabHugsct1mogOpa.png)