| license: mit | |
| language: | |
| - en | |
| --- | |
| ## Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models | |
| (Findings of EMNLP 2023) | |
| Paper: https://aclanthology.org/2023.findings-emnlp.138/ | |
| GitHub: https://github.com/deep-over/FiLM | |
| ### **FiLM** (**Fi**nancial **L**anguage **M**odel) Models | |
| FiLM is a pre-trained language model (PLM) optimized for the financial domain and built on a diverse range of financial-domain corpora. Initialized from the RoBERTa-base model, FiLM undergoes further pre-training to achieve performance that, for the first time, surpasses RoBERTa-base in the financial domain. | |
| To train FiLM, we have categorized our Financial Corpus into specific groups and gathered a diverse range of corpora to ensure optimal performance. | |
| Our model can also be referred to as Fin-RoBERTa (Financial RoBERTa). | |
| We offer two versions of the FiLM model, each tailored for specific use-cases in the Financial domain: | |
| [**FiLM (2.4B): Our Base Model**](https://huggingface.co/HYdsl/FiLM) | |
| This is our foundational model, trained on the entire range of corpora outlined in the "Types of Training Corpora" table. It is suited to a wide array of financial applications. | |
| **FiLM (5.5B): Optimized for SEC Filings** | |
| This model is specialized for handling SEC filings. We expanded the training set by adding 3.1 billion tokens from an SEC filings corpus. The dataset is sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021) and can be downloaded from Zenodo. | |
| How to load the tokenizer and the model: | |
| For the FiLM model, you can load the tokenizer from 'roberta-base'. | |
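| ```python | |
| from transformers import AutoModel, AutoTokenizer | |
| # The tokenizer comes from 'roberta-base'; the model weights come from 'HYdsl/FiLM-SEC'. | |
| tokenizer = AutoTokenizer.from_pretrained('roberta-base') | |
| model = AutoModel.from_pretrained('HYdsl/FiLM-SEC') | |
| ``` | |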
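As the model is derived from RoBERTa-base, it can also be tried on masked-token prediction. The following is a minimal sketch rather than an official usage recipe: the helper names `mask_word` and `top_predictions` and the example sentence are hypothetical, and it assumes the checkpoint ships a masked-LM head and that `<mask>` (the RoBERTa mask token) applies.

```python
def mask_word(sentence: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` with the mask token.

    `<mask>` is the RoBERTa mask token; it is assumed here because the
    tokenizer is loaded from 'roberta-base'.
    """
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(masked_sentence: str,
                    model_id: str = "HYdsl/FiLM-SEC",
                    tokenizer_id: str = "roberta-base",
                    top_k: int = 5):
    """Return (token, score) pairs for a sentence with exactly one mask token.

    Assumption: the checkpoint includes a masked-LM head that the
    fill-mask pipeline can load.
    """
    # Imported lazily so that mask_word() works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id, tokenizer=tokenizer_id)
    results = fill(masked_sentence, top_k=top_k)
    return [(r["token_str"].strip(), r["score"]) for r in results]
```

For example, `top_predictions(mask_word("The company filed its annual report with the SEC.", "report"))` would download the weights on first use and return the five highest-scoring candidates for the masked position.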
| **Types of Training Corpora** | |
|  |