---
language: pt-br
license: mit
library_name: transformers
tags:
- portuguese
- financial
- bert
- deberta
- nlp
- fill-mask
- masked-lm
datasets:
- FAKE.BR
- CAROSIA
- BBRC
- OFFCOMBR-3
metrics:
- f1
- precision
- recall
- pr_auc
model-index:
- name: DeB3RTa-base
  results:
  - task:
      type: text-classification
      name: Fake News Detection
    dataset:
      type: FAKE.BR
      name: FAKE.BR
    metrics:
    - type: f1
      value: 0.9906
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      type: CAROSIA
      name: CAROSIA
    metrics:
    - type: f1
      value: 0.9207
  - task:
      type: text-classification
      name: Regulatory Classification
    dataset:
      type: BBRC
      name: BBRC
    metrics:
    - type: f1
      value: 0.7609
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      type: OFFCOMBR-3
      name: OFFCOMBR-3
    metrics:
    - type: f1
      value: 0.7539
inference: true
---

# DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain

DeB3RTa is a family of transformer-based language models designed specifically for Portuguese financial text processing. The models are built on the DeBERTa-v2 architecture and trained with a mixed-domain pretraining strategy that combines financial, political, business management, and accounting corpora.

## Model Variants

Two variants are available:

- **DeB3RTa-base**: 12 attention heads, 12 layers, intermediate size of 3072, hidden size of 768 (~426M parameters)
- **DeB3RTa-small**: 6 attention heads, 12 layers, intermediate size of 1536, hidden size of 384 (~70M parameters)
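
These hyperparameters can be read off the hosted configuration without downloading the weights. A minimal sketch, assuming the checkpoint id `higopires/DeB3RTa-base` (the attributes are the standard DeBERTa-v2 config fields):

```python
from transformers import AutoConfig

# Fetch only the config.json of the checkpoint and print the architecture
config = AutoConfig.from_pretrained("higopires/DeB3RTa-base")
print(config.num_attention_heads, config.num_hidden_layers,
      config.intermediate_size, config.hidden_size)
```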

## Key Features

- First Portuguese financial domain-specific transformer model
- Mixed-domain pretraining incorporating finance, politics, business, and accounting texts
- Enhanced performance on financial NLP tasks compared to general-domain models
- Resource-efficient architecture with a strong performance-to-parameter ratio
- Advanced fine-tuning techniques, including layer reinitialization, mixout regularization, and layer-wise learning rate decay (a sketch of the decay scheme follows below)
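
Layer-wise learning rate decay assigns progressively smaller learning rates to layers closer to the embeddings, on the idea that lower layers encode more general features that should change less during fine-tuning. A minimal sketch of building such parameter groups for an optimizer; the decay factor, base rate, and grouping are illustrative, not the exact values used in the paper:

```python
import torch
from transformers import AutoModelForMaskedLM

def layerwise_lr_groups(model, base_lr=2e-5, decay=0.9, num_layers=12):
    """Parameter groups where encoder layer i gets base_lr * decay**(num_layers - 1 - i)."""
    groups = []
    for i in range(num_layers):
        layer_params = [p for n, p in model.named_parameters()
                        if f"encoder.layer.{i}." in n]
        # Layer 0 (closest to the embeddings) receives the smallest rate
        groups.append({"params": layer_params,
                       "lr": base_lr * decay ** (num_layers - 1 - i)})
    # Everything outside the encoder stack (embeddings, MLM head) stays at base_lr here
    rest = [p for n, p in model.named_parameters() if "encoder.layer." not in n]
    groups.append({"params": rest, "lr": base_lr})
    return groups

model = AutoModelForMaskedLM.from_pretrained("higopires/DeB3RTa-base")
optimizer = torch.optim.AdamW(layerwise_lr_groups(model))
```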

## Performance

The models have been evaluated on multiple financial-domain tasks:

| Task | Dataset | DeB3RTa-base F1 | DeB3RTa-small F1 |
|------|---------|-----------------|------------------|
| Fake News Detection | FAKE.BR | 0.9906 | 0.9598 |
| Sentiment Analysis | CAROSIA | 0.9207 | 0.8722 |
| Regulatory Classification | BBRC | 0.7609 | 0.6712 |
| Hate Speech Detection | OFFCOMBR-3 | 0.7539 | 0.5460 |
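
All four benchmarks are classification tasks, so the masked-LM checkpoint is loaded with a freshly initialized sequence-classification head before fine-tuning. A minimal sketch, assuming binary labels (e.g., fake vs. real news) and the base variant:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The classification head is newly initialized and must be trained on task data
model = AutoModelForSequenceClassification.from_pretrained(
    "higopires/DeB3RTa-base", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("higopires/DeB3RTa-base")
```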

## Training Data

The models were pretrained on a diverse corpus of 1.05 billion tokens, including:

- Financial market relevant facts (2003-2023)
- Financial patents (2006-2021)
- Research articles from the Brazilian SciELO library
- Financial news articles (1999-2023)
- Wikipedia articles in Portuguese

## Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer (replace the placeholder with "base" or "small")
model = AutoModelForMaskedLM.from_pretrained("higopires/DeB3RTa-[base/small]")
tokenizer = AutoTokenizer.from_pretrained("higopires/DeB3RTa-[base/small]")

# Score an example sentence with the masked-language-modeling head
text = "O mercado financeiro brasileiro apresentou [MASK] no último trimestre."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top-5 candidates for the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = outputs.logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```
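
For quick experiments, the same check can be run through the `fill-mask` pipeline; substitute the actual variant name for the placeholder:

```python
from transformers import pipeline

# Returns the top candidates for the [MASK] slot together with their scores
fill = pipeline("fill-mask", model="higopires/DeB3RTa-[base/small]")
for pred in fill("O mercado financeiro brasileiro apresentou [MASK] no último trimestre."):
    print(pred["token_str"], round(pred["score"], 4))
```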

## Citations

If you use this model in your research, please cite:

```bibtex
@article{pires2025deb3rta,
  AUTHOR = {Pires, Higo and Paucar, Leonardo and Carvalho, Joao Paulo},
  TITLE = {DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain},
  JOURNAL = {Big Data and Cognitive Computing},
  VOLUME = {9},
  YEAR = {2025},
  NUMBER = {3},
  ARTICLE-NUMBER = {51},
  URL = {https://www.mdpi.com/2504-2289/9/3/51},
  ISSN = {2504-2289},
  ABSTRACT = {The complex and specialized terminology of financial language in Portuguese-speaking markets create significant challenges for natural language processing (NLP) applications, which must capture nuanced linguistic and contextual information to support accurate analysis and decision-making. This paper presents DeB3RTa, a transformer-based model specifically developed through a mixed-domain pretraining strategy that combines extensive corpora from finance, politics, business management, and accounting to enable a nuanced understanding of financial language. DeB3RTa was evaluated against prominent models—including BERTimbau, XLM-RoBERTa, SEC-BERT, BusinessBERT, and GPT-based variants—and consistently achieved significant gains across key financial NLP benchmarks. To maximize adaptability and accuracy, DeB3RTa integrates advanced fine-tuning techniques such as layer reinitialization, mixout regularization, stochastic weight averaging, and layer-wise learning rate decay, which together enhance its performance across varied and high-stakes NLP tasks. These findings underscore the efficacy of mixed-domain pretraining in building high-performance language models for specialized applications. With its robust performance in complex analytical and classification tasks, DeB3RTa offers a powerful tool for advancing NLP in the financial sector and supporting nuanced language processing needs in Portuguese-speaking contexts.},
  DOI = {10.3390/bdcc9030051}
}
```

## Limitations

- Performance degrades on the smaller variant, particularly for hate speech detection
- May require task-specific fine-tuning for optimal performance
- Limited evaluation on multilingual financial tasks
- Behavior on very long documents (>128 tokens) has not been extensively tested

## License

MIT License

Copyright (c) 2025 Higo Pires

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Acknowledgments

This work was supported by the Instituto Federal de Educação, Ciência e Tecnologia do Maranhão and by the Human Language Technology Lab at the Instituto de Engenharia de Sistemas e Computadores—Investigação e Desenvolvimento (INESC-ID).