---
language: en
license: apache-2.0
library_name: transformers
tags:
- financial
- numbers
- modernbert
- mlm
base_model: answerdotai/ModernBERT-base
---
# FinancialModernBERT

A number-aware BERT model for financial document understanding, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

## What this model does differently

Standard language models tokenize numbers as arbitrary subword pieces: "12,345" becomes tokens like "12", ",", "345", losing their numerical meaning. FinancialModernBERT addresses this with:

1. **Number tagging**: A preprocessing step wraps numbers in `<number>...</number>` tags
2. **Log-magnitude encoding**: Each number is converted to its log₁₀ magnitude (e.g. 1000 → 3.0) and mapped to a learned embedding by interpolating between magnitude bins
3. **Dual prediction heads**: An MLM head for text tokens and a magnitude head for number tokens, trained jointly
4. **Table-aware tokenization**: HTML tables are linearized with structural delimiters (`[TABLE_START]`, `\t`, `\n`, `[TABLE_END]`)
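
To make the table linearization in item 4 concrete, here is a minimal sketch built on Python's standard `html.parser`. It is not the package's implementation: the class `TableLinearizer`, the function `linearize_table`, and the exact placement of the delimiters are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TableLinearizer(HTMLParser):
    """Hypothetical helper: collects cell text from an HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell is a single clean string
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def linearize_table(html: str) -> str:
    """Join cells with tabs and rows with newlines, wrapped in table delimiters."""
    parser = TableLinearizer()
    parser.feed(html)
    parser.close()
    body = "\n".join("\t".join(row) for row in parser.rows)
    # Assumption: delimiters sit on their own lines; the real layout may differ
    return "[TABLE_START]\n" + body + "\n[TABLE_END]"


html = "<table><tr><th>Item</th><th>2023</th></tr><tr><td>Revenue</td><td>1,500</td></tr></table>"
print(repr(linearize_table(html)))
# '[TABLE_START]\nItem\t2023\nRevenue\t1,500\n[TABLE_END]'
```

The sketch ignores nested tables, `colspan`/`rowspan`, and number tagging inside cells, all of which a production linearizer would need to handle.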

The model handles magnitudes from 10⁻¹² to 10¹² (configurable).
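
The log-magnitude encoding can be illustrated with a small pure-Python sketch. The source does not specify the bin layout, so this assumes 128 bin centers spaced evenly over log₁₀ magnitudes from -12 to 12 and linear interpolation between the two nearest bins; `magnitude_to_bins` and `interpolate_embedding` are hypothetical names, and the handling of zero and of sign is left out because it is not described.

```python
import math


def magnitude_to_bins(value, num_bins=128, mag_min=-12.0, mag_max=12.0):
    """Map a nonzero number to its two neighboring bins and their weights.

    Assumes evenly spaced bin centers over [mag_min, mag_max] (illustrative only).
    """
    log_mag = math.log10(abs(value))
    log_mag = min(max(log_mag, mag_min), mag_max)  # clamp to the supported range
    # Fractional bin position in [0, num_bins - 1]
    pos = (log_mag - mag_min) / (mag_max - mag_min) * (num_bins - 1)
    lower = min(int(math.floor(pos)), num_bins - 2)
    upper = lower + 1
    w_upper = pos - lower
    return lower, upper, 1.0 - w_upper, w_upper


def interpolate_embedding(value, table):
    """Blend the embeddings of the two neighboring bins (table: list of vectors)."""
    lower, upper, w_lo, w_hi = magnitude_to_bins(value, num_bins=len(table))
    return [w_lo * a + w_hi * b for a, b in zip(table[lower], table[upper])]


print(magnitude_to_bins(1000))
# (79, 80, 0.625, 0.375)
```

With this layout, 1000 has log₁₀ magnitude 3.0, which falls at fractional position 79.375, so its embedding would be 62.5% of bin 79 plus 37.5% of bin 80. Interpolation keeps the embedding continuous in the magnitude, rather than jumping at bin boundaries.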
## Installation

```bash
pip install git+https://huggingface.co/edereynal/financial_bert
```

Or clone and install:

```bash
git clone https://huggingface.co/edereynal/financial_bert
cd financial_bert
pip install -e .
```

## Quick start

### Preprocessing: tag numbers in your text

Before tokenizing, numbers in your text must be wrapped in `<number>` tags. Use the built-in tagger:

```python
from financial_bert import tag_numbers_in_text

raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
tagged = tag_numbers_in_text(raw_text)
# "Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase."
```
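
For readers curious about what such a tagging step involves, the following is a rough regex-based approximation. It is not the code behind `tag_numbers_in_text`: the function name `tag_numbers_sketch`, the pattern, and the choice to skip digits attached to letters (such as "Q3") are all assumptions.

```python
import re

# Thousands-separated or plain integers, with an optional decimal part.
# The lookarounds skip digits glued to letters (e.g. "Q3"); this is an
# illustrative choice, not necessarily what the real tagger does.
_NUMBER_RE = re.compile(r"(?<![\w.])(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?!\w)")


def tag_numbers_sketch(text: str) -> str:
    """Wrap each matched number in <number> tags, dropping thousands separators."""

    def _wrap(match):
        digits = match.group(0).replace(",", "")
        return f"<number>{digits}</number>"

    return _NUMBER_RE.sub(_wrap, text)


raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
print(tag_numbers_sketch(raw_text))
# Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase.
```

The sketch does not cover negative numbers in parentheses, scale words such as "million", or scientific notation, so prefer the built-in tagger for real data.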
### Tokenization

```python
from financial_bert import FinancialBertTokenizer

tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
text = "Revenue was $<number>1234567</number> in Q3."
encoded = tokenizer(text, max_length=128)
# Returns dict with:
#   input_ids: standard token IDs (numbers replaced with placeholder)
#   attention_mask: 1 for real tokens, 0 for padding
#   is_number_mask: 1 at number positions, 0 elsewhere
#   number_values: log10(magnitude) at number positions, 0.0 elsewhere
```
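
The relationship between the parallel outputs can be shown with a toy encoder. This sketch splits on whitespace instead of using the real subword tokenizer, and the placeholder string `[NUM]`, the function `toy_encode`, and the treatment of zero are assumptions made purely for illustration.

```python
import math
import re

NUMBER_TAG = re.compile(r"<number>(.*?)</number>")


def toy_encode(tagged_text):
    """Toy stand-in for the tokenizer: whitespace tokens plus aligned number arrays."""
    tokens, is_number_mask, number_values = [], [], []

    def add_text(chunk):
        for word in chunk.split():
            tokens.append(word)
            is_number_mask.append(0)
            number_values.append(0.0)

    pos = 0
    for match in NUMBER_TAG.finditer(tagged_text):
        add_text(tagged_text[pos:match.start()])
        value = float(match.group(1))
        tokens.append("[NUM]")  # hypothetical placeholder token
        is_number_mask.append(1)
        # log10 of the magnitude; mapping zero to 0.0 is an arbitrary toy choice
        number_values.append(math.log10(abs(value)) if value != 0 else 0.0)
        pos = match.end()
    add_text(tagged_text[pos:])

    return {
        "tokens": tokens,
        "is_number_mask": is_number_mask,
        "number_values": number_values,
    }


encoded = toy_encode("Revenue was $<number>1000</number> in Q3.")
print(encoded["tokens"])          # ['Revenue', 'was', '$', '[NUM]', 'in', 'Q3.']
print(encoded["is_number_mask"])  # [0, 0, 0, 1, 0, 0]
print(encoded["number_values"])   # [0.0, 0.0, 0.0, 3.0, 0.0, 0.0]
```

The point is the alignment: every position carries a mask bit and a value, and only positions flagged in `is_number_mask` hold a meaningful log-magnitude.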
### Loading the model

```python
import torch
from huggingface_hub import hf_hub_download
from financial_bert import FinancialModernBert, FinancialModernBertConfig

config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
config.num_magnitude_bins = 128
model = FinancialModernBert(config)

# Option 1: MLM pretrained weights (text + number prediction)
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))

# Option 2: CLS encoder weights (trained with the encoder/decoder bottleneck objective; better for embeddings).
# Loading these replaces the MLM weights, so uncomment only if you want this checkpoint instead.
# weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
# model.load_state_dict(torch.load(weights_path, map_location="cpu"))
```

To build a fresh model from pretrained ModernBERT (no financial fine-tuning):

```python
from financial_bert import build_model

model = build_model("answerdotai/ModernBERT-base")
```
### MLM inference

```python
import torch
from financial_bert import FinancialBertTokenizer

# `model` is the FinancialModernBert instance from "Loading the model"
tokenizer = FinancialBertTokenizer()
model.eval()

text = "Total assets of $<number>5000000</number> and liabilities of $<number>3000000</number>."
encoded = tokenizer(text, max_length=128)

with torch.no_grad():
    outputs = model(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )

# outputs["text_logits"]: (batch, seq_len, vocab_size)
# outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)
```
### CLS sentence embedding

The CLS token (position 0) captures a document-level representation. It is trained with a CLS-bottleneck encoder/decoder objective in which the decoder reconstructs masked chunks using only the encoder's CLS embedding.

```python
import torch
from financial_bert import FinancialBertTokenizer

# `model` is the FinancialModernBert instance from "Loading the model"
tokenizer = FinancialBertTokenizer()
model.eval()

text = "Revenue grew <number>25</number>% year-over-year to $<number>1500000</number>."
encoded = tokenizer(text, max_length=512)

with torch.no_grad():
    cls_embedding = model.get_cls_embedding(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )  # shape: (1, 768)
```

Use CLS embeddings for downstream tasks like classification, regression, or retrieval.
## Fine-tuning

### MLM pre-training

The MLM pipeline trains all parameters (backbone, number embedder, and number head) jointly:

```python
from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
import torch

# Build model (initialized from pretrained ModernBERT)
model = build_model("answerdotai/ModernBERT-base")
tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")

# Prepare a training example
text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
encoded = tokenizer(text, max_length=256)

# Create MLM labels (mask ~15% of tokens)
# Clone the tensors that are overwritten at masked positions below
input_ids = encoded["input_ids"].clone()
is_number_mask = encoded["is_number_mask"]
number_values = encoded["number_values"].clone()
attention_mask = encoded["attention_mask"]

# Random masking
mask_prob = 0.15
rand = torch.rand_like(input_ids, dtype=torch.float)
mask_positions = (rand < mask_prob) & (attention_mask == 1)
mask_positions[:, 0] = False  # don't mask CLS

# Text labels
labels_text = torch.full_like(input_ids, -100)
text_mask_positions = mask_positions & (is_number_mask == 0)
labels_text[text_mask_positions] = input_ids[text_mask_positions]
input_ids[text_mask_positions] = tokenizer.mask_token_id

# Number labels
labels_magnitude = torch.full_like(number_values, -100.0)
num_mask_positions = mask_positions & (is_number_mask == 1)
labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
number_values[num_mask_positions] = model.config.magnitude_max + 1.0  # sentinel
input_ids[num_mask_positions] = tokenizer.mask_token_id

# Forward pass
outputs = model(
    input_ids=input_ids,
    number_values=number_values,
    is_number_mask=is_number_mask,
    attention_mask=attention_mask,
    labels_text=labels_text,
    labels_magnitude=labels_magnitude,
)
loss = outputs["loss"]  # combined text CE + magnitude bin loss
loss.backward()
```
### Classification / regression head

```python
import torch.nn as nn


class FinancialClassifier(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, number_values, is_number_mask, attention_mask):
        cls = self.encoder.get_cls_embedding(
            input_ids, number_values, is_number_mask, attention_mask
        )
        return self.head(cls)


model = FinancialClassifier(encoder=model, num_classes=3)
```
## Benchmarks

### Numeracy ordering (29 test groups)

Each test group has three structurally identical sentences that differ only in numerical magnitude (low, mid, high), with a tight ~5x spread within the same unit (e.g. $74.1M / $192.8M / $381.5M). The groups include prose statements (dollar amounts, percentages, ratios, per-share figures) and HTML financial tables (income statements, balance sheets, cash flow, per-share data).

- **Hard pass**: d(low,mid) < d(low,high) AND d(mid,high) < d(low,high), i.e. mid lies between low and high in embedding space
- **Soft pass**: avg(d(low,mid), d(mid,high)) < d(low,high)

Distance metric: MSE on raw (unnormalized) CLS embeddings.
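
The two pass criteria translate directly into code. The sketch below applies them to plain Python lists standing in for CLS embeddings; the function names `mse` and `ordering_pass` are illustrative, not part of the package.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ordering_pass(low, mid, high):
    """Return (hard_pass, soft_pass) for one low/mid/high embedding triple."""
    d_lm = mse(low, mid)
    d_mh = mse(mid, high)
    d_lh = mse(low, high)
    hard = d_lm < d_lh and d_mh < d_lh
    soft = (d_lm + d_mh) / 2 < d_lh
    return hard, soft


# Mid sits between low and high: both criteria hold
print(ordering_pass([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))  # (True, True)
# Mid overshoots on the low side: only the averaged criterion holds
print(ordering_pass([0.0], [-0.5], [2.0]))                # (False, True)
```

A hard pass always implies a soft pass, since two distances that are each below d(low,high) also have an average below it.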

| Model | Hard | Soft |
|---|---|---|
| **CLS (enc/dec)** | **17/29 (59%)** | **24/29 (83%)** |
| ModernBERT-base | 11/29 (38%) | 13/29 (45%) |
| BGE-base-v1.5 | 10/29 (34%) | 15/29 (52%) |

The CLS encoder/decoder model preserves numerical ordering in its embeddings even at tight magnitude spreads. ModernBERT-base and BGE-base-v1.5 both fall to near-chance, confirming that the enc/dec training objective gives the model genuine magnitude sensitivity beyond what the pretrained backbone or a general embedding model provides.

### Semantic retrieval (20 query-match pairs)

Each query is a financial statement with specific numbers; each match is a paraphrase with rounded/restated figures. All 20 matches form the distractor pool. Metrics: Recall@1 and mean reciprocal rank (MRR), using cosine similarity on L2-normalized CLS embeddings.

| Model | Recall@1 | MRR |
|---|---|---|
| BGE-base-v1.5 | **20/20** | **1.000** |
| **CLS (enc/dec)** | **14/20** | **0.770** |
| ModernBERT-base | 1/20 | 0.207 |

The CLS encoder/decoder objective gives the model strong semantic matching ability (14/20 Recall@1) compared to the ModernBERT-base backbone without this training (1/20), though it does not match a purpose-built embedding model like BGE.
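
For reference, Recall@1 and MRR over a shared candidate pool can be computed as in the sketch below. It uses plain Python lists in place of CLS embeddings; `retrieval_metrics` is a hypothetical helper, and ranking ties optimistically is an assumption, since the benchmark's tie handling is not stated.

```python
import math


def cosine(a, b):
    """Cosine similarity; equivalent to a dot product of L2-normalized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieval_metrics(queries, matches):
    """queries[i] should retrieve matches[i]; every match is a candidate."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for i, query in enumerate(queries):
        scores = [cosine(query, match) for match in matches]
        # Rank of the correct match: 1 + number of candidates scoring strictly higher
        rank = 1 + sum(1 for s in scores if s > scores[i])
        if rank == 1:
            hits += 1
        reciprocal_rank_sum += 1.0 / rank
    n = len(queries)
    return hits / n, reciprocal_rank_sum / n


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
matches = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
recall_at_1, mrr = retrieval_metrics(queries, matches)
# The third query ranks its own match second, behind matches[0]
print(round(recall_at_1, 3), round(mrr, 3))  # 0.667 0.833
```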

## Architecture details

| Component | Description |
|---|---|
| **Backbone** | ModernBERT-base (149M params, 8192 token context, RoPE, Flash Attention) |
| **NumberEmbedder** | 129 magnitude bins (128 + mask), interpolated embeddings |
| **NumberHead** | Gated projection → LayerNorm → linear to magnitude bins |
| **PredictionHead** | Dense → GELU → LayerNorm → tied decoder (standard MLM head) |

## License

Apache 2.0