FinancialModernBERT
A number-aware BERT model for financial document understanding, built on ModernBERT-base.
What this model does differently
Standard language models tokenize numbers as arbitrary subword pieces ("12,345" becomes tokens like "12", ",", "345"), losing all numerical meaning. FinancialModernBERT solves this by:
- Number tagging: A preprocessing step wraps numbers in <number>...</number> tags
- Log-magnitude encoding: Each number is encoded as its log₁₀ magnitude (e.g. 1000 → 3.0) into a learned embedding via interpolated magnitude bins
- Dual prediction heads: MLM head for text tokens + magnitude head for number tokens, trained jointly
- Table-aware tokenization: HTML tables are linearized with structural delimiters ([TABLE_START], \t, \n, [TABLE_END])
The model handles magnitudes from 10⁻¹² to 10¹² (configurable).
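For intuition, here is a minimal sketch of the log-magnitude mapping (the model handles binning internally; the clamping range below just mirrors the default configurable range, and treating zero as the lower bound is an assumption made for this sketch):
import math

def log_magnitude(value, mag_min=-12.0, mag_max=12.0):
    # Map a number to its log10 magnitude, clamped to the supported range.
    if value == 0:
        return mag_min  # assumption: zero maps to the lower bound
    return max(mag_min, min(mag_max, math.log10(abs(value))))

print(log_magnitude(1000))       # 3.0
print(log_magnitude(1_234_567))  # ~6.09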
Installation
pip install git+https://huggingface.co/edereynal/financial_bert
Or clone and install:
git clone https://huggingface.co/edereynal/financial_bert
cd financial_bert
pip install -e .
Quick start
Preprocessing: tag numbers in your text
Before tokenizing, numbers in your text must be wrapped in <number> tags. Use the built-in tagger:
from financial_bert import tag_numbers_in_text
raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
tagged = tag_numbers_in_text(raw_text)
# "Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase."
Tokenization
from financial_bert import FinancialBertTokenizer
tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
text = "Revenue was $<number>1234567</number> in Q3."
encoded = tokenizer(text, max_length=128)
# Returns dict with:
# input_ids: standard token IDs (numbers replaced with placeholder)
# attention_mask: 1 for real tokens, 0 for padding
# is_number_mask: 1 at number positions, 0 elsewhere
# number_values: log10(magnitude) at number positions, 0.0 elsewhere
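For example, you can recover the log₁₀ magnitudes from the encoding via the mask (assuming, as in the examples below, that the tokenizer returns batched PyTorch tensors):
is_number = encoded["is_number_mask"].bool()
print(encoded["number_values"][is_number])  # tensor([6.0915]) -- log10(1234567)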
Loading the model
import torch
from huggingface_hub import hf_hub_download
from financial_bert import FinancialModernBert, FinancialModernBertConfig
config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
config.num_magnitude_bins = 128
model = FinancialModernBert(config)
# MLM pretrained weights (text + number prediction)
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
# Or: CLS encoder weights (trained with an encoder/decoder bottleneck objective; better for embeddings)
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
To build a fresh model from pretrained ModernBERT (no financial fine-tuning):
from financial_bert import build_model
model = build_model("answerdotai/ModernBERT-base")
MLM inference
import torch
tokenizer = FinancialBertTokenizer()
model.eval()
text = "Total assets of $<number>5000000</number> and liabilities of $<number>3000000</number>."
encoded = tokenizer(text, max_length=128)
with torch.no_grad():
    outputs = model(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )
# outputs["text_logits"]: (batch, seq_len, vocab_size)
# outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)
CLS sentence embedding
The CLS token (position 0) captures a document-level representation. This is trained via a CLS-bottleneck encoder/decoder objective where the decoder reconstructs masked chunks from only the encoder's CLS embedding.
tokenizer = FinancialBertTokenizer()
model.eval()
text = "Revenue grew <number>25</number>% year-over-year to $<number>1500000</number>."
encoded = tokenizer(text, max_length=512)
with torch.no_grad():
    cls_embedding = model.get_cls_embedding(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )  # shape: (1, 768)
Use CLS embeddings for downstream tasks like classification, regression, or retrieval.
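For example, a minimal retrieval sketch using cosine similarity on L2-normalized CLS embeddings (the same setup as the retrieval benchmark below); embed_texts is a helper written here for illustration, not part of the library:
import torch
import torch.nn.functional as F
from financial_bert import tag_numbers_in_text

def embed_texts(texts, model, tokenizer, max_length=512):
    # Illustrative helper: one L2-normalized CLS embedding per input text.
    embeddings = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(tag_numbers_in_text(text), max_length=max_length)
            cls = model.get_cls_embedding(
                input_ids=enc["input_ids"],
                number_values=enc["number_values"],
                is_number_mask=enc["is_number_mask"],
                attention_mask=enc["attention_mask"],
            )
            embeddings.append(F.normalize(cls, dim=-1))
    return torch.cat(embeddings, dim=0)

query = "Quarterly revenue of roughly $1.5 million, up about 25% from last year."
docs = [
    "Revenue grew 25% year-over-year to $1,500,000.",
    "Operating expenses declined 3% to $400,000.",
]
scores = embed_texts([query], model, tokenizer) @ embed_texts(docs, model, tokenizer).T
best_match = scores.argmax(dim=-1)  # index of the most similar document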
Fine-tuning
MLM pre-training
The MLM pipeline trains all parameters (backbone, number embedder, and number head) jointly:
from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
import torch
# Build model (initialized from pretrained ModernBERT)
model = build_model("answerdotai/ModernBERT-base")
tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
# Prepare a training example
text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
encoded = tokenizer(text, max_length=256)
# Create MLM labels (mask ~15% of tokens)
input_ids = encoded["input_ids"].clone()
is_number_mask = encoded["is_number_mask"]
number_values = encoded["number_values"]
attention_mask = encoded["attention_mask"]
# Random masking
mask_prob = 0.15
rand = torch.rand_like(input_ids, dtype=torch.float)
mask_positions = (rand < mask_prob) & (attention_mask == 1)
mask_positions[:, 0] = False # don't mask CLS
# Text labels
labels_text = torch.full_like(input_ids, -100)
text_mask_positions = mask_positions & (is_number_mask == 0)
labels_text[text_mask_positions] = input_ids[text_mask_positions]
input_ids[text_mask_positions] = tokenizer.mask_token_id
# Number labels
labels_magnitude = torch.full_like(number_values, -100.0)
num_mask_positions = mask_positions & (is_number_mask == 1)
labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
number_values[num_mask_positions] = model.config.magnitude_max + 1.0 # sentinel
input_ids[num_mask_positions] = tokenizer.mask_token_id
# Forward pass
outputs = model(
    input_ids=input_ids,
    number_values=number_values,
    is_number_mask=is_number_mask,
    attention_mask=attention_mask,
    labels_text=labels_text,
    labels_magnitude=labels_magnitude,
)
loss = outputs["loss"] # combined text CE + magnitude bin loss
loss.backward()
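Wrapping this in a training loop is plain PyTorch; a minimal sketch (the dataloader and hyperparameters here are placeholders, not the settings used for the released checkpoints):
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

model.train()
for batch in dataloader:  # yields dicts with the keys built above (input_ids, ..., labels_magnitude)
    outputs = model(**batch)
    outputs["loss"].backward()
    optimizer.step()
    optimizer.zero_grad()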
Classification / regression head
import torch.nn as nn
class FinancialClassifier(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, number_values, is_number_mask, attention_mask):
        cls = self.encoder.get_cls_embedding(
            input_ids, number_values, is_number_mask, attention_mask
        )
        return self.head(cls)
classifier = FinancialClassifier(encoder=model, num_classes=3)
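A minimal fine-tuning step for the classifier might look like this (train_loader and its labels are placeholders for your own data pipeline):
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

classifier.train()
for batch, labels in train_loader:  # batch: tokenizer output dict, labels: (batch_size,) class IDs
    logits = classifier(
        input_ids=batch["input_ids"],
        number_values=batch["number_values"],
        is_number_mask=batch["is_number_mask"],
        attention_mask=batch["attention_mask"],
    )
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()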
Benchmarks
Numeracy ordering (29 test groups)
Each test group has three structurally identical sentences differing only in numerical magnitude (low, mid, high), with a tight ~5x spread within the same unit (e.g. $74.1M / $192.8M / $381.5M). Includes prose statements (dollar amounts, percentages, ratios, per-share figures) and HTML financial tables (income statements, balance sheets, cash flow, per-share data).
- Hard pass: d(low,mid) < d(low,high) AND d(mid,high) < d(low,high) → mid is between low and high in embedding space
- Soft pass: avg(d(low,mid), d(mid,high)) < d(low,high)
Distance metric: MSE on raw (unnormalized) CLS embeddings.
| Model | Hard | Soft |
|---|---|---|
| CLS (enc/dec) | 17/29 (59%) | 24/29 (83%) |
| ModernBERT-base | 11/29 (38%) | 13/29 (45%) |
| BGE-base-v1.5 | 10/29 (34%) | 15/29 (52%) |
The CLS encoder/decoder model preserves numerical ordering in its embeddings even at tight magnitude spreads. ModernBERT-base and BGE-base-v1.5 both fall to near-chance, confirming that the enc/dec training objective gives the model genuine magnitude sensitivity beyond what the pretrained backbone or a general embedding model provides.
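For reference, a minimal sketch of the hard/soft pass checks described above, where emb_low, emb_mid, and emb_high stand for the raw CLS embeddings of the three sentence variants:
def mse(a, b):
    return ((a - b) ** 2).mean().item()

d_lm = mse(emb_low, emb_mid)
d_mh = mse(emb_mid, emb_high)
d_lh = mse(emb_low, emb_high)

hard_pass = d_lm < d_lh and d_mh < d_lh            # mid sits between low and high
soft_pass = (d_lm + d_mh) / 2 < d_lh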
Semantic retrieval (20 query-match pairs)
Each query is a financial statement with specific numbers; each match is a paraphrase with rounded/restated figures. All 20 matches form the distractor pool. Metric: Recall@1 using cosine similarity on L2-normalized CLS embeddings.
| Model | Recall@1 | MRR |
|---|---|---|
| BGE-base-v1.5 | 20/20 | 1.000 |
| CLS (enc/dec) | 14/20 | 0.770 |
| ModernBERT-base | 1/20 | 0.207 |
The CLS encoder/decoder objective gives the model strong semantic matching ability (14/20 Recall@1) compared to the untrained backbone (1/20), though it does not match a purpose-built embedding model like BGE.
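For completeness, a small sketch of how Recall@1 and MRR can be computed, assuming query_embs and match_embs are L2-normalized CLS embeddings stacked row-wise so that row i of one corresponds to row i of the other:
import torch

sims = query_embs @ match_embs.T                 # cosine similarities, (num_queries, num_matches)
top1 = sims.argmax(dim=-1)
recall_at_1 = (top1 == torch.arange(len(sims))).float().mean()

# Mean reciprocal rank of the true match for each query
order = sims.argsort(dim=-1, descending=True)
rank = (order == torch.arange(len(sims)).unsqueeze(1)).float().argmax(dim=-1) + 1
mrr = (1.0 / rank).mean()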
Architecture details
| Component | Description |
|---|---|
| Backbone | ModernBERT-base (149M params, 8192 token context, RoPE, Flash Attention) |
| NumberEmbedder | 129 magnitude bins (128 + mask), interpolated embeddings |
| NumberHead | Gated projection → LayerNorm → linear to magnitude bins |
| PredictionHead | Dense → GELU → LayerNorm → tied decoder (standard MLM head) |
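The "interpolated embeddings" in the NumberEmbedder blend the two nearest bin embeddings according to where the log₁₀ magnitude falls between bin centers. A rough sketch of the idea (illustrative only; the names, uniform bin spacing, and omission of the mask bin are assumptions, not the actual implementation):
import torch
import torch.nn as nn

class InterpolatedMagnitudeEmbedding(nn.Module):
    # Sketch: embed a log10 magnitude by linearly interpolating between
    # the two nearest of num_bins learned bin embeddings.
    def __init__(self, num_bins=128, hidden_size=768, mag_min=-12.0, mag_max=12.0):
        super().__init__()
        self.bins = nn.Embedding(num_bins, hidden_size)
        self.mag_min, self.mag_max, self.num_bins = mag_min, mag_max, num_bins

    def forward(self, log10_values):
        # Continuous bin position in [0, num_bins - 1]
        pos = (log10_values - self.mag_min) / (self.mag_max - self.mag_min) * (self.num_bins - 1)
        pos = pos.clamp(0, self.num_bins - 1)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=self.num_bins - 1)
        frac = (pos - lo.float()).unsqueeze(-1)
        return (1 - frac) * self.bins(lo) + frac * self.bins(hi)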
License
Apache 2.0