FinancialModernBERT

A number-aware BERT model for financial document understanding, built on ModernBERT-base.

What this model does differently

Standard language models tokenize numbers as arbitrary subword pieces: "12,345" becomes tokens like "12", ",", "345", losing all numerical meaning. FinancialModernBERT solves this by:

  1. Number tagging: A preprocessing step wraps numbers in <number>...</number> tags
  2. Log-magnitude encoding: Each number is encoded as its log₁₀ magnitude (e.g. 1000 → 3.0) into a learned embedding via interpolated magnitude bins (sketched below)
  3. Dual prediction heads: MLM head for text tokens + magnitude head for number tokens, trained jointly
  4. Table-aware tokenization: HTML tables are linearized with structural delimiters ([TABLE_START], \t, \n, [TABLE_END])

The model handles magnitudes from 10⁻¹² to 10¹² (configurable).
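
To make the log-magnitude encoding concrete, here is a minimal sketch of the interpolated-bin idea; the bin layout, constants, and names are illustrative assumptions, not the library's actual internals:

import math

import torch.nn as nn

# Illustrative only: embed log10(value) as a linear interpolation
# between the two nearest learned magnitude-bin embeddings.
MAG_MIN, MAG_MAX, NUM_BINS, HIDDEN = -12.0, 12.0, 128, 768
bin_embeddings = nn.Embedding(NUM_BINS, HIDDEN)

def embed_magnitude(value: float):
    log_mag = math.log10(abs(value))                    # e.g. 1000 -> 3.0
    pos = (log_mag - MAG_MIN) / (MAG_MAX - MAG_MIN) * (NUM_BINS - 1)
    lo = max(0, min(int(pos), NUM_BINS - 2))            # lower of the two bins
    frac = pos - lo                                     # interpolation weight
    return (1 - frac) * bin_embeddings.weight[lo] + frac * bin_embeddings.weight[lo + 1]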

Installation

pip install git+https://huggingface.co/edereynal/financial_bert

Or clone and install:

git clone https://huggingface.co/edereynal/financial_bert
cd financial_bert
pip install -e .

Quick start

Preprocessing: tag numbers in your text

Before tokenizing, numbers in your text must be wrapped in <number> tags. Use the built-in tagger:

from financial_bert import tag_numbers_in_text

raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
tagged = tag_numbers_in_text(raw_text)
# "Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase."

Tokenization

from financial_bert import FinancialBertTokenizer

tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")

text = "Revenue was $<number>1234567</number> in Q3."
encoded = tokenizer(text, max_length=128)

# Returns dict with:
#   input_ids:      standard token IDs (numbers replaced with placeholder)
#   attention_mask:  1 for real tokens, 0 for padding
#   is_number_mask:  1 at number positions, 0 elsewhere
#   number_values:   log10(magnitude) at number positions, 0.0 elsewhere
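
As a quick sanity check, the number position in the example above should carry log₁₀(1234567) ≈ 6.09:

# 1 marks the <number> token; other positions are ordinary text tokens
print(encoded["is_number_mask"])
# ~6.09 (log10 of 1234567) at the number position, 0.0 elsewhere
print(encoded["number_values"])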

Loading the model

import torch
from huggingface_hub import hf_hub_download
from financial_bert import FinancialModernBert, FinancialModernBertConfig

config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
config.num_magnitude_bins = 128
model = FinancialModernBert(config)

# MLM pretrained weights (text + number prediction)
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))

# Or: CLS encoder weights (trained with an encoder/decoder bottleneck objective; better for embeddings)
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))

To build a fresh model from pretrained ModernBERT (no financial fine-tuning):

from financial_bert import build_model
model = build_model("answerdotai/ModernBERT-base")

MLM inference

import torch

tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
model.eval()

text = "Total assets of $<number>5000000</number> and liabilities of $<number>3000000</number>."
encoded = tokenizer(text, max_length=128)

with torch.no_grad():
    outputs = model(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )

# outputs["text_logits"]:      (batch, seq_len, vocab_size)
# outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)

CLS sentence embedding

The CLS token (position 0) captures a document-level representation. This is trained via a CLS-bottleneck encoder/decoder objective where the decoder reconstructs masked chunks from only the encoder's CLS embedding.

tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
model.eval()

text = "Revenue grew <number>25</number>% year-over-year to $<number>1500000</number>."
encoded = tokenizer(text, max_length=512)

with torch.no_grad():
    cls_embedding = model.get_cls_embedding(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )  # shape: (1, 768)

Use CLS embeddings for downstream tasks like classification, regression, or retrieval.
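
For retrieval, for example, two documents can be compared by cosine similarity of their L2-normalized CLS embeddings (the same setup as the benchmarks below); other_embedding here stands in for a second document's CLS embedding:

import torch.nn.functional as F

emb_a = F.normalize(cls_embedding, dim=-1)       # (1, 768), unit length
emb_b = F.normalize(other_embedding, dim=-1)     # CLS embedding of another document
cosine_sim = (emb_a * emb_b).sum(dim=-1)         # in [-1, 1]; higher = more similar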

Fine-tuning

MLM pre-training

The MLM pipeline trains all parameters (backbone, number embedder, and number head) jointly:

from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
import torch

# Build model (initialized from pretrained ModernBERT)
model = build_model("answerdotai/ModernBERT-base")
tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")

# Prepare a training example
text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
encoded = tokenizer(text, max_length=256)

# Create MLM labels (mask ~15% of tokens)
input_ids = encoded["input_ids"].clone()
is_number_mask = encoded["is_number_mask"]
number_values = encoded["number_values"]
attention_mask = encoded["attention_mask"]

# Random masking
mask_prob = 0.15
rand = torch.rand_like(input_ids, dtype=torch.float)
mask_positions = (rand < mask_prob) & (attention_mask == 1)
mask_positions[:, 0] = False  # don't mask CLS

# Text labels
labels_text = torch.full_like(input_ids, -100)
text_mask_positions = mask_positions & (is_number_mask == 0)
labels_text[text_mask_positions] = input_ids[text_mask_positions]
input_ids[text_mask_positions] = tokenizer.mask_token_id

# Number labels
labels_magnitude = torch.full_like(number_values, -100.0)
num_mask_positions = mask_positions & (is_number_mask == 1)
labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
number_values[num_mask_positions] = model.config.magnitude_max + 1.0  # out-of-range sentinel marking masked numbers
input_ids[num_mask_positions] = tokenizer.mask_token_id

# Forward pass
outputs = model(
    input_ids=input_ids,
    number_values=number_values,
    is_number_mask=is_number_mask,
    attention_mask=attention_mask,
    labels_text=labels_text,
    labels_magnitude=labels_magnitude,
)

loss = outputs["loss"]  # combined text CE + magnitude bin loss
loss.backward()
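
A full run wraps this in a standard optimizer loop; a minimal sketch, assuming a dataloader that yields dicts of the tensors built above (the learning rate is illustrative):

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for batch in dataloader:  # assumed: yields input_ids, number_values, masks, labels
    outputs = model(**batch)
    outputs["loss"].backward()
    optimizer.step()
    optimizer.zero_grad()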

Classification / regression head

import torch.nn as nn

class FinancialClassifier(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, number_values, is_number_mask, attention_mask):
        cls = self.encoder.get_cls_embedding(
            input_ids, number_values, is_number_mask, attention_mask
        )
        return self.head(cls)

classifier = FinancialClassifier(encoder=model, num_classes=3)
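
Training the head then follows standard practice; a minimal sketch reusing an encoded example from above, with an illustrative label:

criterion = nn.CrossEntropyLoss()
logits = classifier(
    encoded["input_ids"], encoded["number_values"],
    encoded["is_number_mask"], encoded["attention_mask"],
)
loss = criterion(logits, torch.tensor([1]))  # illustrative label, batch of one
loss.backward()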

Benchmarks

Numeracy ordering (29 test groups)

Each test group has three structurally identical sentences differing only in numerical magnitude (low, mid, high), with a tight ~5x spread within the same unit (e.g. $74.1M / $192.8M / $381.5M). Includes prose statements (dollar amounts, percentages, ratios, per-share figures) and HTML financial tables (income statements, balance sheets, cash flow, per-share data).

  • Hard pass: d(low,mid) < d(low,high) AND d(mid,high) < d(low,high), i.e. mid lies between low and high in embedding space
  • Soft pass: avg(d(low,mid), d(mid,high)) < d(low,high)

Distance metric: MSE on raw (unnormalized) CLS embeddings.
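
Scoring a single group under these criteria is straightforward; a sketch, where low, mid, and high are the raw CLS embeddings of the three sentences:

def mse(a, b):
    return ((a - b) ** 2).mean().item()

d_lm, d_mh, d_lh = mse(low, mid), mse(mid, high), mse(low, high)
hard_pass = d_lm < d_lh and d_mh < d_lh        # mid sits between low and high
soft_pass = (d_lm + d_mh) / 2 < d_lh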

Model            Hard          Soft
CLS (enc/dec)    17/29 (59%)   24/29 (83%)
ModernBERT-base  11/29 (38%)   13/29 (45%)
BGE-base-v1.5    10/29 (34%)   15/29 (52%)

The CLS encoder/decoder model preserves numerical ordering in its embeddings even at tight magnitude spreads. ModernBERT-base and BGE-base-v1.5 both fall to near chance (a random embedding passes the hard criterion about 1/3 of the time), confirming that the enc/dec training objective gives the model genuine magnitude sensitivity beyond what the pretrained backbone or a general-purpose embedding model provides.

Semantic retrieval (20 query-match pairs)

Each query is a financial statement with specific numbers; each match is a paraphrase with rounded/restated figures. All 20 matches form the distractor pool. Metric: Recall@1 using cosine similarity on L2-normalized CLS embeddings.
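
In code the metric reduces to the following sketch, assuming queries and matches are (20, 768) matrices of L2-normalized CLS embeddings in which row i of matches is the true paraphrase of query i:

import torch

sims = queries @ matches.T                     # (20, 20) cosine similarity matrix
top1 = sims.argmax(dim=-1)                     # best-scoring match per query
recall_at_1 = (top1 == torch.arange(len(queries))).float().mean().item()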

Model            Recall@1   MRR
BGE-base-v1.5    20/20      1.000
CLS (enc/dec)    14/20      0.770
ModernBERT-base  1/20       0.207

The CLS encoder/decoder objective gives the model strong semantic matching ability (14/20 Recall@1) compared to the untrained backbone (1/20), though it does not match a purpose-built embedding model like BGE.

Architecture details

Component       Description
Backbone        ModernBERT-base (149M params, 8192-token context, RoPE, Flash Attention)
NumberEmbedder  129 magnitude bins (128 + mask), interpolated embeddings
NumberHead      Gated projection → LayerNorm → linear to magnitude bins
PredictionHead  Dense → GELU → LayerNorm → tied decoder (standard MLM head)

License

Apache 2.0
