	---
	language: en
	license: apache-2.0
	library_name: transformers
	tags:
	- financial
	- numbers
	- modernbert
	- mlm
	base_model: answerdotai/ModernBERT-base
	---

	# FinancialModernBERT

	A number-aware BERT model for financial document understanding, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

	## What this model does differently

	Standard language models tokenize numbers as arbitrary subword pieces ("12,345" becomes tokens like "12", ",", "345"), which discards most of the information about the number's magnitude. FinancialModernBERT addresses this with:

	1. Number tagging: A preprocessing step wraps numbers in `<number>...</number>` tags
	2. Log-magnitude encoding: Each number is converted to its log₁₀ magnitude (e.g. 1000 → 3.0) and mapped to a learned embedding by interpolating between magnitude bins
	3. Dual prediction heads: MLM head for text tokens + magnitude head for number tokens, trained jointly
	4. Table-aware tokenization: HTML tables are linearized with structural delimiters (`[TABLE_START]`, `\t`, `\n`, `[TABLE_END]`)
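
To illustrate item 4, here is a minimal sketch of table linearization using only the standard library. The `linearize_table` helper and its parser class are hypothetical names, not part of `financial_bert`, and the placement of newlines around the delimiters is an assumption. The package's own handling of nested tables, `colspan`, or header cells may differ.

```python
from html.parser import HTMLParser


class _TableLinearizer(HTMLParser):
    """Collects cell text row by row (hypothetical helper, not the package's code)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def linearize_table(html: str) -> str:
    """Join cells with tabs and rows with newlines, wrapped in table delimiters."""
    parser = _TableLinearizer()
    parser.feed(html)
    parser.close()
    body = "\n".join("\t".join(row) for row in parser.rows)
    # Assumption: one newline separates the delimiters from the table body.
    return f"[TABLE_START]\n{body}\n[TABLE_END]"


html = (
    "<table><tr><th>Item</th><th>2023</th></tr>"
    "<tr><td>Revenue</td><td>1,500</td></tr></table>"
)
print(linearize_table(html))
```

In a full pipeline the cell text would presumably also pass through number tagging before tokenization.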

	The model handles magnitudes from 10⁻¹² to 10¹² (configurable).
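
The log-magnitude encoding can be sketched in plain Python. This is an illustration under stated assumptions, not the package's `NumberEmbedder`: it assumes 128 bins spaced evenly over the default range [-12, 12], takes the absolute value of negative numbers, and maps zero to the lowest bin. The function names are hypothetical.

```python
import math


def log_magnitude(value: float, lo: float = -12.0, hi: float = 12.0) -> float:
    """log10 of |value|, clamped to [lo, hi]; zero maps to lo (an assumption)."""
    if value == 0:
        return lo
    return min(max(math.log10(abs(value)), lo), hi)


def interpolated_bins(log_mag: float, num_bins: int = 128,
                      lo: float = -12.0, hi: float = 12.0):
    """Return (lower_bin, upper_bin, upper_weight) for a log-magnitude.

    With an embedding table E, the interpolated embedding would be
    (1 - upper_weight) * E[lower_bin] + upper_weight * E[upper_bin].
    """
    position = (log_mag - lo) / (hi - lo) * (num_bins - 1)
    lower = min(int(math.floor(position)), num_bins - 2)
    return lower, lower + 1, position - lower


print(log_magnitude(1000))     # 3.0
print(interpolated_bins(3.0))  # (79, 80, 0.375)
```

Interpolating between the two neighboring bins lets nearby magnitudes share most of their embedding, instead of jumping abruptly at bin boundaries.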

	## Installation

	```bash
	pip install git+https://huggingface.co/edereynal/financial_bert
	```

	Or clone and install:

	```bash
	git clone https://huggingface.co/edereynal/financial_bert
	cd financial_bert
	pip install -e .
	```

	## Quick start

	### Preprocessing: tag numbers in your text

	Before tokenizing, numbers in your text must be wrapped in `<number>` tags. Use the built-in tagger:

	```python
	from financial_bert import tag_numbers_in_text

	raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
	tagged = tag_numbers_in_text(raw_text)
	# "Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase."
	```
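
For intuition about what the tagging step does, here is a rough regex-based sketch. It is not the implementation of `tag_numbers_in_text`, which may treat dates, negative values, or scientific notation differently. The name `tag_numbers_sketch` and the pattern are assumptions made for illustration.

```python
import re

# Digits with optional thousands separators and decimals, not glued to letters.
_NUMBER = re.compile(r"(?<![\w.])(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?!\w)")


def tag_numbers_sketch(text: str) -> str:
    """Wrap plain numbers in <number> tags, dropping thousands separators."""
    return _NUMBER.sub(
        lambda m: f"<number>{m.group(0).replace(',', '')}</number>", text
    )


print(tag_numbers_sketch("Revenue increased to $1,234,567 from $987,654, a 25% increase."))
```

The lookbehind and lookahead keep digits that are part of an identifier such as "Q3" untagged in this sketch.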

	### Tokenization

	```python
	from financial_bert import FinancialBertTokenizer

	tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")

	text = "Revenue was $<number>1234567</number> in Q3."
	encoded = tokenizer(text, max_length=128)

	# Returns dict with:
	# input_ids: standard token IDs (numbers replaced with placeholder)
	# attention_mask: 1 for real tokens, 0 for padding
	# is_number_mask: 1 at number positions, 0 elsewhere
	# number_values: log10(magnitude) at number positions, 0.0 elsewhere
	```
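
The toy example below shows how the number-related fields line up position by position. It splits on whitespace instead of using the real subword tokenizer, and the `[NUM]` placeholder and `toy_encode` name are hypothetical. The actual `FinancialBertTokenizer` output has the same fields, but its token boundaries, placeholder ID, and treatment of zero will differ.

```python
import math
import re

_TAGGED = re.compile(r"<number>(.*?)</number>")


def toy_encode(tagged_text: str):
    """Whitespace-level stand-in for the tokenizer's number handling."""
    tokens, is_number_mask, number_values = [], [], []
    # re.split with one capture group alternates: text, number, text, ...
    for i, piece in enumerate(_TAGGED.split(tagged_text)):
        if i % 2 == 1:  # captured number string
            value = float(piece)
            tokens.append("[NUM]")  # hypothetical placeholder token
            is_number_mask.append(1)
            # Assumption: zero is stored as 0.0 because log10(0) is undefined.
            number_values.append(math.log10(abs(value)) if value else 0.0)
        else:
            for word in piece.split():
                tokens.append(word)
                is_number_mask.append(0)
                number_values.append(0.0)
    return tokens, is_number_mask, number_values


tokens, mask, values = toy_encode("Revenue was $<number>1234567</number> in Q3.")
for row in zip(tokens, mask, values):
    print(row)
```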

	### Loading the model

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from financial_bert import FinancialModernBert, FinancialModernBertConfig

	config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
	config.num_magnitude_bins = 128
	model = FinancialModernBert(config)

	# MLM pretrained weights (text + number prediction)
	weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
	model.load_state_dict(torch.load(weights_path, map_location="cpu"))

	# Or: CLS encoder weights (trained with encoder/decoder bottleneck objective — better for embeddings)
	weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
	model.load_state_dict(torch.load(weights_path, map_location="cpu"))
	```

	To build a fresh model from pretrained ModernBERT (no financial fine-tuning):

	```python
	from financial_bert import build_model
	model = build_model("answerdotai/ModernBERT-base")
	```

	### MLM inference

	```python
	import torch

	tokenizer = FinancialBertTokenizer()
	model.eval()

	text = "Total assets of $<number>5000000</number> and liabilities of $<number>3000000</number>."
	encoded = tokenizer(text, max_length=128)

	with torch.no_grad():
	    outputs = model(
	        input_ids=encoded["input_ids"],
	        number_values=encoded["number_values"],
	        is_number_mask=encoded["is_number_mask"],
	        attention_mask=encoded["attention_mask"],
	    )

	# outputs["text_logits"]: (batch, seq_len, vocab_size)
	# outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)
	```
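
One way to turn the magnitude logits at a number position into an estimate is to take the softmax-weighted average of the bin centers. The sketch below does this in plain Python. It assumes bins are evenly spaced over [-12, 12] and that the logits cover only the magnitude bins (if the head also emits a logit for the mask bin, drop it first). `decode_magnitude` is a hypothetical helper, not part of the package.

```python
import math


def decode_magnitude(logits, lo: float = -12.0, hi: float = 12.0):
    """Turn one position's magnitude logits into (log10 estimate, value).

    Assumes bin i is centered at lo + i * (hi - lo) / (len(logits) - 1).
    """
    num_bins = len(logits)
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(weights)
    centers = [lo + i * (hi - lo) / (num_bins - 1) for i in range(num_bins)]
    expected_log = sum(w / total * c for w, c in zip(weights, centers))
    return expected_log, 10.0 ** expected_log


# Five toy bins centered at 0, 1, 2, 3, 4 with nearly all mass on the middle one.
log_estimate, value = decode_magnitude([0.0, 0.0, 100.0, 0.0, 0.0], lo=0.0, hi=4.0)
print(log_estimate, value)
```

With real model output, the input would be something like `outputs["magnitude_logits"][0, position].tolist()` for a position where `is_number_mask` is 1.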

	### CLS sentence embedding

	The CLS token (position 0) captures a document-level representation. This is trained via a CLS-bottleneck encoder/decoder objective where the decoder reconstructs masked chunks from only the encoder's CLS embedding.

	```python
	tokenizer = FinancialBertTokenizer()
	model.eval()

	text = "Revenue grew <number>25</number>% year-over-year to $<number>1500000</number>."
	encoded = tokenizer(text, max_length=512)

	with torch.no_grad():
	    cls_embedding = model.get_cls_embedding(
	        input_ids=encoded["input_ids"],
	        number_values=encoded["number_values"],
	        is_number_mask=encoded["is_number_mask"],
	        attention_mask=encoded["attention_mask"],
	    )  # shape: (1, 768)
	```

	Use CLS embeddings for downstream tasks like classification, regression, or retrieval.

	## Fine-tuning

	### MLM pre-training

	The MLM pipeline trains all parameters — backbone, number embedder, and number head — jointly:

	```python
	from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
	import torch

	# Build model (initialized from pretrained ModernBERT)
	model = build_model("answerdotai/ModernBERT-base")
	tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")

	# Prepare a training example
	text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
	encoded = tokenizer(text, max_length=256)

	# Create MLM labels (mask ~15% of tokens)
	input_ids = encoded["input_ids"].clone()
	is_number_mask = encoded["is_number_mask"]
	number_values = encoded["number_values"].clone()  # cloned: masked positions are overwritten below
	attention_mask = encoded["attention_mask"]

	# Random masking
	mask_prob = 0.15
	rand = torch.rand_like(input_ids, dtype=torch.float)
	mask_positions = (rand < mask_prob) & (attention_mask == 1)
	mask_positions[:, 0] = False # don't mask CLS

	# Text labels
	labels_text = torch.full_like(input_ids, -100)
	text_mask_positions = mask_positions & (is_number_mask == 0)
	labels_text[text_mask_positions] = input_ids[text_mask_positions]
	input_ids[text_mask_positions] = tokenizer.mask_token_id

	# Number labels
	labels_magnitude = torch.full_like(number_values, -100.0)
	num_mask_positions = mask_positions & (is_number_mask == 1)
	labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
	number_values[num_mask_positions] = model.config.magnitude_max + 1.0 # sentinel
	input_ids[num_mask_positions] = tokenizer.mask_token_id

	# Forward pass
	outputs = model(
	    input_ids=input_ids,
	    number_values=number_values,
	    is_number_mask=is_number_mask,
	    attention_mask=attention_mask,
	    labels_text=labels_text,
	    labels_magnitude=labels_magnitude,
	)

	loss = outputs["loss"] # combined text CE + magnitude bin loss
	loss.backward()
	```

	### Classification / regression head

	```python
	import torch.nn as nn

	class FinancialClassifier(nn.Module):
	    def __init__(self, encoder, num_classes):
	        super().__init__()
	        self.encoder = encoder
	        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

	    def forward(self, input_ids, number_values, is_number_mask, attention_mask):
	        cls = self.encoder.get_cls_embedding(
	            input_ids, number_values, is_number_mask, attention_mask
	        )
	        return self.head(cls)

	model = FinancialClassifier(encoder=model, num_classes=3)
	```

	## Benchmarks

	### Numeracy ordering (29 test groups)

	Each test group has three structurally identical sentences differing only in numerical magnitude (low, mid, high), with a tight ~5x spread within the same unit (e.g. $74.1M / $192.8M / $381.5M). Includes prose statements (dollar amounts, percentages, ratios, per-share figures) and HTML financial tables (income statements, balance sheets, cash flow, per-share data).

	- Hard pass: d(low,mid) < d(low,high) AND d(mid,high) < d(low,high) — mid is between low and high in embedding space
	- Soft pass: avg(d(low,mid), d(mid,high)) < d(low,high)

	Distance metric: MSE on raw (unnormalized) CLS embeddings.

	| Model | Hard | Soft |
	|---|---|---|
	| CLS (enc/dec) | 17/29 (59%) | 24/29 (83%) |
	| ModernBERT-base | 11/29 (38%) | 13/29 (45%) |
	| BGE-base-v1.5 | 10/29 (34%) | 15/29 (52%) |

	The CLS encoder/decoder model preserves numerical ordering in its embeddings more often than either baseline, even at tight magnitude spreads. ModernBERT-base and BGE-base-v1.5 both land near chance on the hard test (if the three pairwise distances were interchangeable, d(low,high) would be the largest about one time in three). This suggests that the enc/dec training objective adds magnitude sensitivity beyond what the pretrained backbone or a general-purpose embedding model provides.
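
The hard and soft criteria defined above can be written out directly. The sketch below uses short Python lists as stand-ins for CLS embeddings; the function names are illustrative and this is not the benchmark script itself.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ordering_pass(low, mid, high):
    """Return (hard_pass, soft_pass) for one low/mid/high embedding triple."""
    d_lm, d_mh, d_lh = mse(low, mid), mse(mid, high), mse(low, high)
    hard = d_lm < d_lh and d_mh < d_lh
    soft = (d_lm + d_mh) / 2 < d_lh
    return hard, soft


# Toy 2-d "embeddings": mid lies between low and high, so both tests pass.
print(ordering_pass([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]))   # (True, True)
# Mid sits just outside the low end: the hard test fails, the soft test passes.
print(ordering_pass([0.0, 0.0], [-0.5, 0.0], [3.0, 0.0]))  # (False, True)
```

The soft criterion is deliberately more forgiving: it tolerates one of the two inner distances exceeding d(low,high) as long as their average does not.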

	### Semantic retrieval (20 query-match pairs)

	Each query is a financial statement with specific numbers; each match is a paraphrase with rounded/restated figures. Each query is ranked against all 20 matches, so the 19 non-matching paraphrases act as distractors. Metrics: Recall@1 and mean reciprocal rank (MRR), using cosine similarity on L2-normalized CLS embeddings.

	| Model | Recall@1 | MRR |
	|---|---|---|
	| BGE-base-v1.5 | 20/20 | 1.000 |
	| CLS (enc/dec) | 14/20 | 0.770 |
	| ModernBERT-base | 1/20 | 0.207 |

	The CLS encoder/decoder objective gives the model considerably better semantic matching (14/20 Recall@1) than the ModernBERT-base backbone without financial training (1/20), though it does not match a purpose-built embedding model like BGE (20/20).
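
For reference, Recall@1 and MRR over a set of query/match embedding pairs can be computed as in the sketch below. It uses plain lists in place of real CLS embeddings and resolves ties in favor of the true match, which is an assumption about the evaluation rather than a documented detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_1_and_mrr(queries, matches):
    """queries[i] should retrieve matches[i]; every match is a candidate."""
    hits, reciprocal_ranks = 0, []
    for i, query in enumerate(queries):
        scores = [cosine(query, match) for match in matches]
        # Rank = 1 + number of candidates scoring strictly higher than the true match.
        rank = 1 + sum(score > scores[i] for score in scores)
        hits += rank == 1
        reciprocal_ranks.append(1.0 / rank)
    return hits / len(queries), sum(reciprocal_ranks) / len(reciprocal_ranks)


# Three toy pairs: the true match ranks 2nd, 1st, and 3rd respectively.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
recall, mrr = recall_at_1_and_mrr(queries, matches)
print(recall, mrr)
```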

	## Architecture details

	| Component | Description |
	|---|---|
	| Backbone | ModernBERT-base (149M params, 8192 token context, RoPE, Flash Attention) |
	| NumberEmbedder | 129 magnitude bins (128 + mask), interpolated embeddings |
	| NumberHead | Gated projection → LayerNorm → linear to magnitude bins |
	| PredictionHead | Dense → GELU → LayerNorm → tied decoder (standard MLM head) |

	## License

	Apache 2.0