---
license: mit
pipeline_tag: fill-mask
library_name: transformers
tags:
- lookahead
- lookback
- language
---
# NoLBERT: A Time-Stamped Pre-Trained LLM
**NoLBERT** (*No Lookahead(back) bias bidirectional encoder representation from transformers*) is a foundational transformer-based language model trained on a small, time-restricted dataset to avoid both <span style='color: blue;'>Lookahead</span> and <span style='color: #ff8c00;'>Lookback</span> bias. To keep the model accessible even on personal machines, we adopt the DeBERTaV3-base architecture, which has a relatively small number of trainable parameters yet performs well on linguistic benchmarks.
**<span style='color: blue;'>Lookahead</span> bias** is a fundamental challenge when researchers and practitioners use inferences from language models for forecasting. For example, when we ask a language model to infer the short-term return of a stock given a set of news articles, a concern is that the model may have been trained on data that includes future information beyond the point in time when the news articles were released. As a result, the nature of the task changes from drawing return-related inference from text to retrieving the date of the news articles and the realized returns of the particular stock shortly after that date. Consequently, this approach becomes invalid in practice when using such models to predict stock returns beyond the training data's coverage period. To frame the task as one of natural language inference, we pre-train a new text encoder using data strictly *from 1976 to 1995*. Therefore, our model exhibits no lookahead bias when backtesting trading strategies using data from 1996 onward or when performing other time series forecasting tasks using text data.
Another key feature of our model is that it also avoids **<span style='color: #ff8c00;'>Lookback</span> bias**. After pre-training, the numerical representation produced by any model reflects a snapshot in time (although the exact time may not be well-defined). For example, in the early 1900s, the sentence “She is running a program” likely meant that the person was organizing an event; in the late 20th century, the same sentence most likely refers to someone executing computer code. Because a model forms its text representations from all of its training data, if that data spans a long time horizon it becomes unclear which period the final encoded vector represents. In this example, a model trained on data from the entire 20th century may exhibit lookback bias when the intention is to analyze texts from more recent periods. To overcome this, we use a highly restricted time window: all of our model's training data are *from 1976 to 1995*, and our validation set is strictly from 1996.
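To make the time-window discipline concrete, the sketch below shows one way to split a document collection so that it mirrors NoLBERT's setup: training text strictly from 1976-1995, validation text from 1996, and any backtest restricted to 1996 onward. The pandas-based code and column names are illustrative assumptions, not part of the paper.
```python
import pandas as pd

# Illustrative corpus with one publication date per document (column names are assumptions)
docs = pd.DataFrame({
    "date": pd.to_datetime(["1980-05-01", "1994-11-23", "1996-02-14", "2001-07-09"]),
    "text": ["...", "...", "...", "..."],
})

# NoLBERT's pre-training window: strictly 1976-1995; validation strictly from 1996
train_docs = docs[(docs["date"].dt.year >= 1976) & (docs["date"].dt.year <= 1995)]
valid_docs = docs[docs["date"].dt.year == 1996]

# Any backtest or forecasting exercise using the model should only score documents dated 1996 or later
backtest_docs = docs[docs["date"].dt.year >= 1996]
```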
Our model is trained on 1 billion words (1-2 billion tokens) drawn from Parliament Q&As, TV show conversations, music lyrics, patents, FOMC documents, public-access books, newspapers, election campaign documents, and research papers. The model uses the base-size DeBERTa architecture and a custom ByteLevelBPETokenizer trained on the same corpus.
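As a point of reference, the snippet below is a minimal sketch of training a ByteLevelBPETokenizer with the Hugging Face `tokenizers` library on a time-restricted corpus. The file name, vocabulary size, and special tokens are illustrative assumptions, not the exact script used for NoLBERT.
```python
from tokenizers import ByteLevelBPETokenizer

# Illustrative corpus file and settings; not the exact NoLBERT training script.
corpus_files = ["corpus_1976_1995.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=30_000,  # ~30K vocabulary, in line with the table below
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt for later use with a transformers tokenizer
tokenizer.save_model("nolbert-tokenizer")
```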
Our model achieves state-of-the-art performance while using less than 10% of the training data of comparable models.
| Model | Vocabulary (K) | Backbone #Params (M) | CoLA | SST-2 | QQP | MNLI | QNLI |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FinBERT   | 30 | 110 | 0.29 | 0.89 | 0.87 | 0.79 | 0.86 |
| StoriesLM | 30 | 110 | **0.47** | 0.90 | 0.87 | 0.80 | 0.87 |
| NoLBERT   | 30 | 109 | 0.43 | **0.91** | **0.91** | **0.82** | **0.89** |
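The benchmark columns above are standard GLUE tasks, so the comparison amounts to fine-tuning each encoder per task. Below is a minimal sketch of such a run on SST-2 with the `transformers` Trainer; the hyperparameters are illustrative assumptions, not the authors' exact evaluation configuration.
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

checkpoint_path = "alikLab/NoLBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)
# Adds a fresh classification head on top of the pre-trained encoder
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path, num_labels=2)

# SST-2: binary sentiment classification over single sentences
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nolbert-sst2",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)

trainer.train()
print(trainer.evaluate())
```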
## Usage Examples
### Masked Language Modeling
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
# Using GPU
device = 'cuda:0'
checkpoint_path = "alikLab/NoLBERT"
model = AutoModelForMaskedLM.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)
text = "The day after Monday is<mask>."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Get the index of [MASK]
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
# Get logits for the mask position
mask_logits = logits[0, mask_token_index, :].squeeze()
# Get top 10 predictions
top_10_token_ids = torch.topk(mask_logits, 10).indices
top_10_tokens = [tokenizer.decode(token_id) for token_id in top_10_token_ids]
top_10_probs = torch.softmax(mask_logits, dim=-1)[top_10_token_ids]
print("Top 10 most likely words:")
for i, (token, prob) in enumerate(zip(top_10_tokens, top_10_probs)):
    print(f"{i+1:2d}. {token:<12} (probability: {prob:.4f})")
```
### Getting Text Embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Using GPU
device = 'cuda:0'
checkpoint_path = "alikLab/NoLBERT"
# Use AutoModel instead of AutoModelForMaskedLM to get embeddings
model = AutoModel.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)
text = "The day after Monday is Tuesday."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)
# Get the hidden states
last_hidden_states = outputs.last_hidden_state
# Method 1: Use [CLS] token embedding (first token)
cls_embedding = last_hidden_states[0, 0, :] # Shape: [hidden_size]
# Method 2: Mean pooling over all tokens (excluding padding)
attention_mask = inputs['attention_mask']
masked_embeddings = last_hidden_states * attention_mask.unsqueeze(-1)
mean_embedding = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Mean pooled embedding shape: {mean_embedding.shape}")
print(f"Text: {text}")
print(f"Embedding (first 10 dimensions): {cls_embedding[:10].tolist()}")
```
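As a follow-on usage example (continuing from the snippet above, so `model`, `tokenizer`, `device`, and `torch` are already in scope), mean-pooled embeddings can be compared with cosine similarity, for instance to gauge how close two sentences are in NoLBERT's representation space. The example sentences are illustrative.
```python
import torch.nn.functional as F

def embed(text):
    # Mean-pooled sentence embedding, reusing the model/tokenizer loaded above
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

emb_a = embed("The central bank raised interest rates.")
emb_b = embed("The Federal Reserve tightened monetary policy.")
print("Cosine similarity:", F.cosine_similarity(emb_a, emb_b).item())
```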
See our [paper](https://arxiv.org/abs/2509.01110) for more details (NeurIPS 2025 Gen AI in Finance).
## Citation
If you use this model in your research, please cite:
```
@misc{nolbert,
  author       = {Kakhbod, Ali and Li, Peiyao},
  title        = {NoLBERT: A No Lookahead(back) Foundational Language Model},
  year         = {2025},
  note         = {NeurIPS 2025 (GenAI in Finance)},
  howpublished = {\url{https://huggingface.co/alikLab/NoLBERT}},
}
```