tags:
  - lookahead
  - lookback
  - language
---
# Model Card for NoLBERT: A Time-Stamped Pre-Trained LLM

NoLBERT (No Lookahead(back) bias Bidirectional Encoder Representations from Transformers) is a foundational transformer-based language model trained specifically to avoid both lookahead and lookback bias.

Lookahead bias is a fundamental challenge when researchers and practitioners use inferences from language models for forecasting. For example, when we ask a language model to infer the short-term return of a stock given a set of news articles, a concern is that the model may have been trained on data that include future information beyond the point in time when the news articles were released. As a result, the task changes from drawing return-related inferences from text to retrieving the date of the news articles and the realized returns of the particular stock shortly after that date. Such a model therefore becomes invalid in practice when predicting stock returns beyond the training data's coverage period. To frame the task as one of natural language inference, we pre-train a new text encoder using data strictly from 1976 to 1995. As a result, our model exhibits no lookahead bias when backtesting trading strategies on data from 1996 onward, or when performing other time-series forecasting tasks with text data.

Another key feature of our model is that it also avoids lookback bias. After pre-training, the numerical representation provided by any model reflects a snapshot in time (although the exact time may not be well-defined). For example, in the early 1900s the sentence "She is running a program" likely meant that the person was organizing an event; in the late 20th century, the same sentence most likely refers to someone executing computer code. Since a model learns its text representations from all of its training data, if that data spans a long time horizon it becomes unclear which period the final encoded vector represents. In this example, a model trained on data from the entire 20th century may produce representations that exhibit lookback bias when the intention is to analyze texts from more recent periods. To overcome this, we use a tightly restricted time window: all of our model's training data come from 1976 to 1995, and our validation set is strictly from 1996.

Our model is trained on 1 billion words (1-2 billion tokens) drawn from parliamentary Q&As, TV show conversations, music lyrics, patents, FOMC documents, public-access books, newspapers, election campaign documents, and research papers. It uses the base-size DeBERTa architecture together with a custom ByteLevelBPETokenizer trained on the same data.

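A tokenizer of this kind can be reproduced in spirit with the Hugging Face `tokenizers` library. The sketch below is illustrative only: the vocabulary size, special tokens, and the tiny in-memory corpus are assumptions, not the released training configuration.

```python
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus; NoLBERT's tokenizer was trained on the full
# 1976-1995 text collection, not these two sentences.
corpus = [
    "The committee voted to raise the federal funds rate.",
    "The patent describes a method for encoding text.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=30_000,  # assumption: matches the 30K vocabulary in the table below
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoding = tokenizer.encode("The day after Monday is Tuesday.")
print(encoding.tokens)
```

With a realistically sized corpus, the same two calls (`train_from_iterator` or `train` on files, then `encode`) produce a byte-level BPE vocabulary that never maps text to an unknown token.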
Our model achieves nearly state-of-the-art performance with less than 1% of the training data and the smallest model size among the models compared below.

| Model | Vocabulary (K) | Backbone #Params (M) | COLA | SST2 | QQP | MNLI | QNLI |
|-----------------|:--------------:|:--------------------:|:----:|:----:|:----:|:----:|:----:|
| ChronoBERT_1999 | 50 | 149 | 0.57 | 0.92 | 0.89 | 0.86 | 0.91 |
| FinBERT | 30 | 110 | 0.29 | 0.89 | 0.87 | 0.79 | 0.86 |
| StoriesLM | 30 | 110 | 0.47 | 0.90 | 0.87 | 0.80 | 0.87 |
| NoLBERT | 30 | 109 | 0.43 | 0.91 | 0.91 | 0.82 | 0.89 |

# Example Usage

## Masked-token prediction (transformers 4.50+)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Use the GPU if one is available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

checkpoint_path = "alikLab/NoLBERT"

model = AutoModelForMaskedLM.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)

text = "The day after Monday is<mask>."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits

# Find the position of the <mask> token.
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Logits at the mask position.
mask_logits = logits[0, mask_token_index, :].squeeze()

# Top 10 predictions with their probabilities.
top_10_token_ids = torch.topk(mask_logits, 10).indices
top_10_tokens = [tokenizer.decode(token_id) for token_id in top_10_token_ids]
top_10_probs = torch.softmax(mask_logits, dim=-1)[top_10_token_ids]

print("Top 10 most likely words:")
for i, (token, prob) in enumerate(zip(top_10_tokens, top_10_probs)):
    print(f"{i+1:2d}. {token:<12} (probability: {prob:.4f})")
```
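Beyond masked-token prediction, the encoder's hidden states can serve as time-stamped text embeddings for downstream forecasting tasks. The pooling step is sketched below with stand-in tensors; the `mean_pool` helper is illustrative, not part of the released code. With the real model, you would pass the last hidden state of the encoder together with `inputs.attention_mask`.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token vectors, ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1)
    return summed / counts

# Stand-in for model outputs: batch of 2 sentences, 4 tokens each, hidden size 3.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])  # second sentence is shorter

embeddings = mean_pool(hidden, mask)
print(embeddings.shape)  # torch.Size([2, 3])
```

Masking before averaging matters: without it, padding vectors would dilute the embeddings of shorter sentences in a batch.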

## Citation

If you use this model in your research, please cite:

76
+
77
+ ```
78
+ @misc{nolbert,
79
+ author = {Ali Kakhbod, Peiyao Li},
80
+ title = {NoLBert: A Time-Stamped Pre-Trained LLM},
81
+ year = {2025},
82
+ publisher = {Hugging Face},
83
+ journal = {Hugging Face Model Hub},
84
+ howpublished = {\url{https://huggingface.co/alikLab/NoLBERT}},
85
+ }
86
+ ```