tags:
  - lookahead
  - lookback
  - language
---
# Model Card for NoLBERT: A Time-Stamped Pre-Trained LLM

NoLBERT (No Lookahead(back) bias Bidirectional Encoder Representations from Transformers) is a foundational transformer-based language model trained specifically to avoid both lookahead and lookback bias.

Lookahead bias is a fundamental challenge when researchers and practitioners use inferences from language models for forecasting. For example, when we ask a language model to infer the short-term return of a stock given a set of news articles, a concern is that the model may have been trained on data that include future information beyond the point in time when the news articles were released. As a result, the task changes from drawing return-related inferences from text to retrieving the date of the news articles and the realized returns of the particular stock shortly after that date. Such a model therefore becomes invalid in practice when predicting stock returns beyond the training data's coverage period. To frame the task as one of natural language inference, we pre-train a new text encoder using data strictly from 1976 to 1995. As a result, our model exhibits no lookahead bias when backtesting trading strategies on data from 1996 onward, or when performing other time-series forecasting tasks with text data.

Another key feature of our model is that it also avoids lookback bias. After pre-training, the numerical representation provided by any model reflects a snapshot in time (although the exact time may not be well-defined). For example, in the early 1900s the sentence "She is running a program" likely meant that the person was organizing an event; in the late 20th century, the same sentence most likely refers to someone executing computer code. Since a model learns its text representations from all of its training data, if that data spans a long time horizon it becomes unclear which period the final encoded vector represents. In this example, a model trained on data from the entire 20th century may produce representations that exhibit lookback bias when the intention is to analyze texts from more recent periods. To overcome this, we use a tightly restricted time window: all of our model's training data come from 1976 to 1995, and our validation set is strictly from 1996.

Our model is trained on 1 billion words (1-2 billion tokens) drawn from parliamentary Q&As, TV show conversations, music lyrics, patents, FOMC documents, public-access books, newspapers, election campaign documents, and research papers. It uses the base-size DeBERTa architecture together with a custom ByteLevelBPETokenizer trained on the same data.

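A tokenizer of this kind can be reproduced in spirit with the Hugging Face `tokenizers` library. The sketch below is illustrative only: the vocabulary size, special tokens, and the tiny in-memory corpus are assumptions, not the released training configuration.

```python
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus; NoLBERT's tokenizer was trained on the full
# 1976-1995 text collection, not these two sentences.
corpus = [
    "The committee voted to raise the federal funds rate.",
    "The patent describes a method for encoding text.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=30_000,  # assumption: matches the 30K vocabulary in the table below
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoding = tokenizer.encode("The day after Monday is Tuesday.")
print(encoding.tokens)
```

With a realistically sized corpus, the same two calls (`train_from_iterator` or `train` on files, then `encode`) produce a byte-level BPE vocabulary that never maps text to an unknown token.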
Our model achieves nearly state-of-the-art performance with less than 1% of the training data and the smallest model size among the models compared below.

| Model | Vocabulary (K) | Backbone #Params (M) | COLA | SST2 | QQP | MNLI | QNLI |
|-----------------|:--------------:|:--------------------:|:----:|:----:|:----:|:----:|:----:|
| ChronoBERT_1999 | 50 | 149 | 0.57 | 0.92 | 0.89 | 0.86 | 0.91 |
| FinBERT | 30 | 110 | 0.29 | 0.89 | 0.87 | 0.79 | 0.86 |
| StoriesLM | 30 | 110 | 0.47 | 0.90 | 0.87 | 0.80 | 0.87 |
| NoLBERT | 30 | 109 | 0.43 | 0.91 | 0.91 | 0.82 | 0.89 |

# Example Usage

## Masked-token prediction (transformers 4.50+)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Use the GPU if one is available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

checkpoint_path = "alikLab/NoLBERT"

model = AutoModelForMaskedLM.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)

text = "The day after Monday is<mask>."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits

# Find the position of the <mask> token.
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Logits at the mask position.
mask_logits = logits[0, mask_token_index, :].squeeze()

# Top 10 predictions with their probabilities.
top_10_token_ids = torch.topk(mask_logits, 10).indices
top_10_tokens = [tokenizer.decode(token_id) for token_id in top_10_token_ids]
top_10_probs = torch.softmax(mask_logits, dim=-1)[top_10_token_ids]

print("Top 10 most likely words:")
for i, (token, prob) in enumerate(zip(top_10_tokens, top_10_probs)):
    print(f"{i+1:2d}. {token:<12} (probability: {prob:.4f})")
```
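Beyond masked-token prediction, the encoder's hidden states can serve as time-stamped text embeddings for downstream forecasting tasks. The pooling step is sketched below with stand-in tensors; the `mean_pool` helper is illustrative, not part of the released code. With the real model, you would pass the last hidden state of the encoder together with `inputs.attention_mask`.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token vectors, ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1)
    return summed / counts

# Stand-in for model outputs: batch of 2 sentences, 4 tokens each, hidden size 3.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])  # second sentence is shorter

embeddings = mean_pool(hidden, mask)
print(embeddings.shape)  # torch.Size([2, 3])
```

Masking before averaging matters: without it, padding vectors would dilute the embeddings of shorter sentences in a batch.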

## Citation

If you use this model in your research, please cite:

76
+
77
+ ```
78
+ @misc{nolbert,
79
+ author = {Ali Kakhbod, Peiyao Li},
80
+ title = {NoLBert: A Time-Stamped Pre-Trained LLM},
81
+ year = {2025},
82
+ publisher = {Hugging Face},
83
+ journal = {Hugging Face Model Hub},
84
+ howpublished = {\url{https://huggingface.co/alikLab/NoLBERT}},
85
+ }
86
+ ```