Commit 494adaf (verified) by davidquarel · 1 Parent(s): 8a0caf0

Create README.md

  1. README.md +26 -0
README.md ADDED
TinyStories training data, tokenized with the GPT-2 tokenizer.

Sequences are truncated at 512 tokens, have no BOS token prepended, and are padded with token id 50256 (`<|endoftext|>`, the token GPT-2 also uses as its BOS token).
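
As a minimal sketch of the layout described above, the hypothetical helper `pad_or_truncate` below (not part of the original preprocessing code) shows how a list of token ids could be cut to 512 tokens and padded with id 50256, together with the matching attention mask. Right-padding is an assumption here; the README does not say which side is padded.

```python
PAD_ID = 50256   # GPT-2 <|endoftext|>, reused here as the padding token
MAX_LEN = 512    # truncation length stated in this README


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both of length max_len.

    Hypothetical helper for illustration only; assumes right-padding.
    """
    kept = list(token_ids[:max_len])   # truncate sequences that are too long
    n_pad = max_len - len(kept)        # number of padding slots still needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Four arbitrary example ids, chosen only for the demonstration.
ids, mask = pad_or_truncate([101, 102, 103, 104])
print(len(ids), sum(mask), ids[:6])
```

Summing the attention mask gives the number of real tokens in a sequence, which is the quantity the length filter in this README relies on.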

The reduced dataset `train_tokens_215k.pt` keeps only sequences that are 170 to 180 tokens long and was generated as follows:

```python
import torch

# Load the full tokenized training set (a dict of tensors).
my_tokens = torch.load("train_tokens.pt")

# Number of real (non-padding) tokens in each sequence.
lengths = my_tokens['attention_mask'].sum(dim=-1)
# Keep only sequences that are 170 to 180 tokens long (inclusive).
mask = (170 <= lengths) & (lengths <= 180)

train_215k = {}
train_215k['input_ids'] = my_tokens['input_ids'][mask]
train_215k['attention_mask'] = my_tokens['attention_mask'][mask]

torch.save(train_215k, "train_tokens_215k.pt")

# Sanity check: reload the saved file and confirm that every row contains
# as many padding tokens (id 50256) as it has masked-out positions.
check = torch.load("train_tokens_215k.pt")

gap = (check['input_ids'] == 50256).sum(dim=-1)
gap2 = (check['attention_mask'] == 0).sum(dim=-1)

assert torch.all(gap == gap2)
```
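
For readers who want to consume the saved file, the following sketch shows one way to recover the unpadded token sequences from the `input_ids` / `attention_mask` pair. It runs on a small toy batch rather than the real file, and `unpadded_sequences` is a hypothetical helper name, not something shipped with the dataset. Only the dictionary keys are taken from the README.

```python
import torch

PAD_ID = 50256  # padding id used by this dataset

# Toy stand-in for the saved dict: 3 sequences padded to length 6.
batch = {
    "input_ids": torch.tensor([
        [11, 12, 13, PAD_ID, PAD_ID, PAD_ID],
        [21, 22, 23, 24, 25, PAD_ID],
        [31, 32, 33, 34, PAD_ID, PAD_ID],
    ]),
    "attention_mask": torch.tensor([
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 0, 0],
    ]),
}


def unpadded_sequences(tokens):
    """Yield each sequence with its padding positions removed."""
    for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"]):
        # Boolean indexing keeps only positions where the mask is 1.
        yield ids[mask.bool()]


lengths = batch["attention_mask"].sum(dim=-1)
stories = list(unpadded_sequences(batch))
print(lengths.tolist())
print([s.tolist() for s in stories])
```

With the real data, `batch` would presumably be replaced by the result of `torch.load("train_tokens_215k.pt")`, in which case every recovered sequence should be between 170 and 180 tokens long.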