TinyStories training data, tokenized with the GPT-2 tokenizer.
Truncated at 512 tokens, no BOS token, padded with token id 50256 (the GPT-2 `<|endoftext|>` token, which also serves as its BOS token).
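The script that produced the original `train_tokens.pt` is not included here; below is a minimal sketch of how the tokenization described above could be done with the HuggingFace `datasets` and `transformers` libraries. The dataset id `roneneldan/TinyStories` and the column name `text` are assumptions, not part of this card.

```python
# Hypothetical sketch of how train_tokens.pt could have been produced.
import torch
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 has no dedicated pad token; reuse <|endoftext|> (id 50256) for padding
tokenizer.pad_token = tokenizer.eos_token

stories = load_dataset("roneneldan/TinyStories", split="train")  # assumed dataset id
enc = tokenizer(
    stories["text"],          # assumed column name
    truncation=True,
    max_length=512,           # truncate at 512 tokens
    padding="max_length",     # pad with id 50256 up to 512 tokens
    return_tensors="pt",
)
torch.save(
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
    "train_tokens.pt",
)
```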
The truncated dataset was generated as follows:
```python
import torch

my_tokens = torch.load("train_tokens.pt")

# Number of real (non-padding) tokens per example
l = my_tokens['attention_mask'].sum(dim=-1)

# Keep only examples with 170 to 180 real tokens
mask = (170 <= l) & (l <= 180)

train_215k = {}
train_215k['input_ids'] = my_tokens['input_ids'][mask]
train_215k['attention_mask'] = my_tokens['attention_mask'][mask]
torch.save(train_215k, "train_tokens_215k.pt")

# Sanity check: the padding-token count (id 50256) must match the
# number of masked-out positions in every example
check = torch.load("train_tokens_215k.pt")
gap = (check['input_ids'] == 50256).sum(dim=-1)
gap2 = (check['attention_mask'] == 0).sum(dim=-1)
assert torch.all(gap == gap2)
```
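
The filtered file can be loaded directly with `torch.load`; a small usage sketch (file name as saved above, the second dimension of the tensors depends on the padding length used during tokenization):

```python
import torch

# Load the filtered split saved above
data = torch.load("train_tokens_215k.pt")
print(data["input_ids"].shape)       # (num_examples, padded_length)
print(data["attention_mask"].shape)  # same shape as input_ids

# By construction, every example has between 170 and 180 real tokens
lengths = data["attention_mask"].sum(dim=-1)
assert lengths.min().item() >= 170 and lengths.max().item() <= 180
```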