TinyStories training data, tokenized with the GPT-2 tokenizer.
Truncated at 512 tokens, no BOS token, padded with token id 50256 (the GPT-2 BOS token).
The truncated dataset was generated as follows:
```python
import torch
my_tokens = torch.load("train_tokens.pt")
# Number of real (non-pad) tokens per sequence
l = my_tokens['attention_mask'].sum(dim=-1)
# Keep only sequences with 170-180 real tokens
mask = ((170 <= l) & (l <= 180))
train_215k = {}
train_215k['input_ids'] = my_tokens['input_ids'][mask]
train_215k['attention_mask'] = my_tokens['attention_mask'][mask]
torch.save(train_215k, "train_tokens_215k.pt")
# Sanity check: in the file just saved, the number of pad tokens per row
# must equal the number of zeros in the attention mask
check = torch.load("train_tokens_215k.pt")
gap = (check['input_ids'] == 50256).sum(dim=-1)
gap2 = (check['attention_mask'] == 0).sum(dim=-1)
assert torch.all(gap == gap2)
```
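For reference, the same length filter can be exercised on synthetic data without the original `train_tokens.pt` file; the tensor layout below (a dict of `input_ids` and `attention_mask`, right-padded with id 50256) is an assumption matching the description above:

```python
import torch

pad_id = 50256  # GPT-2 pad token id, per the dataset description
n, max_len = 5, 512

# Build 5 synthetic sequences with known real lengths
lengths = torch.tensor([100, 172, 175, 180, 300])
input_ids = torch.full((n, max_len), pad_id)
attention_mask = torch.zeros(n, max_len, dtype=torch.long)
for i, length in enumerate(lengths):
    input_ids[i, :length] = torch.randint(0, pad_id, (length,))
    attention_mask[i, :length] = 1
tokens = {"input_ids": input_ids, "attention_mask": attention_mask}

# Same filter as in the snippet: keep sequences with 170-180 real tokens
l = tokens["attention_mask"].sum(dim=-1)
mask = (170 <= l) & (l <= 180)
subset = {k: v[mask] for k, v in tokens.items()}
print(subset["input_ids"].shape)  # -> torch.Size([3, 512])
```

The filter keeps only the three sequences with 172, 175, and 180 real tokens, and the pad count in each kept row still agrees with the zeros in its attention mask, mirroring the assertion in the generation script.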