davidquarel
/

advint_data


davidquarel committed on Apr 16, 2025

Commit

494adaf

verified

1 Parent(s): 8a0caf0

Create README.md

Files changed (1)

README.md +26 -0

README.md ADDED

	@@ -0,0 +1,26 @@

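+TinyStories training data, tokenized with the GPT-2 tokenizer.
+Sequences are truncated at 512 tokens with no BOS token prepended, and padded with token id 50256 (GPT-2's `<|endoftext|>` token, which serves as both its BOS and EOS token).
+The truncated dataset `train_tokens_215k.pt`, which keeps only sequences of 170 to 180 non-padding tokens (inclusive), was generated as follows: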
+```python
+import torch
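+my_tokens = torch.load("train_tokens.pt")  # dict with 'input_ids' and 'attention_mask' tensors
+lengths = my_tokens['attention_mask'].sum(dim=-1)  # non-padding tokens per sequence
+mask = (170 <= lengths) & (lengths <= 180)  # keep sequences of 170 to 180 tokens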
+train_215k = {}
+train_215k['input_ids'] = my_tokens['input_ids'][mask]
+train_215k['attention_mask'] = my_tokens['attention_mask'][mask]
+torch.save(train_215k, "train_tokens_215k.pt")
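+check = torch.load("train_tokens_215k.pt")  # reload the file saved above
+gap = (check['input_ids'] == 50256).sum(dim=-1)  # padding ids per sequence
+gap2 = (check['attention_mask'] == 0).sum(dim=-1)  # masked positions per sequence
+assert torch.all(gap == gap2)  # padding and mask must agree row by row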
+```
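
The length filter and padding check in the README can be tried without downloading the data. The following is a minimal sketch on a toy batch, not the author's pipeline: `filter_by_length`, the toy token ids, and the `lo`/`hi` bounds are hypothetical and chosen only for illustration. It assumes the same dictionary layout as the README (`input_ids` and `attention_mask` tensors of equal shape).

```python
import torch

PAD_ID = 50256  # GPT-2 <|endoftext|>, used as the padding id per the README


def filter_by_length(tokens, lo, hi):
    """Hypothetical helper: keep rows whose non-padding length is in [lo, hi]."""
    lengths = tokens["attention_mask"].sum(dim=-1)
    keep = (lengths >= lo) & (lengths <= hi)
    # Assumes every value is a tensor indexed by sequence along dim 0.
    return {name: tensor[keep] for name, tensor in tokens.items()}


# Toy batch: three sequences padded to width 6, with real lengths 2, 4 and 6.
input_ids = torch.tensor([
    [11, 12, PAD_ID, PAD_ID, PAD_ID, PAD_ID],
    [21, 22, 23, 24, PAD_ID, PAD_ID],
    [31, 32, 33, 34, 35, 36],
])
attention_mask = (input_ids != PAD_ID).long()
toy = {"input_ids": input_ids, "attention_mask": attention_mask}

# Keep only sequences of 3 to 5 tokens (the README uses 170 to 180).
subset = filter_by_length(toy, lo=3, hi=5)

# Same consistency check as the README: padding ids match masked positions.
pad_count = (subset["input_ids"] == PAD_ID).sum(dim=-1)
masked_count = (subset["attention_mask"] == 0).sum(dim=-1)
assert torch.equal(pad_count, masked_count)
```

Note that this check only holds if token id 50256 never occurs as a real token inside a sequence; a story containing `<|endoftext|>` in its text would make the two counts differ.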