# Data Module

This module handles all data preprocessing, tokenization, and preparation for training.

## Overview

The data pipeline converts raw text into binary token files optimized for training:

- **Raw text collection** from multiple sources
- **Tokenization** using a BPE tokenizer
- **Binary serialization** for efficient loading
- **Train/validation splitting**

## Directory Structure

```
data/
├── raw/                  # Raw text sources
│   ├── books/            # Book corpus
│   ├── wikipedia/        # Wikipedia dumps
│   ├── fineweb/          # Web crawl data
│   └── merged_text/
│       └── corpus.txt    # Combined corpus
├── bin/                  # Tokenized binary files
│   ├── train.bin         # Training data (uint16)
│   └── val.bin           # Validation data (uint16)
└── prepare_data.py       # Tokenization script
```

## Data Processing Pipeline

```
┌──────────────────────────────────────────┐
│ 1. Raw Text Sources                      │
│    - Books: 15 files                     │
│    - Wikipedia: 3 dumps                  │
│    - FineWeb: 1 crawl                    │
└───────────────────┬──────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────┐
│ 2. Merge & Clean                         │
│    → corpus.txt (all text combined)      │
└───────────────────┬──────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────┐
│ 3. Tokenize (prepare_data.py)            │
│    - Load BPE tokenizer                  │
│    - Process line-by-line                │
│    - Append EOS tokens                   │
└───────────────────┬──────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────┐
│ 4. Convert to NumPy (uint16)             │
│    - Vocab size: 32,000 fits in uint16   │
│    - Memory efficient (2 bytes/token)    │
└───────────────────┬──────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────┐
│ 5. Train/Val Split (90/10)               │
│    - train.bin: 325M tokens              │
│    - val.bin:   36M tokens               │
└──────────────────────────────────────────┘
```

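Step 2 (merge & clean) is not included in this module's code listing. The sketch below shows one way such a merge could look; the source directories, file globs, and cleaning rules (collapsing whitespace, dropping empty lines) are assumptions for illustration, not the repository's actual logic.

```python
from pathlib import Path

# Hypothetical merge/clean step (not part of this module's shown code).
# Directory names follow the structure above; the cleaning rules are assumptions.
RAW_DIRS = ["data/raw/books", "data/raw/wikipedia", "data/raw/fineweb"]
OUT_PATH = Path("data/raw/merged_text/corpus.txt")

OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with OUT_PATH.open("w", encoding="utf-8") as out:
    for raw_dir in RAW_DIRS:
        for path in sorted(Path(raw_dir).rglob("*.txt")):
            with path.open(encoding="utf-8", errors="ignore") as f:
                for line in f:
                    line = " ".join(line.split())  # normalize whitespace
                    if line:                       # skip empty lines
                        out.write(line + "\n")
```
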
## Data Preparation Script

**File**: `prepare_data.py`

```python
import numpy as np
from transformers import AutoTokenizer
from tqdm import tqdm

# 1. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
eos_id = tokenizer.eos_token_id

# 2. Read corpus
with open("data/raw/merged_text/corpus.txt") as f:
    lines = f.readlines()

# 3. Tokenize
all_tokens = []
for line in tqdm(lines):
    tokens = tokenizer.encode(line.strip())
    tokens.append(eos_id)  # Mark end of line
    all_tokens.extend(tokens)

# 4. Convert to uint16
ids = np.array(all_tokens, dtype=np.uint16)

# 5. Split
val_count = int(len(ids) * 0.1)
train_ids = ids[:-val_count]
val_ids = ids[-val_count:]

# 6. Save
train_ids.tofile("data/bin/train.bin")
val_ids.tofile("data/bin/val.bin")
```

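After the script finishes, a quick spot check can confirm the binaries are readable. The snippet below is not part of `prepare_data.py`; it memory-maps `train.bin` and decodes the first tokens back to text with the same tokenizer.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
data = np.memmap("data/bin/train.bin", dtype=np.uint16, mode="r")

print(f"{len(data):,} train tokens")         # should match the count reported by the script
print(tokenizer.decode(data[:64].tolist()))  # first ~64 tokens rendered back as text
```
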
## Example: Text → Tokens

**Input Text** (`corpus.txt`):

```
The quick brown fox jumps over the lazy dog.
Machine learning is transforming the world.
```

**Tokenization Process**:

```
Line 1: "The quick brown fox jumps over the lazy dog."
Tokens: [1, 334, 3855, 288, 267, 2959, 354, 267, 12397, 8885, 2]
        [<s>, The, quick, brown, fox, jumps, over, the, lazy, dog, </s>]

Line 2: "Machine learning is transforming the world."
Tokens: [1, 5234, 1234, 456, 7890, 267, 9876, 2]
        [<s>, Machine, learning, is, transforming, the, world, </s>]

Combined: [1, 334, 3855, ..., 2, 1, 5234, ..., 2]
```

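The exact IDs depend on the trained tokenizer in `Tokenizer/BPE`. To reproduce the mapping for any line, the same encode-plus-EOS logic used in `prepare_data.py` can be run directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")

line = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(line) + [tokenizer.eos_token_id]  # mirrors prepare_data.py

print(tokens)                                   # actual IDs for this tokenizer
print(tokenizer.convert_ids_to_tokens(tokens))  # the corresponding token strings
print(tokenizer.decode(tokens, skip_special_tokens=True))   # round-trips back to the text
```
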
**Binary Format**:

```
train.bin structure:
Byte 0-1:        Token 0 (uint16)
Byte 2-3:        Token 1 (uint16)
Byte 4-5:        Token 2 (uint16)
...
Byte 2k-(2k+1):  Token k (uint16)

Total size: 325,004,796 tokens × 2 bytes = ~650 MB
```

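Because every token occupies exactly two bytes, the token count can be recovered from the file size alone, which makes for a cheap consistency check:

```python
import os

n_bytes = os.path.getsize("data/bin/train.bin")
assert n_bytes % 2 == 0, "a uint16 file must contain an even number of bytes"
print(f"{n_bytes // 2:,} tokens, ~{n_bytes / 1e6:.0f} MB")  # expect ~325M tokens, ~650 MB
```
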
## Dataset Statistics

### Corpus Size

```
Raw Text:
- Total files: 19
- Total size:  ~1.4 GB
- Total lines: ~5.2M

Tokenized:
- Total tokens: 361,116,440
- Train tokens: 325,004,796 (90%)
- Val tokens:    36,111,644 (10%)
```

## Usage

### Prepare Data

```bash
# Tokenize corpus
python data/prepare_data.py
```

**Output:**

```
Loading tokenizer from Tokenizer/BPE...
Vocab size: 32000
EOS ID: 2
Reading data/raw/merged_text/corpus.txt...
Total lines: 5,234,567
Tokenizing...
100%|████████████| 5.2M/5.2M [02:34<00:00]
Total tokens: 361,116,440
Train tokens: 325,004,796
Val tokens: 36,111,644
Saved binary files to data/bin/
```

### Load in Training

```python
from train.dataloader import DataLoader

loader = DataLoader("data/bin", batch_size=16, block_size=512, split="train")
x, y = loader.get_batch(device="cuda")

# x: [16, 512] input tokens
# y: [16, 512] target tokens (shifted by 1)
```

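The implementation of `train.dataloader.DataLoader` is not reproduced here; the sketch below shows how a memmap-backed `get_batch` of this shape is commonly written (random windows of `block_size` tokens paired with next-token targets). The function name, arguments, and details are assumptions for illustration, not the module's actual code; memory mapping itself is covered in the next section.

```python
import numpy as np
import torch

def get_batch(bin_path: str, batch_size: int, block_size: int, device: str = "cuda"):
    """Illustrative memmap-backed batch sampler (assumed, not the repo's DataLoader)."""
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    # Random start offsets, leaving room for the one-token-shifted targets.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)

x, y = get_batch("data/bin/train.bin", batch_size=16, block_size=512)  # x, y: [16, 512]
```
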
## Memory-Mapped Loading

The binary files are loaded using `np.memmap` for efficiency:

```python
# Traditional loading (BAD)
data = np.fromfile("train.bin", dtype=np.uint16)  # Loads 650 MB into RAM!

# Memory-mapped loading (GOOD)
data = np.memmap("train.bin", dtype=np.uint16, mode='r')  # OS handles paging
```

**Benefits:**

- **No RAM overhead**: the file stays on disk
- **Fast random access**: the OS caches hot pages (see the example below)
- **Scalable**: works with TB-scale datasets

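To see these properties concretely, a window can be pulled from anywhere in the ~650 MB file without reading the rest of it (the offsets here are arbitrary):

```python
import numpy as np

data = np.memmap("data/bin/train.bin", dtype=np.uint16, mode="r")

print(f"{len(data):,} tokens")            # length comes from the file size; no data is read yet
window = data[200_000_000:200_000_512]    # touches only ~1 KB of pages deep inside the file
print(window[:8])
```
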
## References

- [The Pile: An 800GB Dataset](https://arxiv.org/abs/2101.00027)
- [Data Quality for Language Models](https://arxiv.org/abs/2201.06009)
- [Efficient Data Loading](https://pytorch.org/docs/stable/data.html)