# PureBit Transformer

**A transformer that operates on raw binary bits instead of tokens.**
## Architecture

- **Vocab size**: 2 (just 0 and 1!)
- **d_model**: 256
- **Layers**: 6
- **Heads**: 8
- **Parameters**: ~18M
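The listed hyperparameters can be sketched as a standard decoder-only setup. This is a hypothetical reconstruction, not the actual code in `model.py`: the feed-forward width, context length, and layer wiring are assumptions, so its parameter count will not exactly match the reported ~18M.

```python
import torch
import torch.nn as nn

class PureBitModel(nn.Module):
    """Sketch of a bit-level transformer: vocab 2, d_model 256,
    6 layers, 8 heads (FFN width and max_len are assumptions)."""
    def __init__(self, d_model=256, n_layers=6, n_heads=8, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(2, d_model)      # bit 0/1 -> vector
        self.pos = nn.Embedding(max_len, d_model)  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)          # logits for next bit

    def forward(self, bits):  # bits: (batch, seq) int64 tensor of 0/1
        seq_len = bits.size(1)
        x = self.embed(bits) + self.pos(
            torch.arange(seq_len, device=bits.device))
        # Causal mask so each position only attends to earlier bits
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.encoder(x, mask=mask)
        return self.head(x)  # (batch, seq, 2)

model = PureBitModel()
logits = model(torch.randint(0, 2, (1, 16)))
```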
## Training

- Trained on raw UTF-8 bytes converted to bits
- Best loss achieved: **0.6863** nats (random baseline: ln 2 ≈ 0.693)
- Training data: ~70MB of text ≈ 560M bits
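The byte-to-bit conversion described above can be sketched as follows (MSB-first bit order is an assumption; the actual preprocessing lives in `train.py`):

```python
def text_to_bits(text: str) -> list[int]:
    """Convert text to a flat bit sequence via UTF-8 bytes, MSB first."""
    bits = []
    for byte in text.encode("utf-8"):
        for i in range(7, -1, -1):
            bits.append((byte >> i) & 1)
    return bits

bits = text_to_bits("Hi")  # 2 bytes -> 16 bits
```

Each byte expands into 8 bits, which is why ~70MB of text yields ~560M training positions.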
## Key Insight

This explores whether transformers can learn at the bit level. The results show only marginal learning beyond the random baseline: predicting individual bits is extremely hard when the model has no access to byte-level structure.
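To quantify how close the reported loss sits to chance, convert both to bits per predicted bit:

```python
import math

random_baseline = math.log(2)            # ≈ 0.6931 nats: entropy of a fair coin
best_loss = 0.6863                       # reported best cross-entropy, in nats
bits_per_bit = best_loss / math.log(2)   # convert nats to bits
# ~0.99 bits needed per bit, i.e. only about 1% better than guessing
```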
## Usage

```python
import torch

# Load the best checkpoint (CPU is fine for inspecting metadata)
ckpt = torch.load('purebit_best_70mb.pt', map_location='cpu')
print(f"Loss: {ckpt['loss']:.4f}")
print(f"Bits seen: {ckpt['bits']:,}")
# The model architecture itself lives in model.py
```
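For inference, generated bits must be packed back into bytes before decoding. A minimal inverse of the bits-to-text pipeline might look like this (MSB-first order assumed, matching the training conversion; `infer.py` may differ):

```python
def bits_to_text(bits: list[int]) -> str:
    """Pack an MSB-first bit sequence back into UTF-8 text."""
    assert len(bits) % 8 == 0, "need whole bytes"
    data = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        data.append(byte)
    # Invalid UTF-8 from imperfect generations is replaced, not raised
    return data.decode("utf-8", errors="replace")

text = bits_to_text([0, 1, 0, 0, 1, 0, 0, 0,   # 0x48
                     0, 1, 1, 0, 1, 0, 0, 1])  # 0x69
```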
## Files

- `purebit_best_70mb.pt` - Best checkpoint (loss 0.6863)
- `model.py` - Model architecture
- `train.py` - Training script
- `infer.py` - Inference script
## Author

OpenTransformers - Experimental architecture research