# PureBit Transformer

**A transformer that operates on raw binary bits instead of tokens.**

## Architecture

- **Vocab size**: 2 (just 0 and 1!)
- **d_model**: 256
- **Layers**: 6
- **Heads**: 8
- **Parameters**: ~18M

## Training

- Trained on raw UTF-8 bytes converted to bits (a conversion sketch appears in the appendix below)
- Best loss achieved: **0.6863** (random baseline = ln 2 ≈ 0.6931)
- Training data: ~70 MB of text ≈ 560M bits

## Key Insight

This explores whether transformers can learn at the bit level. The results show only minimal learning beyond the random baseline: predicting individual bits is extremely hard without byte-level structure. (The appendix below works through what the loss implies.)

## Usage

```python
import torch

# Load the best checkpoint on CPU; it stores metadata alongside the weights
ckpt = torch.load('purebit_best_70mb.pt', map_location='cpu')
print(f"Loss: {ckpt['loss']:.4f}")
print(f"Bits seen: {ckpt['bits']:,}")

# Model architecture in model.py
```

## Files

- `purebit_best_70mb.pt` - Best checkpoint (loss 0.6863)
- `model.py` - Model architecture
- `train.py` - Training script
- `infer.py` - Inference script

## Author

OpenTransformers - Experimental architecture research
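
## Appendix: Sketches

The byte-to-bit conversion described under Training likely looks something like the sketch below. `bits_from_text` is a hypothetical helper written for illustration; `train.py` may use a different bit order or framing.

```python
import numpy as np

def bits_from_text(text: str) -> np.ndarray:
    """Flatten a string into individual bits, MSB-first per byte.

    MSB-first order is an assumption; the actual training pipeline
    may unpack bytes differently.
    """
    byte_values = np.frombuffer(text.encode('utf-8'), dtype=np.uint8)
    return np.unpackbits(byte_values)  # shape: (n_bytes * 8,)

bits = bits_from_text("Hi")
print(bits)  # 'H' = 0x48 -> 01001000, 'i' = 0x69 -> 01101001
```

At 8 bits per byte, ~70 MB of text yields the ~560M bits reported above.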
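
To see how close the best loss is to the random baseline, the conversion below treats the reported 0.6863 as a nats-based cross-entropy (consistent with the ln 2 ≈ 0.6931 baseline cited above).

```python
import math

loss_nats = 0.6863         # best reported cross-entropy, nats per bit
bits_per_bit = loss_nats / math.log(2)

print(f"Code length: {bits_per_bit:.4f} bits per input bit")    # ~0.9902
print(f"Entropy removed vs. guessing: {1 - bits_per_bit:.2%}")  # ~0.98%
```

In other words, the model compresses the bit stream by under 1%, which backs the "minimal learning beyond random" conclusion.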
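
Finally, a hedged sketch of what `infer.py` presumably does: sample bits autoregressively and repack them into bytes. The class name `PureBitTransformer`, its constructor arguments, the `'model'` checkpoint key, and the output shape are all assumptions, not the repository's confirmed API.

```python
import numpy as np
import torch

from model import PureBitTransformer  # hypothetical class name

ckpt = torch.load('purebit_best_70mb.pt', map_location='cpu')
model = PureBitTransformer(vocab_size=2, d_model=256, n_layers=6, n_heads=8)
model.load_state_dict(ckpt['model'])  # assumed checkpoint key
model.eval()

# Seed with the bits of a UTF-8 prompt, shape (1, seq_len)
prompt_bits = np.unpackbits(np.frombuffer(b'Hello', dtype=np.uint8))
bits = torch.tensor(prompt_bits, dtype=torch.long).unsqueeze(0)

with torch.no_grad():
    for _ in range(64):  # 8 bytes' worth of new bits
        logits = model(bits)  # assumed output shape: (1, seq_len, 2)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_bit = torch.multinomial(probs, num_samples=1)
        bits = torch.cat([bits, next_bit.view(1, 1)], dim=1)

# Repack generated bits into bytes; invalid UTF-8 is likely at this loss level
generated = np.packbits(bits[0, -64:].numpy().astype(np.uint8))
print(generated.tobytes().decode('utf-8', errors='replace'))
```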