# PureBit Transformer

**A transformer that operates on raw binary bits instead of tokens.**

## Architecture
- **Vocab size**: 2 (just 0 and 1!)
- **d_model**: 256
- **Layers**: 6
- **Heads**: 8
- **Parameters**: ~18M
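As a sanity check on the parameter count, a rough estimate can be derived from the numbers above. This is a sketch under assumptions not stated in this README (feedforward width, bias terms, no positional embeddings counted), so treat the result as indicative rather than exact:

```python
def estimate_params(d_model: int, n_layers: int, d_ff: int, vocab: int = 2) -> int:
    """Rough parameter count for a standard transformer encoder stack."""
    attn = 4 * d_model * d_model + 4 * d_model  # Q, K, V, output projections + biases
    ffn = 2 * d_model * d_ff + d_model + d_ff   # two linear layers + biases
    norms = 2 * 2 * d_model                     # two LayerNorms (scale + shift)
    per_layer = attn + ffn + norms
    embed = vocab * d_model                     # token embedding (just 0 and 1)
    head = d_model * vocab + vocab              # output projection back to 2 classes
    return n_layers * per_layer + embed + head

# With a conventional 4x feedforward width (an assumption):
print(estimate_params(d_model=256, n_layers=6, d_ff=1024))
```

This lands well under the stated ~18M, which suggests the checkpoint also carries components not listed here, such as learned positional embeddings over a long bit context or a wider feedforward layer.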

## Training
- Trained on raw UTF-8 bytes converted to bits
- Best loss achieved: **0.6863** (random baseline: ln 2 ≈ 0.6931)
- Training data: ~70MB of text = 560M bits
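The byte-to-bit conversion in the training pipeline might look like the following minimal sketch (MSB-first bit order is an assumption; `train.py` may differ):

```python
def bytes_to_bits(data: bytes) -> list[int]:
    """Unpack each byte into 8 bits, most significant bit first."""
    return [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]

# Every byte expands to 8 bits, so ~70MB of UTF-8 text yields ~560M bits.
bits = bytes_to_bits("Hi".encode("utf-8"))  # 2 bytes -> 16 bits
```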

## Key Insight
This explores whether transformers can learn at the bit level. The results show only marginal learning beyond random chance: predicting individual bits is extremely hard once byte-level structure is discarded.
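For context, the random baseline is the cross-entropy of a fair coin, ln 2 ≈ 0.6931 nats per bit. The best loss of 0.6863 sits only about 1% below it:

```python
import math

random_baseline = math.log(2)  # cross-entropy of guessing each bit uniformly
best_loss = 0.6863             # from the checkpoint described above

# Fraction of the random-guessing entropy still unexplained by the model:
print(f"{best_loss / random_baseline:.4f}")  # about 0.99
```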

## Usage
```python
import torch

# Load checkpoint (map to CPU so no GPU is required for inspection)
ckpt = torch.load('purebit_best_70mb.pt', map_location='cpu')
print(f"Loss: {ckpt['loss']:.4f}")
print(f"Bits seen: {ckpt['bits']:,}")

# Model architecture is defined in model.py
```

## Files
- `purebit_best_70mb.pt` - Best checkpoint (loss 0.6863)
- `model.py` - Model architecture
- `train.py` - Training script
- `infer.py` - Inference script

## Author
OpenTransformers - Experimental architecture research