---
license: mit
tags:
- binary-neural-network
- zero-tokenization
- wire-speed-learning
- bit-level
- byte-level
language:
- en
pipeline_tag: text-generation
---

# Binary Transformers: Learning Language from Raw Binary

**Zero-tokenization transformers that learn directly from network bytes, bits, and beyond.**

This repository contains four novel transformer architectures exploring the limits of minimal vocabulary learning:
|
| Model | Vocab | Input | Weights | Description |
|-------|-------|-------|---------|-------------|
| **Byte-level** | 256 | bytes (0x00-0xFF) | real | One token per byte value |
| **Bit-level** | 2 | bits (0, 1) | real | Pure binary, 8 tokens per byte |
| **Dibit** | 4 | dibits (00,01,10,11) | real | 2-bit tokens, 4 per byte |
| **Pure Binary** | 2 | bits (0, 1) | **binary (-1/+1)** | BITS ALL THE WAY DOWN |
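
To make the token counts in the table concrete, here is a small sketch of how one byte string maps onto the three input schemes. This is illustrative only, not code from the trainers, which may pack bits in a different order; bits and dibits are read MSB-first here.

```python
# Hypothetical helpers illustrating the three input schemes above
# (the actual trainers may use a different bit order).
def byte_tokens(data: bytes):
    return list(data)                                                  # vocab=256, 1 token/byte

def bit_tokens(data: bytes):
    return [(b >> i) & 1 for b in data for i in range(7, -1, -1)]      # vocab=2, 8 tokens/byte

def dibit_tokens(data: bytes):
    return [(b >> i) & 0b11 for b in data for i in range(6, -2, -2)]   # vocab=4, 4 tokens/byte

print(byte_tokens(b"Hi"))   # [72, 105]
print(bit_tokens(b"Hi"))    # [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
print(dibit_tokens(b"Hi"))  # [1, 0, 2, 0, 1, 2, 2, 1]
```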
|
## Why?

Traditional LLMs use tokenizers (BPE, SentencePiece) with 32k-256k token vocabularies. This creates:
- Tokenizer overhead and complexity
- Language/domain bias baked into the vocabulary
- A preprocessing bottleneck

**What if we eliminated tokenization entirely?**

These models learn directly from raw binary data - no tokenizer, no preprocessing, just bytes flowing into neural networks. The ultimate goal: **wire-speed learning**, where models absorb network traffic in real time.

## Results (Live Experiments - 16 Jan 2026)
|
### Byte-Level (vocab=256)
```
Data: 350KB web crawl
BPB: 4.68 (vs 8.0 random = 41% compression)
Speed: 8.7 KB/s
Params: 0.6M
```
Learns HTML structure, XML tags, and timestamps from raw bytes.
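
For reference, the BPB and compression figures in this section are assumed to be derived from the mean cross-entropy loss as follows; this is a reading of the numbers, not code taken from the trainers.

```python
import math

# Hypothetical loss value chosen to reproduce the byte-level figure above.
loss_nats = 3.24                   # mean next-token cross-entropy in nats
bpb = loss_nats / math.log(2)      # bits per byte, ~4.68
compression = 1 - bpb / 8.0        # vs 8 bits for uniform random bytes, ~0.41
# For the bit-level models the same idea applies per bit, against a 1.0 bit/bit baseline.
```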
|
### Bit-Level (vocab=2)
```
Data: 550KB
Entropy: 1.008 bit/bit (about 0.8% above the 1.0 random baseline)
Speed: 0.7 KB/s
Params: 85M
```
Pure binary learning - discovers byte boundaries and ASCII from 0s and 1s.
|
### Dibit (vocab=4: 00,01,10,11)
```
Data: 437KB
BPB: 7.55 (vs 8.0 random = 5.6% compression)
Speed: 0.25 KB/s
Params: 37.8M
```
2-bit tokens provide 2x context efficiency vs bit-level. **Best compression of the sub-byte models so far!**
|
### Pure Binary (vocab=2, binary weights)
```
Data: 806KB
Entropy: 0.995 bit/bit (0.5% compression)
Binary params: 99.8%
Params: 4.7M
```
**BITS ALL THE WAY DOWN** - input bits, binary weights (-1/+1), output bits.
On specialized hardware, this enables XNOR+popcount operations instead of multiply-accumulate.
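
As a sketch of why this matters for hardware: with activations and weights constrained to {-1, +1} and encoded as bits (bit 1 encodes +1, bit 0 encodes -1), a dot product reduces to XNOR plus popcount. The snippet below is plain Python for illustration; the repo does not yet ship such packed kernels.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-dim {-1,+1} vectors packed as n-bit ints (1 encodes +1, 0 encodes -1)."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")   # XNOR, then popcount of agreeing positions
    return 2 * agree - n                                # +1 per agreement, -1 per disagreement

# a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  ->  1 - 1 - 1 + 1 = 0
print(binary_dot(0b1011, 0b1101, 4))                    # 0
```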
|
## Architecture

All models use a standard transformer architecture with:
- Causal self-attention
- GELU activation
- LayerNorm
- AdamW optimizer
- Straight-Through Estimator (STE) for binary weight gradients (see the sketch below)
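
A minimal sketch of the straight-through estimator idea for a binary linear layer, assuming the common formulation (sign of the latent real-valued weights in the forward pass, identity gradient in the backward pass); the actual layers in purebit_trainer.py may differ in detail.

```python
import torch
import torch.nn as nn

class BinaryLinear(nn.Module):
    """Linear layer with {-1,+1} weights trained via a straight-through estimator."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Latent real-valued weights; only their sign is used in the forward pass.
        self.weight = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        w_bin = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))  # binarize to -1/+1
        w_ste = w + (w_bin - w).detach()    # forward uses w_bin, gradient flows straight to w
        return x @ w_ste.t()
```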
|
### Key Innovation: Online Learning

Unlike traditional batch training, these models learn from streaming data (see the loop sketch after this list):
- Micro-batches (32-512 tokens)
- Single pass, no data curation
- Compatible with real-time network streams
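
A sketch of what such a streaming loop can look like, reading bytes from stdin as in the Usage section below; `model` and its loss-returning call are placeholders, not the trainers' actual API.

```python
import sys
import torch

MICRO_BATCH = 512                                   # tokens per step (32-512 range above)

def stream_train(model, optimizer, device="cpu"):
    stream = sys.stdin.buffer
    while True:
        chunk = stream.read(MICRO_BATCH + 1)        # +1 byte so inputs/targets overlap by one
        if len(chunk) < 2:
            break                                   # end of stream
        ids = torch.tensor(list(chunk), dtype=torch.long, device=device)
        x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)
        loss = model(x, targets=y)                  # assumed to return next-token CE loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                            # one pass over the stream, no replay buffer
```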
|
## Usage

### Byte-Level
```bash
# Pipe any data source
cat data.bin | python byte_trainer.py
curl -s http://example.com | python byte_trainer.py
zcat crawl.jsonl.gz | python byte_trainer.py
```

### Bit-Level
```bash
cat data.bin | python bit_trainer.py
```

### Dibit (2-bit tokens)
```bash
cat data.bin | python dibit_trainer.py
```

### Pure Binary (binary weights)
```bash
cat data.bin | python purebit_trainer.py
```
|
## Configuration

Edit the CONFIG dict in each trainer:

```python
CONFIG = {
    "d": 256,       # embedding dimension
    "layers": 6,    # transformer layers
    "heads": 8,     # attention heads
    "vocab": 2,     # vocabulary size
    "ctx": 2048,    # context length
}
```
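
As a rough sanity check on model size, a standard GPT-style block with a 4x MLP expansion has about 12*d^2 weights per layer, plus token and positional embeddings. The estimate below assumes that layout; it is not the trainers' exact count (biases, tied heads, etc. will shift it).

```python
def approx_params(cfg: dict) -> int:
    """Very rough parameter count for a standard decoder-only transformer."""
    d, layers = cfg["d"], cfg["layers"]
    embeddings = cfg["vocab"] * d + cfg["ctx"] * d   # token + learned positional embeddings
    blocks = layers * 12 * d * d                     # 4*d^2 attention + 8*d^2 MLP per layer
    head = d * cfg["vocab"]                          # output projection, if untied
    return embeddings + blocks + head

print(approx_params({"d": 256, "layers": 6, "heads": 8, "vocab": 2, "ctx": 2048}))  # ~5.2M
```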
|
## Files

```
byte_trainer.py      # Vocab=256, one token per byte
bit_trainer.py       # Vocab=2, pure bits
dibit_trainer.py     # Vocab=4, 2-bit tokens (00,01,10,11)
purebit_trainer.py   # Vocab=2 + binary weights (-1/+1)
```
|
## Insights

1. **Byte-level is the sweet spot** - a 256-token vocab captures ASCII structure efficiently while eliminating tokenizer overhead

2. **Bit-level works but is slow** - 8x longer sequences mean 8x less context per forward pass

3. **Dibit balances the two** - 2-bit tokens give 2x the context of bit-level while staying "pure binary"

4. **Binary weights are viable** - 99.8% binary params learn almost as well as real weights, enabling massive hardware speedups

5. **HTML is natural SFT data** - web data already contains instruction-following patterns: `<h3>Question</h3><p>Answer`, `<dt>Term</dt><dd>Definition</dd>`, JSON Q&A
|
## Future Work

- Scale to billions of parameters
- Custom CUDA kernels for binary ops (XNOR + popcount)
- FPGA/ASIC implementation for true wire-speed learning
- Hierarchical binary models (bit → byte → word emergence)
|
## Citation

```bibtex
@misc{opentransformer2026binary,
  title={Binary Transformers: Learning Language from Raw Binary},
  author={OpenTransformer},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/OpenTransformer/binary-transformers}
}
```
|
## License

MIT

## Acknowledgments

Built with PyTorch. Trained on vast.ai GPU instances. Part of the AGILLM research project.
|