# Wire-Speed Transformer: Real-Time Learning from Live Network Streams

**A novel approach to transformer training that learns directly from network traffic in real time.**

## Key Results

| Time | Tokens | Loss | Notes |
|------|--------|------|-------|
| 0s   | 0      | -     | Start |
| 14s  | 10k    | 50.08 | Initial |
| 192s | 100k   | 22.32 | -55% |
| 302s | 170k   | 16.78 | -66% |
| 355s | 190k   | 15.91 | **-68%** |

**Loss dropped from 50 → 16 in under 6 minutes using only 32-token micro-batches from raw, uncurated web data.**

## What Makes This Different

Traditional transformer training requires:
- Large batch sizes (4096+)
- Multiple epochs over curated data
- Expensive preprocessing pipelines
- Hours or days of training

Wire-Speed Learning uses:
- **32-token micro-batches** (128x smaller)
- **Single pass** (no epochs)
- **Raw web data** (no curation)
- **Online SGD** (update every 32 tokens)
- **Real-time network stream** (Rust crawler → Python trainer)
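
The list above is the whole training recipe, and the update rule is plain online SGD. As a dependency-free illustration (not the real `stream_trainer.py` — the vocabulary, learning rate, and simulated stream below are invented for the demo), here is the same every-32-tokens update applied to a bigram logit table:

```python
import math
import random

VOCAB, MICRO_BATCH, LR = 16, 32, 0.5
random.seed(0)

# Bigram "model": logits[prev][next], trained purely online.
logits = [[0.0] * VOCAB for _ in range(VOCAB)]

def sgd_step(tokens):
    """One gradient update from a single micro-batch of tokens."""
    loss = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = logits[prev]
        m = max(row)                      # softmax with max-subtraction
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        probs = [e / z for e in exps]
        loss += -math.log(probs[nxt])
        # Cross-entropy gradient: probs - onehot(next), averaged over the batch.
        for j in range(VOCAB):
            row[j] -= LR * (probs[j] - (1.0 if j == nxt else 0.0)) / len(tokens)
    return loss / (len(tokens) - 1)

# Simulated "structured web" stream: token i is usually followed by i+1.
stream, t = [], 0
for _ in range(2000):
    t = (t + 1) % VOCAB if random.random() < 0.9 else random.randrange(VOCAB)
    stream.append(t)

losses = [sgd_step(stream[i:i + MICRO_BATCH])
          for i in range(0, len(stream) - MICRO_BATCH, MICRO_BATCH)]
print(f"first micro-batch loss {losses[0]:.2f} -> last {losses[-1]:.2f}")
```

The loop has the same shape at 36M parameters: consume 32 tokens, compute cross-entropy, take one gradient step, repeat — no batching queue, no epochs.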

## Architecture

```
┌─────────────────┐      ┌──────────────┐      ┌─────────────────┐
│  Rust Crawler   │─────▶│  Tokenizer   │─────▶│ Python Trainer  │
│  (500 workers)  │      │  (DeepSeek)  │      │  (36M params)   │
│  ~500 pages/s   │      │  128k vocab  │      │   ~500 tok/s    │
└─────────────────┘      └──────────────┘      └─────────────────┘
         │                                              │
         ▼                                              ▼
   Live Internet                                Gradient Update
  (no robots.txt)                              (every 32 tokens)
```

## Model Config

```python
CONFIG = {
    "d": 256,         # embedding dim
    "layers": 4,      # transformer layers
    "heads": 8,       # attention heads
    "rank": 32,       # tunable attention rank
    "vocab": 128256,  # DeepSeek V3.2 tokenizer
    "ctx": 512,       # context window
}
# Total: 35,993,088 parameters (36M)
```

## Quick Start

### Requirements

- CUDA GPU (8GB+ VRAM)
- Rust toolchain
- Python 3.8+
- PyTorch 2.0+

### Installation

```bash
# Clone
git clone https://huggingface.co/OpenTransformer/wire-speed-transformer
cd wire-speed-transformer

# Install Rust (if needed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source ~/.cargo/env

# Build the Rust crawler
cd feeder && cargo build --release && cd ..

# Download the DeepSeek tokenizer
curl -sL https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/tokenizer.json -o tokenizer.json

# Install Python deps
pip install torch

# Run!
./feeder/target/release/wire_feeder 2>feeder.log | python3 stream_trainer.py
```
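
The exact wire format between `wire_feeder` and `stream_trainer.py` is defined in those files; assuming, purely for illustration, that the feeder wrote whitespace-separated token IDs, the trainer side of the pipe could group them like this:

```python
import sys

MICRO_BATCH = 32  # tokens per gradient update

def micro_batches(lines, batch_size=MICRO_BATCH):
    """Group whitespace-separated token IDs from a line stream into
    fixed-size micro-batches, yielding each one as soon as it fills."""
    buf = []
    for line in lines:
        for tok in line.split():
            buf.append(int(tok))
            if len(buf) == batch_size:
                yield buf
                buf = []

# In the trainer this would be: for batch in micro_batches(sys.stdin): ...
demo = list(micro_batches(["1 2 3", "4 5 6 7 8"], batch_size=4))
print(demo)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A binary framing (fixed-width integers) would be faster than text; the point is only that the trainer can start updating as soon as the first 32 tokens arrive.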

## Files

- `stream_trainer.py` - Python transformer trainer (online learning)
- `feeder/` - Rust high-speed web crawler + tokenizer
- `tokenizer.json` - DeepSeek V3.2 tokenizer (download separately)
- `run.sh` - Launch script

## Why This Works (Hypotheses)

1. **Small models converge faster** - 36M params need far less data than 7B
2. **High update frequency** - more gradient signal per token, despite the noise
3. **The web has structure** - HTML patterns and common phrases provide learning signal
4. **DeepSeek tokenizer** - high-quality tokenization from a SOTA model
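
Hypothesis 2 is easy to quantify. At the token budget from the results table, 32-token micro-batches yield over a hundred times more gradient updates than a conventional 4096-token batch would (the step counts below ignore any gradient accumulation):

```python
tokens = 190_000         # token budget from the results table
micro, large = 32, 4096  # wire-speed micro-batch vs. a conventional batch

micro_steps = tokens // micro  # gradient updates with 32-token batches
large_steps = tokens // large  # gradient updates with 4096-token batches
print(micro_steps, large_steps)  # 5937 vs. 46
```

Each individual step is noisier, but the model sees 128x as many opportunities to correct itself per token consumed.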

## Limitations

- No evaluation yet (just training loss)
- Model is tiny (36M) - won't match GPT-4
- Catastrophic forgetting not measured
- Raw web data quality unknown

## Citation

```bibtex
@misc{wirespeed2026,
  title={Wire-Speed Transformer: Real-Time Learning from Live Network Streams},
  author={OpenTransformers},
  year={2026},
  url={https://huggingface.co/OpenTransformer/wire-speed-transformer}
}
```

## Acknowledgments

- DeepSeek for the tokenizer
- Anthropic's Claude for pair programming
- vast.ai for GPU compute

## License

MIT

---

*Built by OpenTransformers - pushing the boundaries of what's possible with transformers.*