# Wire-Speed Transformer: Real-Time Learning from Live Network Streams

**A novel approach to transformer training that learns directly from network traffic in real time.**

## 🔥 Key Results

| Time | Tokens | Loss  | Notes    |
|------|--------|-------|----------|
| 0s   | 0      | -     | Start    |
| 14s  | 10k    | 50.08 | Initial  |
| 192s | 100k   | 22.32 | -55%     |
| 302s | 170k   | 16.78 | -66%     |
| 355s | 190k   | 15.91 | **-68%** |

**Loss dropped from 50 → 16 in under 6 minutes, using only 32-token micro-batches from raw, uncurated web data.**

## 🧠 What Makes This Different

Traditional transformer training typically requires:

- Large batch sizes (4096+)
- Multiple epochs over curated data
- Expensive preprocessing pipelines
- Hours or days of training

Wire-Speed Learning uses:

- **32-token micro-batches** (128x smaller than a batch of 4096)
- **A single pass** (no epochs)
- **Raw web data** (no curation)
- **Online SGD** (one update every 32 tokens)
- **A real-time network stream** (Rust crawler → Python trainer)

## 🏗️ Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Rust Crawler   │────▶│  Tokenizer   │────▶│  Python Trainer │
│  (500 workers)  │     │  (DeepSeek)  │     │   (36M params)  │
│  ~500 pages/s   │     │  128k vocab  │     │   ~500 tok/s    │
└─────────────────┘     └──────────────┘     └─────────────────┘
        │                                            │
        ▼                                            ▼
   Live Internet                             Gradient Update
  (no robots.txt)                           (every 32 tokens)
```

## 📊 Model Config

```python
CONFIG = {
    "d": 256,         # embedding dim
    "layers": 4,      # transformer layers
    "heads": 8,       # attention heads
    "rank": 32,       # tunable attention rank
    "vocab": 128256,  # DeepSeek V3.2 tokenizer
    "ctx": 512,       # context window
}
# Total: 35,993,088 parameters (36M)
```

## 🚀 Quick Start

### Requirements

- CUDA GPU (8GB+ VRAM)
- Rust toolchain
- Python 3.8+
- PyTorch 2.0+

### Installation

```bash
# Clone
git clone https://huggingface.co/OpenTransformer/wire-speed-transformer
cd wire-speed-transformer

# Install Rust (if needed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source ~/.cargo/env

# Build Rust crawler
cd feeder && cargo build --release && cd ..

# Download DeepSeek tokenizer
curl -sL https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/tokenizer.json -o tokenizer.json

# Install Python deps
pip install torch

# Run!
./feeder/target/release/wire_feeder 2>feeder.log | python3 stream_trainer.py
```

## 📁 Files

- `stream_trainer.py` - Python transformer trainer (online learning)
- `feeder/` - Rust high-speed web crawler + tokenizer
- `tokenizer.json` - DeepSeek V3.2 tokenizer (download separately)
- `run.sh` - Launch script

## 🔬 Why This Works (Hypotheses)

1. **Small models converge faster** - a 36M-parameter model needs far less data than a 7B one
2. **High update frequency** - more gradient signal per second, despite the noise
3. **The web has structure** - HTML patterns and common phrases provide learning signal
4. **DeepSeek tokenizer** - high-quality tokenization from a SOTA model
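To make the pipeline concrete, below is a minimal sketch of how an online trainer could consume the feeder's token stream and take one SGD step per 32-token micro-batch. This is **not** the actual `stream_trainer.py`: the input format (one token ID per line on stdin), the stock `nn.TransformerEncoder` standing in for the real 36M-parameter model (which presumably implements the low-rank attention suggested by `rank` in the config), and the learning rate are all illustrative assumptions.

```python
# Illustrative sketch only: a toy online trainer in the spirit of the pipeline above.
# Assumptions (not taken from this repo): the feeder emits one token ID per line on
# stdout, and a stock TransformerEncoder stands in for the real 36M-parameter model.
import sys

import torch
import torch.nn as nn

MICRO_BATCH = 32  # tokens per gradient update, as described above


class TinyLM(nn.Module):
    """Embedding + learned positions -> causal TransformerEncoder -> next-token logits."""

    def __init__(self, d=256, layers=4, heads=8, vocab=128256, ctx=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(ctx, d)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=heads, dim_feedforward=4 * d, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        # ids: (batch, seq) token IDs
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.embed(ids) + self.pos(positions)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        return self.head(self.blocks(x, mask=mask))


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyLM().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # illustrative hyperparameter
    loss_fn = nn.CrossEntropyLoss()

    buf = []
    for line in sys.stdin:  # live token stream from the Rust feeder
        line = line.strip()
        if not line:
            continue
        buf.append(int(line))
        if len(buf) < MICRO_BATCH + 1:  # need one extra token for the shifted targets
            continue
        ids = torch.tensor(buf[: MICRO_BATCH + 1], device=device).unsqueeze(0)
        logits = model(ids[:, :-1])            # predict the next token at each position
        loss = loss_fn(logits.squeeze(0), ids[0, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()                             # one online SGD step per 32 tokens
        print(f"loss {loss.item():.2f}", file=sys.stderr)
        buf = buf[MICRO_BATCH:]                # carry the last token into the next window


if __name__ == "__main__":
    main()
```

The point of the sketch is the control flow rather than the architecture: there is no dataset, no epochs, and no shuffling, just a rolling buffer over the incoming stream and a gradient update every 32 tokens.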
## ⚠️ Limitations

- No evaluation yet (only training loss)
- The model is tiny (36M); it won't match GPT-4
- Catastrophic forgetting is not measured
- Raw web data quality is unknown

## 📝 Citation

```bibtex
@misc{wirespeed2026,
  title={Wire-Speed Transformer: Real-Time Learning from Live Network Streams},
  author={OpenTransformers},
  year={2026},
  url={https://huggingface.co/OpenTransformer/wire-speed-transformer}
}
```

## 🙏 Acknowledgments

- DeepSeek for the tokenizer
- Anthropic's Claude for pair programming
- vast.ai for GPU compute

## 📜 License

MIT

---

*Built by OpenTransformers - pushing the boundaries of what's possible with transformers.*