# BitLinear Performance Benchmarks
This document provides detailed performance analysis of BitLinear compared to standard `nn.Linear` layers.
## Memory Compression
BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.
### Compression Results
| Layer Size | nn.Linear (MB) | BitLinear Packed (MB) | Compression Ratio |
|------------|----------------|----------------------|-------------------|
| 512×512 | 1.0020 | 0.0539 | 18.59x |
| 768×768 | 2.2529 | 0.1184 | 19.03x |
| 1024×1024 | 4.0039 | 0.2078 | 19.27x |
| 2048×2048 | 16.0078 | 0.8156 | 19.63x |
| 4096×4096 | 64.0156 | 3.2313 | 19.81x |
| 768×3072 | 9.0117 | 0.4734 | 19.03x |
| 1024×4096 | 16.0156 | 0.8313 | 19.27x |
**Average Compression:** 19.23x (96% of the theoretical 20x maximum)
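The ratios in the table can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a float32 weight matrix plus a float32 bias for `nn.Linear`, and, for BitLinear, packed ternary weights (5 values per byte) plus a float32 bias and one float32 scale per output row; `linear_mb` and `bitlinear_mb` are illustrative helpers, not library functions, but this layout reproduces the table above.

```python
import math

MB = 2 ** 20

def linear_mb(d_in, d_out):
    # nn.Linear: float32 weight matrix plus a float32 bias vector
    return (d_in * d_out + d_out) * 4 / MB

def bitlinear_mb(d_in, d_out):
    # Packed ternary weights at 5 values per byte, plus a float32 bias
    # and one float32 scale per output row (an assumed layout that
    # matches the reported sizes)
    packed = math.ceil(d_in * d_out / 5)
    return (packed + d_out * 8) / MB

for d in (512, 1024, 4096):
    print(f"{d}x{d}: {linear_mb(d, d) / bitlinear_mb(d, d):.2f}x")
```

For 512×512 this gives 1.0020 MB vs 0.0539 MB (18.59x), matching the first row of the table.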
### Real-World Example: GPT-2 Small
Configuration:
- 12 Transformer layers
- d_model = 768
- d_ff = 3072
- Total parameters: 84,934,656
Memory Usage:
- **nn.Linear:** 324.00 MB
- **BitLinear (packed):** 16.83 MB
- **Memory Saved:** 307.17 MB
- **Compression Ratio:** 19.25x
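The GPT-2 Small numbers follow from the layer shapes. The sketch below counts the float32 weights only on the `nn.Linear` side, and packed weights plus 8 bytes per output row (e.g. a float32 bias and per-row scale) on the BitLinear side; these accounting assumptions are mine, chosen because they reproduce the figures above.

```python
import math

MB = 2 ** 20
layers, d_model, d_ff = 12, 768, 3072

# Per transformer layer: four d_model x d_model attention projections
# plus the two feed-forward matrices (d_model x d_ff and d_ff x d_model)
shapes = [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff, d_model)]

params = layers * sum(i * o for i, o in shapes)

fp32_bytes = layers * sum(i * o * 4 for i, o in shapes)  # float32 weights only
packed_bytes = layers * sum(
    math.ceil(i * o / 5) + o * 8  # packed weights + per-row bias/scale (assumption)
    for i, o in shapes
)

print(params, fp32_bytes / MB, packed_bytes / MB, fp32_bytes / packed_bytes)
```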
## Accuracy Analysis
BitLinear maintains high output similarity despite extreme quantization:
### Output Similarity Metrics
From `examples/transformer_example.py` (Transformer block with 6 linear layers):
- **MSE:** 0.083
- **Cosine Similarity:** 0.963 (96.3%)
- **Relative Error:** 0.279 (27.9%)
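The three metrics above can be computed from a reference output and a quantized output as follows. These are the standard formulations; whether `examples/transformer_example.py` uses exactly these definitions is an assumption.

```python
import numpy as np

def similarity_metrics(y_ref, y_quant):
    # MSE between the reference (nn.Linear) and quantized (BitLinear) outputs
    mse = np.mean((y_ref - y_quant) ** 2)
    # Cosine similarity of the flattened outputs
    cos = np.dot(y_ref.ravel(), y_quant.ravel()) / (
        np.linalg.norm(y_ref) * np.linalg.norm(y_quant)
    )
    # Relative error: ||y_ref - y_quant|| / ||y_ref||
    rel = np.linalg.norm(y_ref - y_quant) / np.linalg.norm(y_ref)
    return mse, cos, rel
```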
### Multi-Ternary Improvement
Using k=3 ternary components significantly improves accuracy:
- **k=1 Relative Error:** 0.501
- **k=3 Relative Error:** 0.124
- **Improvement:** 75.1%
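The idea behind multi-ternary quantization is a greedy residual fit: each ternary component quantizes whatever error the previous components left behind, so the relative error shrinks as k grows. The sketch below uses a TWN-style thresholding scheme as the per-component quantizer; the exact scheme used by `MultiTernaryLinear` is an assumption here.

```python
import numpy as np

def ternary_step(r):
    # One ternary component (TWN-style, an assumed scheme): keep entries
    # above 0.7 * mean|r|, scale = mean magnitude of the kept entries
    thresh = 0.7 * np.abs(r).mean()
    t = np.where(np.abs(r) > thresh, np.sign(r), 0.0)
    kept = np.abs(r[t != 0])
    return (kept.mean() if kept.size else 0.0), t

def multi_ternary_error(w, k):
    # Greedy residual fit: each component quantizes what the previous left over
    residual = w.copy()
    for _ in range(k):
        scale, t = ternary_step(residual)
        residual = residual - scale * t
    return np.linalg.norm(residual) / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
errs = [multi_ternary_error(w, k) for k in (1, 2, 3)]
print(errs)  # relative error decreases as k grows
```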
## Performance Characteristics
### Forward Pass Time
> **Note:** Current Python implementation may be slower than nn.Linear. C++/CUDA extensions provide optimized kernels for production use.
The Python implementation prioritizes correctness and clarity. For production deployments:
- Use C++ CPU kernels for CPU inference
- Use CUDA kernels for GPU inference
- Expect 2-5x speedup from ternary-specific optimizations
### Memory vs Speed Trade-off
BitLinear offers different configurations for various use cases:
| Configuration | Memory | Accuracy | Speed |
|--------------|--------|----------|-------|
| BitLinear (k=1) | 19x less | Good | Fast |
| MultiTernaryLinear (k=2) | 9.5x less | Better | Medium |
| MultiTernaryLinear (k=3) | 6.3x less | Best | Slower |
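The memory column follows from the bit budget: k components cost k × 1.6 bits per weight after packing, versus 32 bits for float32, so the ideal ratio is 20x/k. The table's slightly lower figures reflect bias/scale overhead.

```python
# Ideal compression ratio for k ternary components, ignoring
# bias/scale overhead: 32 bits / (k * 1.6 bits per packed value)
for k in (1, 2, 3):
    print(f"k={k}: {32 / (k * 1.6):.1f}x")
```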
## Packing Efficiency
Base-3 packing achieves near-theoretical compression:
- **Theoretical:** log₂(3) ≈ 1.58 bits per ternary value
- **Actual:** 5 ternary values per byte (1.6 bits per value)
- **Efficiency:** ≈99.1% of the theoretical maximum (1.585 / 1.6)
### Packing Details
- Ternary values {-1, 0, +1} mapped to {0, 1, 2}
- 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
- Maximum packed value: 242 < 256 (fits in uint8)
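The packing scheme above can be sketched in NumPy as follows (function names are illustrative, not the library's API). Each byte stores five base-3 digits, and padding uses the digit 1, i.e. the ternary value 0, so trailing pad positions decode to zero weights.

```python
import numpy as np

POWERS = 3 ** np.arange(5)  # [1, 3, 9, 27, 81]

def pack_base3(t):
    """Pack ternary values {-1, 0, +1} into uint8, 5 values per byte."""
    d = (np.asarray(t, dtype=np.int8) + 1).astype(np.uint8)  # {-1,0,1} -> {0,1,2}
    pad = (-d.size) % 5
    d = np.concatenate([d, np.ones(pad, dtype=np.uint8)])    # pad with digit 1 (value 0)
    # byte = d0 + 3*d1 + 9*d2 + 27*d3 + 81*d4; maximum is 242 < 256
    return (d.reshape(-1, 5) * POWERS).sum(axis=1).astype(np.uint8)

def unpack_base3(packed, n):
    """Recover the first n ternary values from a packed uint8 array."""
    digits = (packed[:, None].astype(np.int64) // POWERS) % 3
    return (digits.ravel()[:n] - 1).astype(np.int8)
```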
## Use Cases
### Ideal For:
- **Edge Deployment:** Reduced memory footprint for mobile/embedded devices
- **Large Models:** Significant savings for billion-parameter models
- **Inference:** Production serving with memory constraints
- **Research:** Exploring ultra-low-precision neural networks
### Considerations:
- **Training:** Requires quantization-aware training (QAT) for best results
- **Accuracy:** an accuracy drop of roughly 3-5% is acceptable for many applications
- **Speed:** Python implementation slower; use C++/CUDA for production
## Benchmarking
Run benchmarks yourself:
```bash
# Memory compression analysis
python benchmarks/benchmark_memory.py

# Performance comparison
python benchmarks/benchmark_performance.py
```
## Comparison with Other Methods
| Method | Bits/Weight | Compression | Accuracy | Implementation |
|--------|-------------|-------------|----------|----------------|
| Float32 | 32 | 1x | Baseline | Standard |
| Float16 | 16 | 2x | ~Baseline | Standard |
| INT8 | 8 | 4x | High | Quantization |
| **BitLinear** | **1.58** | **~19x** | **Good** | **Ternary** |
## References
- **BitNet Paper:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- **JMLR Paper:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)
## Reproducing Results
All benchmarks were run on:
- CPU: AMD Ryzen 9 9950X3D
- GPU: RTX 5090
- PyTorch: 2.9.1+cpu
- Python: 3.13
- CUDA: 12.5
Results may vary based on hardware and PyTorch version.