| # BitLinear Performance Benchmarks | |
| This document provides detailed performance analysis of BitLinear compared to standard `nn.Linear` layers. | |
| ## Memory Compression | |
| BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing. | |
| ### Compression Results | |
| | Layer Size | nn.Linear (MB) | BitLinear Packed (MB) | Compression Ratio | | |
| |------------|----------------|----------------------|-------------------| | |
| | 512×512 | 1.0020 | 0.0539 | 18.59x | | |
| | 768×768 | 2.2529 | 0.1184 | 19.03x | | |
| | 1024×1024 | 4.0039 | 0.2078 | 19.27x | | |
| | 2048×2048 | 16.0078 | 0.8156 | 19.63x | | |
| | 4096×4096 | 64.0156 | 3.2313 | 19.81x | | |
| | 768×3072 | 9.0117 | 0.4734 | 19.03x | | |
| | 1024×4096 | 16.0156 | 0.8313 | 19.27x | | |
| **Average Compression:** 19.23x (95% of theoretical 20x maximum) | |
| ### Real-World Example: GPT-2 Small | |
| Configuration: | |
| - 12 Transformer layers | |
| - d_model = 768 | |
| - d_ff = 3072 | |
| - Total parameters: 84,934,656 | |
| Memory Usage: | |
| - **nn.Linear:** 324.00 MB | |
| - **BitLinear (packed):** 16.83 MB | |
| - **Memory Saved:** 307.17 MB | |
| - **Compression Ratio:** 19.25x | |
| ## Accuracy Analysis | |
| BitLinear maintains high output similarity despite extreme quantization: | |
| ### Output Similarity Metrics | |
| From `examples/transformer_example.py` (Transformer block with 6 linear layers): | |
| - **MSE:** 0.083 | |
| - **Cosine Similarity:** 0.963 (96.3%) | |
| - **Relative Error:** 0.279 (27.9%) | |
| ### Multi-Ternary Improvement | |
| Using k=3 ternary components significantly improves accuracy: | |
| - **k=1 Relative Error:** 0.501 | |
| - **k=3 Relative Error:** 0.124 | |
| - **Improvement:** 75.1% | |
| ## Performance Characteristics | |
| ### Forward Pass Time | |
| > **Note:** Current Python implementation may be slower than nn.Linear. C++/CUDA extensions provide optimized kernels for production use. | |
| The Python implementation prioritizes correctness and clarity. For production deployments: | |
| - Use C++ CPU kernels for CPU inference | |
| - Use CUDA kernels for GPU inference | |
| - Expect 2-5x speedup from ternary-specific optimizations | |
| ### Memory vs Speed Trade-off | |
| BitLinear offers different configurations for various use cases: | |
| | Configuration | Memory | Accuracy | Speed | | |
| |--------------|--------|----------|-------| | |
| | BitLinear (k=1) | 19x less | Good | Fast | | |
| | MultiTernaryLinear (k=2) | 9.5x less | Better | Medium | | |
| | MultiTernaryLinear (k=3) | 6.3x less | Best | Slower | | |
| ## Packing Efficiency | |
| Base-3 packing achieves near-theoretical compression: | |
| - **Theoretical:** log₂(3) ≈ 1.58 bits per ternary value | |
| - **Actual:** 5 ternary values per byte (1.6 bits per value) | |
| - **Efficiency:** 98.8% of theoretical maximum | |
| ### Packing Details | |
| - Ternary values {-1, 0, +1} mapped to {0, 1, 2} | |
| - 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄ | |
| - Maximum packed value: 242 < 256 (fits in uint8) | |
| ## Use Cases | |
| ### Ideal For: | |
| - **Edge Deployment:** Reduced memory footprint for mobile/embedded devices | |
| - **Large Models:** Significant savings for billion-parameter models | |
| - **Inference:** Production serving with memory constraints | |
| - **Research:** Exploring ultra-low-precision neural networks | |
| ### Considerations: | |
| - **Training:** Requires quantization-aware training (QAT) for best results | |
| - **Accuracy:** ~3-5% accuracy drop acceptable for many applications | |
| - **Speed:** Python implementation slower; use C++/CUDA for production | |
| ## Benchmarking | |
| Run benchmarks yourself: | |
| ```bash | |
| # Memory compression analysis | |
| python benchmarks/benchmark_memory.py | |
| # Performance comparison | |
| python benchmarks/benchmark_performance.py | |
| ``` | |
| ## Comparison with Other Methods | |
| | Method | Bits/Weight | Compression | Accuracy | Implementation | | |
| |--------|-------------|-------------|----------|----------------| | |
| | Float32 | 32 | 1x | Baseline | Standard | | |
| | Float16 | 16 | 2x | ~Baseline | Standard | | |
| | INT8 | 8 | 4x | High | Quantization | | |
| | **BitLinear** | **1.58** | **~19x** | **Good** | **Ternary** | | |
| ## References | |
| - **BitNet Paper:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453) | |
| - **JMLR Paper:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf) | |
| ## Reproducing Results | |
| All benchmarks were run on: | |
| - CPU: AMD Ryzen 9 9950x3d | |
| - GPU: RTX 5090 | |
| - PyTorch: 2.9.1+cpu | |
| - Python: 3.13 | |
| - CUDA: 12.5 | |
| Results may vary based on hardware and PyTorch version. | |