# BitLinear Performance Benchmarks
This document provides detailed performance analysis of BitLinear compared to standard nn.Linear layers.
## Memory Compression
BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.
### Compression Results
| Layer Size | nn.Linear (MB) | BitLinear Packed (MB) | Compression Ratio |
|---|---|---|---|
| 512×512 | 1.0020 | 0.0539 | 18.59x |
| 768×768 | 2.2529 | 0.1184 | 19.03x |
| 1024×1024 | 4.0039 | 0.2078 | 19.27x |
| 2048×2048 | 16.0078 | 0.8156 | 19.63x |
| 4096×4096 | 64.0156 | 3.2313 | 19.81x |
| 768×3072 | 9.0117 | 0.4734 | 19.03x |
| 1024×4096 | 16.0156 | 0.8313 | 19.27x |
**Average Compression: 19.23x** (96% of the theoretical 20x maximum)
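The ratio climbs toward 20x as layers grow because fixed per-layer overhead (scale factors and the like) amortizes over more weights. A minimal sketch of the estimate, assuming 4 bytes per fp32 weight, 5 ternary values per packed byte, and a hypothetical fp32 overhead term:

```python
# Sketch: estimated compression ratio of base-3-packed ternary weights vs fp32.
# Assumptions: 4 bytes per fp32 weight, 5 ternary values per packed byte, and
# an optional fp32 overhead (scales etc. -- hypothetical, for illustration).

def compression_ratio(in_features: int, out_features: int,
                      overhead_floats: int = 0) -> float:
    n = in_features * out_features
    dense_bytes = 4 * n                  # fp32 weight matrix
    packed_bytes = -(-n // 5)            # ceil(n / 5): one byte per 5 trits
    packed_bytes += 4 * overhead_floats  # per-layer scale storage (assumed)
    return dense_bytes / packed_bytes

print(round(compression_ratio(512, 512), 2))  # -> 20.0 with zero overhead
```

Any nonzero overhead pulls the ratio below 20x, which is why the smallest layer in the table compresses least.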
### Real-World Example: GPT-2 Small

**Configuration:**
- 12 Transformer layers
- d_model = 768
- d_ff = 3072
- Total parameters: 84,934,656
**Memory Usage:**
- nn.Linear: 324.00 MB
- BitLinear (packed): 16.83 MB
- Memory Saved: 307.17 MB
- Compression Ratio: 19.25x
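These figures can be checked by hand: each Transformer layer contributes four d_model×d_model attention projections and two d_model×d_ff feed-forward matrices (counting linear-layer weights only):

```python
# Reproduce the GPT-2 Small linear-layer parameter count quoted above.
d_model, d_ff, n_layers = 768, 3072, 12

per_layer = 4 * d_model * d_model + 2 * d_model * d_ff  # Q, K, V, O + 2 FFN
total = n_layers * per_layer

print(total)                        # -> 84934656
print(round(total * 4 / 2**20, 2))  # fp32 footprint in MB -> 324.0
```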
## Accuracy Analysis
BitLinear maintains high output similarity despite extreme quantization:
### Output Similarity Metrics

From `examples/transformer_example.py` (a Transformer block with 6 linear layers):
- MSE: 0.083
- Cosine Similarity: 0.963 (96.3%)
- Relative Error: 0.279 (27.9%)
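All three metrics can be computed directly from a reference output and a quantized output. The definitions below (MSE, flattened cosine similarity, and relative error as ‖approx − ref‖/‖ref‖) are assumptions about how the example script measures them:

```python
import torch
import torch.nn.functional as F

def similarity_metrics(ref: torch.Tensor, approx: torch.Tensor) -> dict:
    """Compare a quantized layer's output against the float reference."""
    mse = torch.mean((approx - ref) ** 2).item()
    cosine = F.cosine_similarity(ref.flatten(), approx.flatten(), dim=0).item()
    rel_error = (torch.norm(approx - ref) / torch.norm(ref)).item()
    return {"mse": mse, "cosine": cosine, "rel_error": rel_error}

torch.manual_seed(0)
ref = torch.randn(4, 768)                   # stand-in for nn.Linear output
approx = ref + 0.1 * torch.randn_like(ref)  # stand-in for BitLinear output
print(similarity_metrics(ref, approx))
```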
### Multi-Ternary Improvement
Using k=3 ternary components significantly improves accuracy:
- k=1 Relative Error: 0.501
- k=3 Relative Error: 0.124
- Improvement: 75.1%
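One way to realize k components is greedy residual ternarization: each component quantizes whatever error the previous components left behind. The threshold and scale rule below is a common heuristic, not necessarily the library's exact quantizer:

```python
import torch

def ternarize(w: torch.Tensor):
    """One ternary component: threshold at half the mean magnitude
    (a common heuristic; the actual quantizer may differ)."""
    scale = w.abs().mean()
    t = torch.where(w.abs() > 0.5 * scale, torch.sign(w), torch.zeros_like(w))
    return scale, t

def multi_ternary_approx(w: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy residual fit: w ~= sum_i scale_i * t_i over k components."""
    approx = torch.zeros_like(w)
    for _ in range(k):
        scale, t = ternarize(w - approx)
        approx = approx + scale * t
    return approx

torch.manual_seed(0)
w = torch.randn(256, 256)
for k in (1, 2, 3):
    err = (torch.norm(multi_ternary_approx(w, k) - w) / torch.norm(w)).item()
    print(f"k={k}: relative error {err:.3f}")  # error shrinks as k grows
```

Each extra component costs one more packed weight copy, which is the memory trade-off shown in the configuration table below.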
## Performance Characteristics

### Forward Pass Time

**Note:** The current Python implementation may be slower than nn.Linear; the C++/CUDA extensions provide optimized kernels for production use.

The Python implementation prioritizes correctness and clarity. For production deployments:
- Use C++ CPU kernels for CPU inference
- Use CUDA kernels for GPU inference
- Expect 2-5x speedup from ternary-specific optimizations
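A minimal CPU timing harness for comparing forward passes (the shapes and names here are illustrative, not taken from the benchmark scripts):

```python
import time
import torch

def bench_ms(fn, *args, warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock time per call, in milliseconds."""
    for _ in range(warmup):
        fn(*args)                      # warm caches / lazy init
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

x = torch.randn(32, 1024)
linear = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    print(f"nn.Linear forward: {bench_ms(linear, x):.3f} ms")
```

The median is used instead of the mean to keep one-off scheduler hiccups from skewing the result.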
### Memory vs Speed Trade-off
BitLinear offers different configurations for various use cases:
| Configuration | Memory | Accuracy | Speed |
|---|---|---|---|
| BitLinear (k=1) | 19x less | Good | Fast |
| MultiTernaryLinear (k=2) | 9.5x less | Better | Medium |
| MultiTernaryLinear (k=3) | 6.3x less | Best | Slower |
## Packing Efficiency
Base-3 packing achieves near-theoretical compression:
- Theoretical minimum: log₂(3) ≈ 1.585 bits per ternary value
- Actual: 5 ternary values per byte (1.6 bits per value)
- Efficiency: 99.1% of the theoretical maximum
### Packing Details
- Ternary values {-1, 0, +1} mapped to {0, 1, 2}
- 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
- Maximum packed value: 242 < 256 (fits in uint8)
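The digit scheme above can be sketched in a few lines of NumPy (an illustrative pack/unpack pair, not the library's actual API):

```python
import numpy as np

POWERS = np.array([1, 3, 9, 27, 81], dtype=np.uint8)  # 3**0 .. 3**4

def pack_ternary(trits: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into uint8, five per byte (base 3)."""
    digits = (trits + 1).astype(np.uint8)             # {-1,0,+1} -> {0,1,2}
    digits = np.pad(digits, (0, (-len(digits)) % 5))  # pad to a multiple of 5
    return (digits.reshape(-1, 5) * POWERS).sum(axis=1).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary: recover the first n ternary values."""
    digits = (packed[:, None] // POWERS) % 3          # extract base-3 digits
    return digits.reshape(-1)[:n].astype(np.int8) - 1

w = np.random.default_rng(0).integers(-1, 2, size=17).astype(np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w), len(w)), w)
print(pack_ternary(np.array([1, 1, 1, 1, 1])))  # -> [242], the maximum byte
```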
## Use Cases

**Ideal For:**
- Edge Deployment: Reduced memory footprint for mobile/embedded devices
- Large Models: Significant savings for billion-parameter models
- Inference: Production serving with memory constraints
- Research: Exploring ultra-low-precision neural networks
**Considerations:**

- Training: Quantization-aware training (QAT) is required for best results
- Accuracy: A ~3-5% accuracy drop is acceptable for many applications
- Speed: The Python implementation is slower; use the C++/CUDA kernels in production
## Benchmarking

Run the benchmarks yourself:

```bash
# Memory compression analysis
python benchmarks/benchmark_memory.py

# Performance comparison
python benchmarks/benchmark_performance.py
```
## Comparison with Other Methods
| Method | Bits/Weight | Compression | Accuracy | Implementation |
|---|---|---|---|---|
| Float32 | 32 | 1x | Baseline | Standard |
| Float16 | 16 | 2x | ~Baseline | Standard |
| INT8 | 8 | 4x | High | Quantization |
| BitLinear | 1.58 | ~19x | Good | Ternary |
## References
- BitNet Paper: Scaling 1-bit Transformers for Large Language Models
- JMLR Paper: Ternary Representations of Neural Networks
## Reproducing Results
All benchmarks were run on:
- CPU: AMD Ryzen 9 9950X3D
- GPU: RTX 5090
- PyTorch: 2.9.1+cpu
- Python: 3.13
- CUDA: 12.5
Results may vary based on hardware and PyTorch version.