# BitLinear Performance Benchmarks
This document provides detailed performance analysis of BitLinear compared to standard `nn.Linear` layers.
## Memory Compression
BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.
### Compression Results
| Layer Size | nn.Linear (MB) | BitLinear Packed (MB) | Compression Ratio |
|------------|----------------|----------------------|-------------------|
| 512×512 | 1.0020 | 0.0539 | 18.59x |
| 768×768 | 2.2529 | 0.1184 | 19.03x |
| 1024×1024 | 4.0039 | 0.2078 | 19.27x |
| 2048×2048 | 16.0078 | 0.8156 | 19.63x |
| 4096×4096 | 64.0156 | 3.2313 | 19.81x |
| 768×3072 | 9.0117 | 0.4734 | 19.03x |
| 1024×4096 | 16.0156 | 0.8313 | 19.27x |
**Average Compression:** 19.23x (95% of the information-theoretic 32/log₂3 ≈ 20.2x maximum)
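The per-layer ratios can be reproduced with simple arithmetic. A minimal sketch that ignores per-layer metadata (scales, bias), which accounts for the small gap from the measured figures in the table:

```python
import math

def compression_ratio(rows: int, cols: int) -> float:
    """Float32 bytes vs. base-3-packed bytes (5 ternary values per byte).

    Metadata (per-layer scales, bias) is omitted, so results sit slightly
    above the measured figures in the table.
    """
    n = rows * cols
    float32_bytes = 4 * n             # 32 bits per weight
    packed_bytes = math.ceil(n / 5)   # 1.6 bits per weight
    return float32_bytes / packed_bytes

print(f"{compression_ratio(4096, 4096):.2f}x")  # → 20.00x before metadata
```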
### Real-World Example: GPT-2 Small
Configuration:
- 12 Transformer layers
- d_model = 768
- d_ff = 3072
- Total linear-layer parameters (attention + FFN projections only): 84,934,656
Memory Usage:
- **nn.Linear:** 324.00 MB
- **BitLinear (packed):** 16.83 MB
- **Memory Saved:** 307.17 MB
- **Compression Ratio:** 19.25x
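These figures follow directly from the parameter count. A quick sanity check (pure packing only, so the packed size lands a little under the measured 16.83 MB, which also includes scale/bias metadata):

```python
# Linear-layer parameters per block: 4 attention projections (768x768)
# plus 2 feed-forward matrices (768x3072).
n_layers, d_model, d_ff = 12, 768, 3072
params = n_layers * (4 * d_model**2 + 2 * d_model * d_ff)
print(f"{params:,}")                    # 84,934,656

fp32_mb = params * 4 / 2**20            # 4 bytes per float32 weight, in MiB
packed_mb = (params / 5) / 2**20        # 5 ternary weights per byte
print(f"{fp32_mb:.2f} MB -> {packed_mb:.2f} MB")  # 324.00 MB -> 16.20 MB
```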
## Accuracy Analysis
BitLinear maintains high output similarity despite extreme quantization:
### Output Similarity Metrics
From `examples/transformer_example.py` (Transformer block with 6 linear layers):
- **MSE:** 0.083
- **Cosine Similarity:** 0.963 (96.3%)
- **Relative Error:** 0.279 (27.9%)
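The three metrics can be computed from any pair of output tensors. A sketch in which `y_ref` and `y_bit` are placeholders for the float and BitLinear outputs produced by the example script:

```python
import torch

def similarity_metrics(y_ref: torch.Tensor, y_bit: torch.Tensor) -> dict:
    """MSE, cosine similarity, and relative (L2) error between two outputs."""
    mse = torch.mean((y_ref - y_bit) ** 2).item()
    cosine = torch.nn.functional.cosine_similarity(
        y_ref.flatten(), y_bit.flatten(), dim=0
    ).item()
    rel_err = (torch.norm(y_ref - y_bit) / torch.norm(y_ref)).item()
    return {"mse": mse, "cosine": cosine, "relative_error": rel_err}

# Identical outputs give the ideal values (MSE 0, cosine 1, error 0):
y = torch.randn(8, 768)
print(similarity_metrics(y, y))
```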
### Multi-Ternary Improvement
Using k=3 ternary components significantly improves accuracy:
- **k=1 Relative Error:** 0.501
- **k=3 Relative Error:** 0.124
- **Improvement:** 75.1%
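One standard way to obtain k ternary components is greedy residual decomposition: ternarize the weights, subtract the reconstruction, and ternarize what remains. A sketch of the idea (using the common 0.7·mean|w| threshold heuristic; the actual `MultiTernaryLinear` internals may differ):

```python
import torch

def ternarize(w: torch.Tensor):
    """One ternary component: scale * sign(w) on entries above a threshold."""
    thresh = 0.7 * w.abs().mean()               # common magnitude heuristic
    mask = w.abs() > thresh
    t = torch.sign(w) * mask
    scale = w.abs()[mask].mean() if mask.any() else w.new_tensor(0.0)
    return scale, t

def multi_ternary(w: torch.Tensor, k: int = 3):
    """Approximate w as sum_i scale_i * T_i by ternarizing residuals."""
    residual, components = w.clone(), []
    for _ in range(k):
        scale, t = ternarize(residual)
        components.append((scale, t))
        residual = residual - scale * t
    return components

torch.manual_seed(0)
w = torch.randn(64, 64)
for k in (1, 3):
    approx = sum(s * t for s, t in multi_ternary(w, k))
    rel = (torch.norm(w - approx) / torch.norm(w)).item()
    print(f"k={k}: relative error {rel:.3f}")   # error shrinks as k grows
```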
## Performance Characteristics
### Forward Pass Time
> **Note:** Current Python implementation may be slower than nn.Linear. C++/CUDA extensions provide optimized kernels for production use.
The Python implementation prioritizes correctness and clarity. For production deployments:
- Use C++ CPU kernels for CPU inference
- Use CUDA kernels for GPU inference
- Expect 2-5x speedup from ternary-specific optimizations
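A simple way to measure the gap on your own hardware is a wall-clock harness like the one below (generic timing code; substitute the package's actual `BitLinear` import for the commented placeholder):

```python
import time
import torch

def time_forward(layer: torch.nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Median forward-pass wall-clock time in milliseconds."""
    with torch.no_grad():
        layer(x)                                  # warm-up
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            layer(x)
            samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[len(samples) // 2]

x = torch.randn(32, 1024)
baseline = torch.nn.Linear(1024, 1024)
print(f"nn.Linear: {time_forward(baseline, x):.3f} ms")
# Swap in the quantized layer to compare, e.g.:
# quantized = BitLinear(1024, 1024)  # import path depends on the package
# print(f"BitLinear: {time_forward(quantized, x):.3f} ms")
```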
### Memory vs Speed Trade-off
BitLinear offers different configurations for various use cases:
| Configuration | Memory | Accuracy | Speed |
|--------------|--------|----------|-------|
| BitLinear (k=1) | 19x less | Good | Fast |
| MultiTernaryLinear (k=2) | 9.5x less | Better | Medium |
| MultiTernaryLinear (k=3) | 6.3x less | Best | Slower |
## Packing Efficiency
Base-3 packing achieves near-theoretical compression:
- **Theoretical:** log₂(3) ≈ 1.585 bits per ternary value
- **Actual:** 5 ternary values per byte (1.6 bits per value)
- **Efficiency:** ~99% of the theoretical maximum
### Packing Details
- Ternary values {-1, 0, +1} mapped to {0, 1, 2}
- 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
- Maximum packed value: 242 < 256 (fits in uint8)
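The scheme in the list above can be sketched in a few lines of pure Python (the library's packed layout may differ in tail handling when the weight count is not a multiple of 5):

```python
def pack5(values):
    """Pack 5 ternary values {-1, 0, +1} into one byte via base-3 digits."""
    assert len(values) == 5 and all(v in (-1, 0, 1) for v in values)
    byte = 0
    for i, v in enumerate(values):
        byte += (v + 1) * 3**i     # map {-1,0,+1} -> {0,1,2}, weight by 3^i
    return byte                    # at most 2*(1+3+9+27+81) = 242 < 256

def unpack5(byte):
    """Inverse of pack5: recover 5 ternary values from one byte."""
    values = []
    for _ in range(5):
        values.append(byte % 3 - 1)  # base-3 digit back to {-1, 0, +1}
        byte //= 3
    return values

vals = [-1, 0, 1, 1, -1]
assert unpack5(pack5(vals)) == vals      # round-trip is lossless
print(pack5([1, 1, 1, 1, 1]))            # → 242, the maximum packed value
```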
## Use Cases
### Ideal For:
- **Edge Deployment:** Reduced memory footprint for mobile/embedded devices
- **Large Models:** Significant savings for billion-parameter models
- **Inference:** Production serving with memory constraints
- **Research:** Exploring ultra-low-precision neural networks
### Considerations:
- **Training:** Requires quantization-aware training (QAT) for best results
- **Accuracy:** expect a ~3-5% accuracy drop, acceptable for many applications
- **Speed:** Python implementation slower; use C++/CUDA for production
## Benchmarking
Run benchmarks yourself:
```bash
# Memory compression analysis
python benchmarks/benchmark_memory.py
# Performance comparison
python benchmarks/benchmark_performance.py
```
## Comparison with Other Methods
| Method | Bits/Weight | Compression | Accuracy | Implementation |
|--------|-------------|-------------|----------|----------------|
| Float32 | 32 | 1x | Baseline | Standard |
| Float16 | 16 | 2x | ~Baseline | Standard |
| INT8 | 8 | 4x | High | Quantization |
| **BitLinear** | **1.58** | **~19x** | **Good** | **Ternary** |
## References
- **BitNet Paper:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- **JMLR Paper:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)
## Reproducing Results
All benchmarks were run on:
- CPU: AMD Ryzen 9 9950X3D
- GPU: RTX 5090
- PyTorch: 2.9.1+cpu
- Python: 3.13
- CUDA: 12.5
Results may vary based on hardware and PyTorch version.