BENCHMARKS.md · krisaujla/BitLinear at main

BitLinear / BENCHMARKS.md

krisaujla

Upload folder using huggingface_hub

fd8c8b9 verified 18 days ago

preview code

raw

history blame contribute delete

4.69 kB

	# BitLinear Performance Benchmarks

	This document provides detailed performance analysis of BitLinear compared to standard `nn.Linear` layers.

	## Memory Compression

	BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.

	### Compression Results

	\| Layer Size \| nn.Linear (MB) \| BitLinear Packed (MB) \| Compression Ratio \|
	\|------------\|----------------\|----------------------\|-------------------\|
	\| 512×512 \| 1.0020 \| 0.0539 \| 18.59x \|
	\| 768×768 \| 2.2529 \| 0.1184 \| 19.03x \|
	\| 1024×1024 \| 4.0039 \| 0.2078 \| 19.27x \|
	\| 2048×2048 \| 16.0078 \| 0.8156 \| 19.63x \|
	\| 4096×4096 \| 64.0156 \| 3.2313 \| 19.81x \|
	\| 768×3072 \| 9.0117 \| 0.4734 \| 19.03x \|
	\| 1024×4096 \| 16.0156 \| 0.8313 \| 19.27x \|

	Average Compression: 19.23x (95% of theoretical 20x maximum)

	### Real-World Example: GPT-2 Small

	Configuration:
	- 12 Transformer layers
	- d_model = 768
	- d_ff = 3072
	- Total parameters: 84,934,656

	Memory Usage:
	- nn.Linear: 324.00 MB
	- BitLinear (packed): 16.83 MB
	- Memory Saved: 307.17 MB
	- Compression Ratio: 19.25x

	## Accuracy Analysis

	BitLinear maintains high output similarity despite extreme quantization:

	### Output Similarity Metrics

	From `examples/transformer_example.py` (Transformer block with 6 linear layers):

	- MSE: 0.083
	- Cosine Similarity: 0.963 (96.3%)
	- Relative Error: 0.279 (27.9%)

	### Multi-Ternary Improvement

	Using k=3 ternary components significantly improves accuracy:

	- k=1 Relative Error: 0.501
	- k=3 Relative Error: 0.124
	- Improvement: 75.1%

	## Performance Characteristics

	### Forward Pass Time

	> Note: Current Python implementation may be slower than nn.Linear. C++/CUDA extensions provide optimized kernels for production use.

	The Python implementation prioritizes correctness and clarity. For production deployments:
	- Use C++ CPU kernels for CPU inference
	- Use CUDA kernels for GPU inference
	- Expect 2-5x speedup from ternary-specific optimizations

	### Memory vs Speed Trade-off

	BitLinear offers different configurations for various use cases:

	\| Configuration \| Memory \| Accuracy \| Speed \|
	\|--------------\|--------\|----------\|-------\|
	\| BitLinear (k=1) \| 19x less \| Good \| Fast \|
	\| MultiTernaryLinear (k=2) \| 9.5x less \| Better \| Medium \|
	\| MultiTernaryLinear (k=3) \| 6.3x less \| Best \| Slower \|

	## Packing Efficiency

	Base-3 packing achieves near-theoretical compression:

	- Theoretical: log₂(3) ≈ 1.58 bits per ternary value
	- Actual: 5 ternary values per byte (1.6 bits per value)
	- Efficiency: 98.8% of theoretical maximum

	### Packing Details

	- Ternary values {-1, 0, +1} mapped to {0, 1, 2}
	- 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
	- Maximum packed value: 242 < 256 (fits in uint8)

	## Use Cases

	### Ideal For:
	- Edge Deployment: Reduced memory footprint for mobile/embedded devices
	- Large Models: Significant savings for billion-parameter models
	- Inference: Production serving with memory constraints
	- Research: Exploring ultra-low-precision neural networks

	### Considerations:
	- Training: Requires quantization-aware training (QAT) for best results
	- Accuracy: ~3-5% accuracy drop acceptable for many applications
	- Speed: Python implementation slower; use C++/CUDA for production

	## Benchmarking

	Run benchmarks yourself:

	```bash
	# Memory compression analysis
	python benchmarks/benchmark_memory.py

	# Performance comparison
	python benchmarks/benchmark_performance.py
	```

	## Comparison with Other Methods

	\| Method \| Bits/Weight \| Compression \| Accuracy \| Implementation \|
	\|--------\|-------------\|-------------\|----------\|----------------\|
	\| Float32 \| 32 \| 1x \| Baseline \| Standard \|
	\| Float16 \| 16 \| 2x \| ~Baseline \| Standard \|
	\| INT8 \| 8 \| 4x \| High \| Quantization \|
	\| BitLinear \| 1.58 \| ~19x \| Good \| Ternary \|

	## References

	- BitNet Paper: [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
	- JMLR Paper: [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)

	## Reproducing Results

	All benchmarks were run on:
	- CPU: AMD Ryzen 9 9950x3d
	- GPU: RTX 5090
	- PyTorch: 2.9.1+cpu
	- Python: 3.13
	- CUDA: 12.5

	Results may vary based on hardware and PyTorch version.