# BitLinear Performance Benchmarks
This document provides detailed performance analysis of BitLinear compared to standard nn.Linear layers.
## Memory Compression
BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.
### Compression Results
| Layer Size | nn.Linear (MB) | BitLinear Packed (MB) | Compression Ratio |
|---|---|---|---|
| 512×512 | 1.0020 | 0.0539 | 18.59x |
| 768×768 | 2.2529 | 0.1184 | 19.03x |
| 1024×1024 | 4.0039 | 0.2078 | 19.27x |
| 2048×2048 | 16.0078 | 0.8156 | 19.63x |
| 4096×4096 | 64.0156 | 3.2313 | 19.81x |
| 768×3072 | 9.0117 | 0.4734 | 19.03x |
| 1024×4096 | 16.0156 | 0.8313 | 19.27x |
**Average Compression: 19.23x** (96% of the theoretical 20x maximum)
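The ratio climbs toward 20x as layers grow because fixed per-layer overhead (scale factors and the like) amortizes over more weights. A minimal sketch of the estimate, assuming 4 bytes per fp32 weight, 5 ternary values per packed byte, and a hypothetical fp32 overhead term:

```python
# Sketch: estimated compression ratio of base-3-packed ternary weights vs fp32.
# Assumptions: 4 bytes per fp32 weight, 5 ternary values per packed byte, and
# an optional fp32 overhead (scales etc. -- hypothetical, for illustration).

def compression_ratio(in_features: int, out_features: int,
                      overhead_floats: int = 0) -> float:
    n = in_features * out_features
    dense_bytes = 4 * n                  # fp32 weight matrix
    packed_bytes = -(-n // 5)            # ceil(n / 5): one byte per 5 trits
    packed_bytes += 4 * overhead_floats  # per-layer scale storage (assumed)
    return dense_bytes / packed_bytes

print(round(compression_ratio(512, 512), 2))  # -> 20.0 with zero overhead
```

Any nonzero overhead pulls the ratio below 20x, which is why the smallest layer in the table compresses least.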
### Real-World Example: GPT-2 Small

**Configuration:**
- 12 Transformer layers
- d_model = 768
- d_ff = 3072
- Total parameters: 84,934,656
**Memory Usage:**
- nn.Linear: 324.00 MB
- BitLinear (packed): 16.83 MB
- Memory Saved: 307.17 MB
- Compression Ratio: 19.25x
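These figures can be checked by hand: each Transformer layer contributes four d_model×d_model attention projections and two d_model×d_ff feed-forward matrices (counting linear-layer weights only):

```python
# Reproduce the GPT-2 Small linear-layer parameter count quoted above.
d_model, d_ff, n_layers = 768, 3072, 12

per_layer = 4 * d_model * d_model + 2 * d_model * d_ff  # Q, K, V, O + 2 FFN
total = n_layers * per_layer

print(total)                        # -> 84934656
print(round(total * 4 / 2**20, 2))  # fp32 footprint in MB -> 324.0
```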
## Accuracy Analysis
BitLinear maintains high output similarity despite extreme quantization:
### Output Similarity Metrics

From `examples/transformer_example.py` (a Transformer block with 6 linear layers):
- MSE: 0.083
- Cosine Similarity: 0.963 (96.3%)
- Relative Error: 0.279 (27.9%)
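All three metrics can be computed directly from a reference output and a quantized output. The definitions below (MSE, flattened cosine similarity, and relative error as ‖approx − ref‖/‖ref‖) are assumptions about how the example script measures them:

```python
import torch
import torch.nn.functional as F

def similarity_metrics(ref: torch.Tensor, approx: torch.Tensor) -> dict:
    """Compare a quantized layer's output against the float reference."""
    mse = torch.mean((approx - ref) ** 2).item()
    cosine = F.cosine_similarity(ref.flatten(), approx.flatten(), dim=0).item()
    rel_error = (torch.norm(approx - ref) / torch.norm(ref)).item()
    return {"mse": mse, "cosine": cosine, "rel_error": rel_error}

torch.manual_seed(0)
ref = torch.randn(4, 768)                   # stand-in for nn.Linear output
approx = ref + 0.1 * torch.randn_like(ref)  # stand-in for BitLinear output
print(similarity_metrics(ref, approx))
```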
### Multi-Ternary Improvement
Using k=3 ternary components significantly improves accuracy:
- k=1 Relative Error: 0.501
- k=3 Relative Error: 0.124
- Improvement: 75.1%
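One way to realize k components is greedy residual ternarization: each component quantizes whatever error the previous components left behind. The threshold and scale rule below is a common heuristic, not necessarily the library's exact quantizer:

```python
import torch

def ternarize(w: torch.Tensor):
    """One ternary component: threshold at half the mean magnitude
    (a common heuristic; the actual quantizer may differ)."""
    scale = w.abs().mean()
    t = torch.where(w.abs() > 0.5 * scale, torch.sign(w), torch.zeros_like(w))
    return scale, t

def multi_ternary_approx(w: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy residual fit: w ~= sum_i scale_i * t_i over k components."""
    approx = torch.zeros_like(w)
    for _ in range(k):
        scale, t = ternarize(w - approx)
        approx = approx + scale * t
    return approx

torch.manual_seed(0)
w = torch.randn(256, 256)
for k in (1, 2, 3):
    err = (torch.norm(multi_ternary_approx(w, k) - w) / torch.norm(w)).item()
    print(f"k={k}: relative error {err:.3f}")  # error shrinks as k grows
```

Each extra component costs one more packed weight copy, which is the memory trade-off shown in the configuration table below.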
## Performance Characteristics

### Forward Pass Time

**Note:** The current Python implementation may be slower than nn.Linear; the C++/CUDA extensions provide optimized kernels for production use.

The Python implementation prioritizes correctness and clarity. For production deployments:
- Use C++ CPU kernels for CPU inference
- Use CUDA kernels for GPU inference
- Expect 2-5x speedup from ternary-specific optimizations
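A minimal CPU timing harness for comparing forward passes (the shapes and names here are illustrative, not taken from the benchmark scripts):

```python
import time
import torch

def bench_ms(fn, *args, warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock time per call, in milliseconds."""
    for _ in range(warmup):
        fn(*args)                      # warm caches / lazy init
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

x = torch.randn(32, 1024)
linear = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    print(f"nn.Linear forward: {bench_ms(linear, x):.3f} ms")
```

The median is used instead of the mean to keep one-off scheduler hiccups from skewing the result.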
### Memory vs Speed Trade-off
BitLinear offers different configurations for various use cases:
| Configuration | Memory | Accuracy | Speed |
|---|---|---|---|
| BitLinear (k=1) | 19x less | Good | Fast |
| MultiTernaryLinear (k=2) | 9.5x less | Better | Medium |
| MultiTernaryLinear (k=3) | 6.3x less | Best | Slower |
## Packing Efficiency
Base-3 packing achieves near-theoretical compression:
- Theoretical minimum: log₂(3) ≈ 1.585 bits per ternary value
- Actual: 5 ternary values per byte (1.6 bits per value)
- Efficiency: 99.1% of the theoretical maximum
### Packing Details
- Ternary values {-1, 0, +1} mapped to {0, 1, 2}
- 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
- Maximum packed value: 242 < 256 (fits in uint8)
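The digit scheme above can be sketched in a few lines of NumPy (an illustrative pack/unpack pair, not the library's actual API):

```python
import numpy as np

POWERS = np.array([1, 3, 9, 27, 81], dtype=np.uint8)  # 3**0 .. 3**4

def pack_ternary(trits: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into uint8, five per byte (base 3)."""
    digits = (trits + 1).astype(np.uint8)             # {-1,0,+1} -> {0,1,2}
    digits = np.pad(digits, (0, (-len(digits)) % 5))  # pad to a multiple of 5
    return (digits.reshape(-1, 5) * POWERS).sum(axis=1).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary: recover the first n ternary values."""
    digits = (packed[:, None] // POWERS) % 3          # extract base-3 digits
    return digits.reshape(-1)[:n].astype(np.int8) - 1

w = np.random.default_rng(0).integers(-1, 2, size=17).astype(np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w), len(w)), w)
print(pack_ternary(np.array([1, 1, 1, 1, 1])))  # -> [242], the maximum byte
```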
## Use Cases

**Ideal For:**
- Edge Deployment: Reduced memory footprint for mobile/embedded devices
- Large Models: Significant savings for billion-parameter models
- Inference: Production serving with memory constraints
- Research: Exploring ultra-low-precision neural networks
**Considerations:**

- Training: Quantization-aware training (QAT) is required for best results
- Accuracy: A ~3-5% accuracy drop is acceptable for many applications
- Speed: The Python implementation is slower; use the C++/CUDA kernels in production
## Benchmarking

Run the benchmarks yourself:

```bash
# Memory compression analysis
python benchmarks/benchmark_memory.py

# Performance comparison
python benchmarks/benchmark_performance.py
```
## Comparison with Other Methods
| Method | Bits/Weight | Compression | Accuracy | Implementation |
|---|---|---|---|---|
| Float32 | 32 | 1x | Baseline | Standard |
| Float16 | 16 | 2x | ~Baseline | Standard |
| INT8 | 8 | 4x | High | Quantization |
| BitLinear | 1.58 | ~19x | Good | Ternary |
## References
- BitNet Paper: Scaling 1-bit Transformers for Large Language Models
- JMLR Paper: Ternary Representations of Neural Networks
## Reproducing Results
All benchmarks were run on:
- CPU: AMD Ryzen 9 9950X3D
- GPU: RTX 5090
- PyTorch: 2.9.1+cpu
- Python: 3.13
- CUDA: 12.5
Results may vary based on hardware and PyTorch version.