BitLinear / BENCHMARKS.md
krisaujla's picture
Upload folder using huggingface_hub
fd8c8b9 verified

BitLinear Performance Benchmarks

This document provides detailed performance analysis of BitLinear compared to standard nn.Linear layers.

Memory Compression

BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.

Compression Results

Layer Size nn.Linear (MB) BitLinear Packed (MB) Compression Ratio
512×512 1.0020 0.0539 18.59x
768×768 2.2529 0.1184 19.03x
1024×1024 4.0039 0.2078 19.27x
2048×2048 16.0078 0.8156 19.63x
4096×4096 64.0156 3.2313 19.81x
768×3072 9.0117 0.4734 19.03x
1024×4096 16.0156 0.8313 19.27x

Average Compression: 19.23x (95% of theoretical 20x maximum)

Real-World Example: GPT-2 Small

Configuration:

  • 12 Transformer layers
  • d_model = 768
  • d_ff = 3072
  • Total parameters: 84,934,656

Memory Usage:

  • nn.Linear: 324.00 MB
  • BitLinear (packed): 16.83 MB
  • Memory Saved: 307.17 MB
  • Compression Ratio: 19.25x

Accuracy Analysis

BitLinear maintains high output similarity despite extreme quantization:

Output Similarity Metrics

From examples/transformer_example.py (Transformer block with 6 linear layers):

  • MSE: 0.083
  • Cosine Similarity: 0.963 (96.3%)
  • Relative Error: 0.279 (27.9%)

Multi-Ternary Improvement

Using k=3 ternary components significantly improves accuracy:

  • k=1 Relative Error: 0.501
  • k=3 Relative Error: 0.124
  • Improvement: 75.1%

Performance Characteristics

Forward Pass Time

Note: Current Python implementation may be slower than nn.Linear. C++/CUDA extensions provide optimized kernels for production use.

The Python implementation prioritizes correctness and clarity. For production deployments:

  • Use C++ CPU kernels for CPU inference
  • Use CUDA kernels for GPU inference
  • Expect 2-5x speedup from ternary-specific optimizations

Memory vs Speed Trade-off

BitLinear offers different configurations for various use cases:

Configuration Memory Accuracy Speed
BitLinear (k=1) 19x less Good Fast
MultiTernaryLinear (k=2) 9.5x less Better Medium
MultiTernaryLinear (k=3) 6.3x less Best Slower

Packing Efficiency

Base-3 packing achieves near-theoretical compression:

  • Theoretical: log₂(3) ≈ 1.58 bits per ternary value
  • Actual: 5 ternary values per byte (1.6 bits per value)
  • Efficiency: 98.8% of theoretical maximum

Packing Details

  • Ternary values {-1, 0, +1} mapped to {0, 1, 2}
  • 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
  • Maximum packed value: 242 < 256 (fits in uint8)

Use Cases

Ideal For:

  • Edge Deployment: Reduced memory footprint for mobile/embedded devices
  • Large Models: Significant savings for billion-parameter models
  • Inference: Production serving with memory constraints
  • Research: Exploring ultra-low-precision neural networks

Considerations:

  • Training: Requires quantization-aware training (QAT) for best results
  • Accuracy: ~3-5% accuracy drop acceptable for many applications
  • Speed: Python implementation slower; use C++/CUDA for production

Benchmarking

Run benchmarks yourself:

# Memory compression analysis
python benchmarks/benchmark_memory.py

# Performance comparison
python benchmarks/benchmark_performance.py

Comparison with Other Methods

Method Bits/Weight Compression Accuracy Implementation
Float32 32 1x Baseline Standard
Float16 16 2x ~Baseline Standard
INT8 8 4x High Quantization
BitLinear 1.58 ~19x Good Ternary

References

Reproducing Results

All benchmarks were run on:

  • CPU: AMD Ryzen 9 9950x3d
  • GPU: RTX 5090
  • PyTorch: 2.9.1+cpu
  • Python: 3.13
  • CUDA: 12.5

Results may vary based on hardware and PyTorch version.