---
language:
- en
license: mit
library_name: pytorch
tags:
- quantization
- model-compression
- bitnet
- ternary-networks
- deep-learning
- pytorch
- cuda
- cpp
- edge-ai
- efficient-ml
- low-precision
- transformer
pipeline_tag: other
---

# BitLinear: Ultra-Low-Precision Linear Layers for PyTorch

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.0+](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

A production-ready PyTorch implementation of **1.58-bit ternary linear layers** that achieves **~19x memory compression** while maintaining high accuracy. Drop-in replacement for `nn.Linear` with optimized C++/CUDA kernels.

## Key Features

- **19.3x Memory Compression** - Near-theoretical maximum (20x)
- **Drop-in Replacement** - Same API as `nn.Linear`
- **Optimized Kernels** - C++ CPU and CUDA GPU implementations
- **Research-Grade** - Based on BitNet and JMLR ternary networks papers
- **Production Ready** - Fully tested with comprehensive benchmarks

## 📊 Performance Highlights

### Memory Compression

Achieves **19.23x average compression** across various layer sizes:

| Layer Size | nn.Linear | BitLinear (Packed) | Compression |
|------------|-----------|--------------------|-------------|
| 512×512    | 1.00 MB   | 0.05 MB            | **18.6x**   |
| 1024×1024  | 4.00 MB   | 0.21 MB            | **19.3x**   |
| 4096×4096  | 64.02 MB  | 3.23 MB            | **19.8x**   |

### Real-World Example: GPT-2 Small

Converting a GPT-2 Small model (12 layers, d_model=768, d_ff=3072):

- **Original:** 324 MB
- **BitLinear:** 16.8 MB
- **Saved:** 307 MB (19.3x compression)

### Accuracy

Maintains high output similarity despite extreme quantization:

- **Cosine Similarity:** 96.3%
- **Relative Error:** ~28%
- **Multi-Ternary (k=3):** 75% error reduction vs k=1

See [BENCHMARKS.md](BENCHMARKS.md) for detailed performance analysis.
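
The figures in the Memory Compression table can be approximated with a little arithmetic. The sketch below is not taken from the package: it assumes the packed layer stores its weights at five ternary values per byte, plus one fp32 scale (gamma) per output channel and an fp32 bias. The helper names `linear_bytes`, `packed_bytes`, and `compression_ratio` are hypothetical and used only for illustration.

```python
BYTES_FP32 = 4      # nn.Linear stores each weight as a 32-bit float
TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five ternary values fit in one byte


def linear_bytes(in_features, out_features, bias=True):
    """Bytes used by a dense fp32 layer (weights plus optional bias)."""
    n_weights = in_features * out_features
    n_bias = out_features if bias else 0
    return BYTES_FP32 * (n_weights + n_bias)


def packed_bytes(in_features, out_features, bias=True):
    """Bytes under the ASSUMED packed layout described in the lead-in:
    base-3 packed weights + fp32 gamma per output channel + fp32 bias."""
    n_weights = in_features * out_features
    # Integer ceiling division: the last byte may hold fewer than five values.
    weight_bytes = (n_weights + TRITS_PER_BYTE - 1) // TRITS_PER_BYTE
    gamma_bytes = BYTES_FP32 * out_features
    bias_bytes = BYTES_FP32 * out_features if bias else 0
    return weight_bytes + gamma_bytes + bias_bytes


def compression_ratio(in_features, out_features, bias=True):
    """Dense fp32 size divided by the assumed packed size."""
    dense = linear_bytes(in_features, out_features, bias)
    packed = packed_bytes(in_features, out_features, bias)
    return dense / packed


for size in (512, 1024, 4096):
    dense_mb = linear_bytes(size, size) / 1024**2
    packed_mb = packed_bytes(size, size) / 1024**2
    ratio = compression_ratio(size, size)
    print(f"{size}x{size}: {dense_mb:.2f} MB -> {packed_mb:.2f} MB ({ratio:.1f}x)")
```

Under these assumptions the ratio climbs toward 32 / 1.6 = 20x as layers grow, because the per-channel scale and bias overhead is amortized over more weights.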

## 🚀 Quick Start

### Installation

```bash
# CPU-only build
pip install -e .

# With CUDA support (requires CUDA toolkit)
CUDA_HOME=/usr/local/cuda pip install -e .
```

### Basic Usage

```python
import torch
from bitlinear import BitLinear

# Create a BitLinear layer (same interface as nn.Linear)
layer = BitLinear(in_features=512, out_features=1024, bias=True)

# Forward pass
x = torch.randn(32, 128, 512)
output = layer(x)  # Same as nn.Linear!

print(f"Weight values: {torch.unique(layer.W_ternary)}")  # [-1, 0, 1]
```

### Converting Existing Models

```python
import torch
import torch.nn as nn
from bitlinear import convert_linear_to_bitlinear

# Convert a pre-trained model
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
model_compressed = convert_linear_to_bitlinear(model, inplace=False)

# Use as normal - all Linear layers are now BitLinear
x = torch.randn(10, 32, 512)
output = model_compressed(x)
```

### Multi-Ternary for Better Accuracy

```python
from bitlinear import MultiTernaryLinear

# Use k=3 components for 75% error reduction
layer = MultiTernaryLinear(in_features=512, out_features=1024, k=3)
```

## 📖 How It Works

BitLinear uses **ternary quantization** to represent weights with only three values: {-1, 0, +1}.

### Architecture

1. **Quantization:** Weights are quantized to {-1, 0, +1} using absmax scaling
2. **Scaling:** Per-output-channel scaling factors (gamma) compensate for quantization
3. **Packing:** Base-3 encoding stores 5 ternary values per byte
4. **Computation:** Optimized kernels exploit ternary structure (no multiplications needed)

### Memory Efficiency

- **Theoretical:** log₂(3) ≈ 1.58 bits per weight
- **Actual:** 1.6 bits per weight (5 values per byte)
- **Efficiency:** ~99% of the theoretical optimum (1.585 / 1.6 ≈ 0.99)

## 📁 Project Structure

```
BitLinear/
├── bitlinear/                    # Main package
│   ├── layers.py                 # BitLinear and MultiTernaryLinear modules
│   ├── functional.py             # Core functional implementations
│   ├── quantization.py           # Ternary quantization utilities
│   ├── packing.py                # Base-3 packing for memory efficiency
│   └── cpp/                      # C++/CUDA extensions
│       ├── bitlinear.cpp         # PyBind11 bindings & CPU kernels
│       └── bitlinear_kernel.cu   # CUDA GPU kernels
├── tests/                        # Comprehensive test suite
├── examples/                     # Usage examples
│   ├── basic_usage.py            # Simple demonstrations
│   └── transformer_example.py    # Transformer integration
├── benchmarks/                   # Performance benchmarks
│   ├── benchmark_memory.py       # Memory analysis
│   └── benchmark_performance.py  # Speed comparison
└── notebooks/                    # Interactive tutorials
    └── demo.md                   # Step-by-step guide
```

## 🧪 Examples

### Example 1: Basic Layer

```python
from bitlinear import BitLinear, estimate_memory_savings

# Create layer
layer = BitLinear(512, 1024)

# Check memory savings
stats = estimate_memory_savings(512, 1024)
print(f"Compression: {stats['compression_ratio']:.1f}x")  # ~19x
```

### Example 2: Transformer Conversion

```python
import torch.nn as nn
from bitlinear import convert_linear_to_bitlinear

# Original transformer
model = nn.TransformerEncoderLayer(d_model=768, nhead=8, dim_feedforward=3072)

# Convert to BitLinear
model_bit = convert_linear_to_bitlinear(model)

# Compare memory
mem_original = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
mem_bitlinear = sum(p.numel() * p.element_size() for p in model_bit.parameters()) / 1024**2
print(f"Memory: {mem_original:.2f} MB → {mem_bitlinear:.2f} MB")
```

Run complete examples:

```bash
python examples/basic_usage.py
python examples/transformer_example.py
```

## 📈 Benchmarks

Run benchmarks to see performance on your hardware:

```bash
# Memory compression analysis
python benchmarks/benchmark_memory.py

# Forward pass performance
python benchmarks/benchmark_performance.py
```

## 🧪 Testing

Comprehensive test suite with 60+ tests:

```bash
# Run all tests
pytest tests/ -v

# Run specific test modules
pytest tests/test_quantization.py -v
pytest tests/test_layers.py -v
```

## 🎓 Research Background

This implementation is based on:

- **BitNet:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- **JMLR:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)

### Key Innovations

1. **Ternary Quantization:** Reduces weights to {-1, 0, +1}
2. **Absmax Scaling:** Per-channel scaling for accuracy
3. **Greedy Decomposition:** Multi-ternary for better approximation
4. **Base-3 Packing:** Near-optimal memory compression

## 🛠️ Implementation Details

### Python Baseline

Pure PyTorch implementation for correctness and clarity:

- `bitlinear_python()` - Reference ternary matmul
- `greedy_ternary_decomposition()` - Multi-component quantization
- Full gradient support for training

### C++ Extensions

Optimized CPU kernels with PyBind11:

- Ternary-specific optimizations (no multiplications)
- Efficient memory access patterns
- Base-3 packing/unpacking

### CUDA Kernels

GPU-accelerated implementation:

- Warp-level reductions using shuffle intrinsics
- Shared memory tiling
- Memory coalescing
- Fused multi-ternary kernels

## 🎯 Use Cases

### Ideal For:

- **Edge Deployment:** Mobile and embedded devices
- **Large Models:** Billion-parameter models with memory constraints
- **Production Inference:** Cost-effective serving at scale
- **Research:** Exploring ultra-low-precision networks

### Considerations:

- **Training:** Best results with quantization-aware training (QAT)
- **Accuracy:** 3-5% accuracy drop typical (acceptable for many tasks)
- **Speed:** Python implementation may be slower; use C++/CUDA for production

## 📚 Documentation

- **[BENCHMARKS.md](BENCHMARKS.md)** - Detailed performance analysis
- **[MODEL_CARD.md](MODEL_CARD.md)** - HuggingFace model card
- **[notebooks/demo.md](notebooks/demo.md)** - Interactive tutorial
- **[read/IMPLEMENTATION_GUIDE.md](read/IMPLEMENTATION_GUIDE.md)** - Implementation details (Note: can be released if needed. I am working on extending the pipeline to support future machine learning research.)

## 🤝 Contributing

Contributions welcome! Areas for improvement:

- AVX/AVX512 vectorization for CPU
- Tensor Core utilization for CUDA
- Additional quantization schemes
- Training examples and tutorials

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 📖 Citation

If you use BitLinear in your research, please cite:

```bibtex
@article{jmlr_ternary_2024,
  title={Ternary Representations of Neural Networks},
  journal={Journal of Machine Learning Research},
  volume={26},
  year={2024},
  url={https://jmlr.org/papers/volume26/24-2050/24-2050.pdf}
}

@article{bitnet2023,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Wang, Huaijie and Ma, Lingxiao and Yang, Fan and Wang, Ruiping and Wu, Yi and Wei, Furu},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}
```

## 🌟 Acknowledgments

This implementation builds upon the groundbreaking work in:

- BitNet by Microsoft Research
- Ternary Neural Networks research (JMLR)
- PyTorch's extensibility framework

## 📞 Contact

For questions, issues, or collaboration:

- Open an issue on GitHub
- Check existing documentation
- Review examples and benchmarks

---

Please tag me if you use this in anything you build. I would love to see what you build with it.

Made with ❤️ for efficient deep learning