---
language:
- en
license: mit
library_name: pytorch
tags:
- quantization
- model-compression
- bitnet
- ternary-networks
- deep-learning
- pytorch
- cuda
- cpp
- edge-ai
- efficient-ml
- low-precision
- transformer
pipeline_tag: other
---
# BitLinear: Ultra-Low-Precision Linear Layers for PyTorch
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/badge/python-3-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-framework-orange.svg)](https://pytorch.org/)
A production-ready PyTorch implementation of **1.58-bit ternary linear layers** that achieves **~19x memory compression** while maintaining high accuracy. Drop-in replacement for `nn.Linear` with optimized C++/CUDA kernels.
## Key Features
- **19.3x Memory Compression** - Near-theoretical maximum (20x)
- **Drop-in Replacement** - Same API as `nn.Linear`
- **Optimized Kernels** - C++ CPU and CUDA GPU implementations
- **Research-Grade** - Based on BitNet and JMLR ternary networks papers
- **Production Ready** - Fully tested with comprehensive benchmarks
## Performance Highlights
### Memory Compression
Achieves **19.23x average compression** across various layer sizes:
| Layer Size | nn.Linear | BitLinear (Packed) | Compression |
|------------|-----------|-------------------|-------------|
| 512×512 | 1.00 MB | 0.05 MB | **18.6x** |
| 1024×1024 | 4.00 MB | 0.21 MB | **19.3x** |
| 4096×4096 | 64.02 MB | 3.23 MB | **19.8x** |
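These ratios can be reproduced from first principles. Below is a minimal sketch, assuming flat base-3 packing (5 weights per byte) plus one fp32 scale and one fp32 bias per output channel; the repo's `estimate_memory_savings` may account for storage slightly differently:

```python
import math

def compression_ratio(in_features: int, out_features: int) -> float:
    # fp32 nn.Linear: weight matrix + bias vector, 4 bytes per value
    fp32_bytes = (in_features * out_features + out_features) * 4
    # BitLinear: 5 ternary weights packed per byte,
    # plus one fp32 gamma and one fp32 bias per output channel
    packed_bytes = math.ceil(in_features * out_features / 5)
    overhead_bytes = out_features * 4 * 2
    return fp32_bytes / (packed_bytes + overhead_bytes)

for n in (512, 1024, 4096):
    print(f"{n}x{n}: {compression_ratio(n, n):.1f}x")  # 18.6x, 19.3x, 19.8x
```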
### Real-World Example: GPT-2 Small
Converting a GPT-2 Small model (12 layers, d_model=768, d_ff=3072):
- **Original:** 324 MB
- **BitLinear:** 16.8 MB
- **Saved:** 307 MB (19.3x compression)
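As a sanity check, these figures follow from counting only the linear-layer weights in each transformer block (four d_model×d_model attention projections plus the two MLP matrices). A rough sketch, ignoring the small fp32 scale/bias overhead that makes up the remaining fraction of a megabyte:

```python
# GPT-2 Small linear layers: per block, 4 attention projections (768x768)
# and 2 MLP matrices (768x3072 and 3072x768)
d_model, d_ff, n_layers = 768, 3072, 12
weights = n_layers * (4 * d_model**2 + 2 * d_model * d_ff)  # ~84.9M weights

print(f"fp32:   {weights * 4 / 1024**2:.0f} MB")  # 324 MB
print(f"packed: {weights / 5 / 1024**2:.1f} MB")  # ~16.2 MB (+ scales/biases -> ~16.8 MB)
```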
### Accuracy
Maintains high output similarity despite extreme quantization:
- **Cosine Similarity:** 96.3%
- **Relative Error:** ~28%
- **Multi-Ternary (k=3):** 75% error reduction vs k=1
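These metrics are easy to reproduce for any layer size. A minimal sketch; the layer is wrapped in `nn.Sequential` on the assumption that `convert_linear_to_bitlinear` walks a module's children:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from bitlinear import convert_linear_to_bitlinear

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(512, 1024))
model_bit = convert_linear_to_bitlinear(copy.deepcopy(model), inplace=False)

x = torch.randn(64, 512)
y_fp, y_bit = model(x), model_bit(x)

cos = F.cosine_similarity(y_fp.flatten(), y_bit.flatten(), dim=0)
rel_err = (y_fp - y_bit).norm() / y_fp.norm()
print(f"cosine similarity: {cos:.3f}  relative error: {rel_err:.3f}")
```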
See [BENCHMARKS.md](BENCHMARKS.md) for detailed performance analysis.
## Quick Start
### Installation
```bash
# CPU-only build
pip install -e .
# With CUDA support (requires CUDA toolkit)
CUDA_HOME=/usr/local/cuda pip install -e .
```
### Basic Usage
```python
import torch
from bitlinear import BitLinear
# Create a BitLinear layer (same interface as nn.Linear)
layer = BitLinear(in_features=512, out_features=1024, bias=True)
# Forward pass
x = torch.randn(32, 128, 512)
output = layer(x) # Same as nn.Linear!
print(f"Weight values: {torch.unique(layer.W_ternary)}") # [-1, 0, 1]
```
### Converting Existing Models
```python
import torch.nn as nn
from bitlinear import convert_linear_to_bitlinear
# Convert a pre-trained model
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
model_compressed = convert_linear_to_bitlinear(model, inplace=False)
# Use as normal - all Linear layers are now BitLinear
x = torch.randn(10, 32, 512)
output = model_compressed(x)
```
### Multi-Ternary for Better Accuracy
```python
from bitlinear import MultiTernaryLinear
# Use k=3 components for 75% error reduction
layer = MultiTernaryLinear(in_features=512, out_features=1024, k=3)
```
## How It Works
BitLinear uses **ternary quantization** to represent weights with only three values: {-1, 0, +1}.
### Architecture
1. **Quantization:** Weights quantized to {-1, 0, +1} using absmax scaling (see the sketch after this list)
2. **Scaling:** Per-output-channel scaling factors (gamma) compensate for quantization
3. **Packing:** Base-3 encoding stores 5 ternary values per byte
4. **Computation:** Optimized kernels exploit ternary structure (no multiplications needed)
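A minimal sketch of steps 1-2, using the mean-absolute-value scale popularized by the BitNet b1.58 line of work; the exact scale and zero-threshold rule in this repo's `quantization.py` may differ:

```python
import torch

def absmean_ternary(w: torch.Tensor):
    """Quantize each row of w to {-1, 0, +1} with a per-output-channel scale."""
    gamma = w.abs().mean(dim=1, keepdim=True).clamp(min=1e-8)  # per-row scale
    w_ternary = torch.clamp(torch.round(w / gamma), -1, 1)     # ternarize
    return w_ternary.to(torch.int8), gamma

w = torch.randn(1024, 512)
w_t, gamma = absmean_ternary(w)
print(torch.unique(w_t))        # tensor([-1, 0, 1], dtype=torch.int8)
w_approx = gamma * w_t.float()  # dequantized approximation of w
```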
### Memory Efficiency
- **Theoretical:** log₂(3) ≈ 1.58 bits per weight
- **Actual:** 1.6 bits per weight (5 values per byte)
- **Efficiency:** 98.8% of theoretical maximum
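A pure-PyTorch sketch of the base-3 packing scheme; the exact byte layout used by `packing.py` is an assumption here:

```python
import torch

POWERS = torch.tensor([1, 3, 9, 27, 81], dtype=torch.int64)

def pack_base3(w_ternary: torch.Tensor) -> torch.Tensor:
    """Pack 5 ternary values per byte: digits in {0,1,2}, byte = sum d_i * 3^i."""
    digits = w_ternary.flatten().long() + 1                # {-1,0,1} -> {0,1,2}
    pad = (-digits.numel()) % 5                            # pad to a multiple of 5
    digits = torch.cat([digits, digits.new_zeros(pad)]).view(-1, 5)
    return (digits * POWERS).sum(dim=1).to(torch.uint8)    # max value 242 < 256

def unpack_base3(packed: torch.Tensor, numel: int) -> torch.Tensor:
    digits = (packed.long().unsqueeze(1) // POWERS) % 3    # recover base-3 digits
    return (digits.flatten()[:numel] - 1).to(torch.int8)   # {0,1,2} -> {-1,0,1}

w_t = torch.randint(-1, 2, (1000,), dtype=torch.int8)
assert torch.equal(unpack_base3(pack_base3(w_t), w_t.numel()), w_t)  # roundtrip
```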
## Project Structure
```
BitLinear/
├── bitlinear/                     # Main package
│   ├── layers.py                  # BitLinear and MultiTernaryLinear modules
│   ├── functional.py              # Core functional implementations
│   ├── quantization.py            # Ternary quantization utilities
│   ├── packing.py                 # Base-3 packing for memory efficiency
│   └── cpp/                       # C++/CUDA extensions
│       ├── bitlinear.cpp          # PyBind11 bindings & CPU kernels
│       └── bitlinear_kernel.cu    # CUDA GPU kernels
├── tests/                         # Comprehensive test suite
├── examples/                      # Usage examples
│   ├── basic_usage.py             # Simple demonstrations
│   └── transformer_example.py     # Transformer integration
├── benchmarks/                    # Performance benchmarks
│   ├── benchmark_memory.py        # Memory analysis
│   └── benchmark_performance.py   # Speed comparison
└── notebooks/                     # Interactive tutorials
    └── demo.md                    # Step-by-step guide
```
## Examples
### Example 1: Basic Layer
```python
from bitlinear import BitLinear, estimate_memory_savings
# Create layer
layer = BitLinear(512, 1024)
# Check memory savings
stats = estimate_memory_savings(512, 1024)
print(f"Compression: {stats['compression_ratio']:.1f}x") # ~19x
```
### Example 2: Transformer Conversion
```python
from bitlinear import convert_linear_to_bitlinear
# Original transformer
model = nn.TransformerEncoderLayer(d_model=768, nhead=8, dim_feedforward=3072)
# Convert to BitLinear
model_bit = convert_linear_to_bitlinear(model)
# Compare memory
mem_original = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
mem_bitlinear = sum(p.numel() * p.element_size() for p in model_bit.parameters()) / 1024**2
print(f"Memory: {mem_original:.2f} MB β {mem_bitlinear:.2f} MB")
```
Run complete examples:
```bash
python examples/basic_usage.py
python examples/transformer_example.py
```
## Benchmarks
Run benchmarks to see performance on your hardware:
```bash
# Memory compression analysis
python benchmarks/benchmark_memory.py
# Forward pass performance
python benchmarks/benchmark_performance.py
```
## Testing
Comprehensive test suite with 60+ tests:
```bash
# Run all tests
pytest tests/ -v
# Run specific test modules
pytest tests/test_quantization.py -v
pytest tests/test_layers.py -v
```
## Research Background
This implementation is based on:
- **BitNet:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- **JMLR:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)
### Key Innovations
1. **Ternary Quantization:** Reduces weights to {-1, 0, +1}
2. **Absmax Scaling:** Per-channel scaling for accuracy
3. **Greedy Decomposition:** Multi-ternary for better approximation (see the sketch after this list)
4. **Base-3 Packing:** Near-optimal memory compression
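A minimal sketch of the greedy decomposition (innovation 3): each round ternarizes the residual left by the previous components. The per-step scale rule is an assumption; the repo's `greedy_ternary_decomposition()` may differ:

```python
import torch

def greedy_decompose(w: torch.Tensor, k: int = 3):
    """Approximate w as sum_i gamma_i * T_i with ternary T_i, fitted greedily."""
    residual, components = w.clone(), []
    for _ in range(k):
        gamma = residual.abs().mean(dim=1, keepdim=True).clamp(min=1e-8)
        t = torch.clamp(torch.round(residual / gamma), -1, 1)
        components.append((gamma, t))
        residual -= gamma * t  # the next round fits the leftover error
    return components

w = torch.randn(512, 512)
for k in (1, 2, 3):
    approx = sum(g * t for g, t in greedy_decompose(w, k))
    print(k, f"relative error {((w - approx).norm() / w.norm()).item():.3f}")
```

The printed relative error shrinks as k grows, which is the mechanism behind the k=3 error reduction reported above.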
## Implementation Details
### Python Baseline
Pure PyTorch implementation for correctness and clarity:
- `bitlinear_python()` - Reference ternary matmul
- `greedy_ternary_decomposition()` - Multi-component quantization
- Full gradient support for training
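For orientation, here is a plausible shape for the reference path (the actual signature of `bitlinear_python()` is an assumption); the docstring spells out why optimized kernels need no multiplications:

```python
import torch

def ternary_matmul_reference(x, w_ternary, gamma, bias=None):
    """y = (x @ W_t^T) * gamma + bias, with W_t in {-1, 0, +1}.

    Since every weight is -1, 0, or +1, each output channel is just
    (sum of inputs where W_t == +1) - (sum of inputs where W_t == -1),
    scaled by gamma; a real kernel therefore needs only adds/subtracts.
    """
    y = x @ w_ternary.float().t()  # reference: plain matmul over ternary values
    y = y * gamma.view(1, -1)      # per-output-channel scale
    return y if bias is None else y + bias
```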
### C++ Extensions
Optimized CPU kernels with PyBind11:
- Ternary-specific optimizations (no multiplications)
- Efficient memory access patterns
- Base-3 packing/unpacking
### CUDA Kernels
GPU-accelerated implementation:
- Warp-level reductions using shuffle intrinsics
- Shared memory tiling
- Memory coalescing
- Fused multi-ternary kernels
## Use Cases
### Ideal For:
- **Edge Deployment:** Mobile and embedded devices
- **Large Models:** Billion-parameter models with memory constraints
- **Production Inference:** Cost-effective serving at scale
- **Research:** Exploring ultra-low-precision networks
### Considerations:
- **Training:** Best results with quantization-aware training (QAT); see the sketch below
- **Accuracy:** 3-5% accuracy drop typical (acceptable for many tasks)
- **Speed:** Python implementation may be slower; use C++/CUDA for production
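On the training point: the standard trick for training through a non-differentiable quantizer is the straight-through estimator (STE), sketched below. Whether BitLinear's training path matches this exactly is an assumption:

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Quantize on the forward pass, pass gradients straight through backward."""

    @staticmethod
    def forward(ctx, w):
        gamma = w.abs().mean(dim=1, keepdim=True).clamp(min=1e-8)
        return gamma * torch.clamp(torch.round(w / gamma), -1, 1)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity gradient: updates flow to latent fp32 weights

# Inside a layer's forward pass: w_q = TernarySTE.apply(self.weight)
```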
## Documentation
- **[BENCHMARKS.md](BENCHMARKS.md)** - Detailed performance analysis
- **[MODEL_CARD.md](MODEL_CARD.md)** - HuggingFace model card
- **[notebooks/demo.md](notebooks/demo.md)** - Interactive tutorial
- **[read/IMPLEMENTATION_GUIDE.md](read/IMPLEMENTATION_GUIDE.md)** - Implementation details (not yet public; can be released on request while the pipeline is extended to support future machine-learning research)
## Contributing
Contributions welcome! Areas for improvement:
- AVX/AVX512 vectorization for CPU
- Tensor Core utilization for CUDA
- Additional quantization schemes
- Training examples and tutorials
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Citation
If you use BitLinear in your research, please cite:
```bibtex
@article{jmlr_ternary_2024,
  title   = {Ternary Representations of Neural Networks},
  journal = {Journal of Machine Learning Research},
  volume  = {26},
  year    = {2024},
  url     = {https://jmlr.org/papers/volume26/24-2050/24-2050.pdf}
}

@article{bitnet2023,
  title   = {BitNet: Scaling 1-bit Transformers for Large Language Models},
  author  = {Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Wang, Huaijie and Ma, Lingxiao and Yang, Fan and Wang, Ruiping and Wu, Yi and Wei, Furu},
  journal = {arXiv preprint arXiv:2310.11453},
  year    = {2023}
}
```
## Acknowledgments
This implementation builds upon the groundbreaking work in:
- BitNet by Microsoft Research
- Ternary Neural Networks research (JMLR)
- PyTorch's extensibility framework
## Contact
For questions, issues, or collaboration:
- Open an issue on GitHub
- Check existing documentation
- Review examples and benchmarks
---
Please tag me if you use this in anything you build; I'd love to see it.
Made with ❤️ for efficient deep learning