Quick Start Guide
Get up and running with BitLinear in minutes.
Installation
Prerequisites
- Python >= 3.8
- PyTorch >= 2.0.0
- (Optional) CUDA toolkit for GPU acceleration
Install from Source
# Clone the repository
git clone https://github.com/yourusername/bitlinear.git
cd bitlinear
# Install in development mode (CPU-only)
pip install -e .
# Or with development dependencies
pip install -e ".[dev]"
Install with CUDA Support
# Set CUDA_HOME if not already set
export CUDA_HOME=/usr/local/cuda # Linux/Mac
# or
set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8 # Windows
# Install
pip install -e .
Basic Usage
Simple Example
import torch
from bitlinear import BitLinear
# Create a BitLinear layer (same interface as nn.Linear)
layer = BitLinear(in_features=512, out_features=1024, bias=True)
# Forward pass
x = torch.randn(32, 128, 512) # [batch, seq_len, features]
output = layer(x) # [32, 128, 1024]
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
Convert Existing Model
import torch.nn as nn
from bitlinear import BitLinear
# Start with a standard Linear layer
linear = nn.Linear(512, 1024)
# ... possibly pre-trained ...
# Convert to BitLinear
bitlinear = BitLinear.from_linear(linear)
# Use as drop-in replacement
x = torch.randn(16, 512)
output = bitlinear(x)
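To ternarize pre-trained weights, from_linear needs a quantization rule. A common choice for 1.58-bit layers is absmean quantization, as in BitNet b1.58: scale the weights by their mean absolute value, round to {-1, 0, +1}, and keep the scale for dequantization. A minimal sketch of that scheme (an assumption - BitLinear.from_linear may use a different rule internally):

```python
import torch

def absmean_ternary(W: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization (BitNet b1.58 style, illustrative).

    Returns a ternary tensor and a scale so that W ~= scale * W_ternary.
    """
    scale = W.abs().mean().clamp(min=eps)          # per-tensor scale
    W_ternary = (W / scale).round().clamp(-1, 1)   # values in {-1, 0, +1}
    return W_ternary, scale

W = torch.randn(1024, 512)
W_t, s = absmean_ternary(W)
print(sorted(W_t.unique().tolist()))  # a subset of [-1.0, 0.0, 1.0]
```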
Multi-Component Ternary Layer
For better approximation quality:
from bitlinear import MultiTernaryLinear
# k=4 means 4 ternary components (better approximation, 4x compute)
layer = MultiTernaryLinear(
    in_features=512,
    out_features=1024,
    k=4,  # Number of ternary components
    bias=True,
)
x = torch.randn(32, 512)
output = layer(x)
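One way to picture the k components: each successive ternary matrix quantizes the residual left by the previous ones, with a least-squares scale per component, so the weighted sum approximates the full-precision weights. The sketch below shows such a greedy fit; this is an assumption for illustration - MultiTernaryLinear may fit its components differently (e.g. jointly, during training):

```python
import torch

def greedy_multi_ternary(W: torch.Tensor, k: int = 4, eps: float = 1e-5):
    """Approximate W as sum_i alpha_i * T_i with each T_i ternary.

    Greedy residual fitting: component i quantizes what components
    0..i-1 left over, and alpha_i is the least-squares optimal scale.
    """
    residual = W.clone()
    components = []
    for _ in range(k):
        scale = residual.abs().mean().clamp(min=eps)
        T = (residual / scale).round().clamp(-1, 1)
        alpha = (residual * T).sum() / (T * T).sum().clamp(min=eps)
        components.append((alpha, T))
        residual = residual - alpha * T
    return components

W = torch.randn(256, 256)
approx = sum(a * T for a, T in greedy_multi_ternary(W, k=4))
print(((W - approx).norm() / W.norm()).item())  # relative error shrinks with k
```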
Convert Entire Model
from bitlinear import convert_linear_to_bitlinear
import torch.nn as nn
# Original model with nn.Linear layers
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.Softmax(dim=-1)
)
# Convert all Linear layers to BitLinear
model_bitlinear = convert_linear_to_bitlinear(model, inplace=False)
# Use as normal
x = torch.randn(16, 512)
output = model_bitlinear(x)
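Under the hood, a converter like this typically walks the module tree and swaps each nn.Linear for its quantized counterpart. A sketch of that pattern, using a hypothetical stand-in class so it runs without the bitlinear package (the real convert_linear_to_bitlinear may differ, e.g. in how it handles the inplace flag):

```python
import torch
import torch.nn as nn

def convert_linears(module: nn.Module, layer_cls) -> nn.Module:
    """Recursively replace every nn.Linear with layer_cls.from_linear(child)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, layer_cls.from_linear(child))
        else:
            convert_linears(child, layer_cls)  # recurse into containers
    return module

# Hypothetical stand-in so the sketch is self-contained:
class TernaryStandIn(nn.Linear):
    @classmethod
    def from_linear(cls, lin: nn.Linear) -> "TernaryStandIn":
        new = cls(lin.in_features, lin.out_features, bias=lin.bias is not None)
        new.load_state_dict(lin.state_dict())  # copy pre-trained weights
        return new

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
convert_linears(model, TernaryStandIn)
```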
In a Transformer
Replace attention projection layers:
import torch.nn as nn
from bitlinear import BitLinear
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        # Replace nn.Linear with BitLinear
        self.q_proj = BitLinear(d_model, d_model)
        self.k_proj = BitLinear(d_model, d_model)
        self.v_proj = BitLinear(d_model, d_model)
        self.out_proj = BitLinear(d_model, d_model)
        # Keep other components unchanged
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # Standard Transformer forward pass
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # ... attention computation ...
Memory Savings Example
import torch
import torch.nn as nn
from bitlinear import BitLinear
def count_params(model):
    return sum(p.numel() for p in model.parameters())

def estimate_memory_mb(model):
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / (1024 ** 2)
# Standard Linear
linear = nn.Linear(2048, 2048)
print(f"Linear parameters: {count_params(linear):,}")
print(f"Linear memory: {estimate_memory_mb(linear):.2f} MB")
# BitLinear
bitlinear = BitLinear(2048, 2048)
print(f"BitLinear parameters: {count_params(bitlinear):,}")
print(f"BitLinear memory: {estimate_memory_mb(bitlinear):.2f} MB")
# Savings
savings = (estimate_memory_mb(linear) - estimate_memory_mb(bitlinear)) / estimate_memory_mb(linear) * 100
print(f"Memory savings: {savings:.1f}%")
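Note that the snippet above measures whatever tensors the layer currently stores, which may still be full precision until weight packing is implemented. The theoretical ceiling is easy to compute directly: a ternary weight carries log2(3) ~ 1.58 bits versus 32 bits for fp32.

```python
import math

def theoretical_mb(n_weights: int, bits_per_weight: float) -> float:
    """Idealized storage for n_weights at a given bit width."""
    return n_weights * bits_per_weight / 8 / (1024 ** 2)

n = 2048 * 2048
fp32_mb = theoretical_mb(n, 32)               # 16.0 MB
ternary_mb = theoretical_mb(n, math.log2(3))  # ~0.79 MB at 1.58 bits/weight
savings = (fp32_mb - ternary_mb) / fp32_mb * 100
print(f"theoretical savings: {savings:.1f}%")  # ~95.0%
```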
Training with BitLinear
Fine-tuning a Pre-trained Model
import torch
import torch.nn as nn
import torch.optim as optim
from bitlinear import convert_linear_to_bitlinear
# Load pre-trained model
model = YourModel.from_pretrained('model_name')
# Convert to BitLinear
model = convert_linear_to_bitlinear(model, inplace=True)
# Fine-tune with standard PyTorch training loop
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for batch in dataloader:
        x, y = batch
        # Forward pass
        output = model(x)
        loss = criterion(output, y)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Quantization-Aware Training (QAT)
Train with quantization from scratch:
from bitlinear import BitLinear
# Model with BitLinear from the start
model = nn.Sequential(
    BitLinear(784, 512),
    nn.ReLU(),
    BitLinear(512, 256),
    nn.ReLU(),
    BitLinear(256, 10),
)
# Standard training loop
# Gradients will flow through quantization (straight-through estimator)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# ... train as usual ...
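The straight-through estimator mentioned above can be sketched with the standard detach trick: the forward pass uses the quantized weights, while the backward pass treats quantization as the identity so gradients reach the full-precision weights. This is a minimal illustration; BitLinear's internal STE may add refinements such as gradient clipping.

```python
import torch

def ste_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Ternary quantization with a straight-through estimator.

    Forward returns the quantized value; backward sees the identity,
    because the correction term (w_q - w) is detached from the graph.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    return w + (w_q - w).detach()

w = torch.randn(8, 8, requires_grad=True)
out = ste_ternary(w)
out.sum().backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradient passed straight through
```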
Testing
Run the test suite:
# Install test dependencies
pip install -e ".[dev]"
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_layers.py -v
# Run with coverage
pytest tests/ -v --cov=bitlinear --cov-report=html
# Skip slow tests
pytest tests/ -m "not slow"
# Skip CUDA tests (if no GPU available)
pytest tests/ -m "not cuda"
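Filtering with -m "not slow" or -m "not cuda" assumes those markers are registered; otherwise pytest warns about unknown marks. A conftest.py sketch that registers them (hypothetical - the repo may already declare these in pyproject.toml or pytest.ini):

```python
# conftest.py -- register the custom markers used by the test suite
def pytest_configure(config):
    config.addinivalue_line("markers", "slow: long-running tests")
    config.addinivalue_line("markers", "cuda: tests that require a CUDA-capable GPU")
```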
Examples
Run included examples:
# Basic usage
python examples/basic_usage.py
# Transformer example
python examples/transformer_example.py
Troubleshooting
Import Error
If you get ModuleNotFoundError: No module named 'bitlinear':
# Make sure you installed the package
pip install -e .
# Or add to PYTHONPATH
export PYTHONPATH=/path/to/BitLinear:$PYTHONPATH
CUDA Build Failures
If CUDA build fails:
Check CUDA_HOME:
echo $CUDA_HOME  # Should point to CUDA installation
Check PyTorch CUDA version:
import torch
print(torch.version.cuda)
Match CUDA versions: PyTorch and system CUDA should match
Fall back to CPU:
# Build CPU-only version
unset CUDA_HOME
pip install -e .
Tests Failing
All tests are currently marked with pytest.skip() because the implementation is not yet complete. This is expected!
To implement:
- Follow IMPLEMENTATION_GUIDE.md
- Start with bitlinear/quantization.py
- Remove pytest.skip() as you implement each function
- Tests should pass as you complete the implementation
Next Steps
- Read the Implementation Guide: IMPLEMENTATION_GUIDE.md
- Explore the Project Structure: PROJECT_STRUCTURE.md
- Start Implementing:
  - Begin with bitlinear/quantization.py
  - Move to bitlinear/functional.py
  - Then bitlinear/layers.py
- Test as You Go: Run tests after implementing each component
- Try Examples: Test with examples/transformer_example.py
Getting Help
- Documentation: Check docstrings in each module
- Issues: Open an issue on GitHub
- Examples: See the examples/ directory
- Tests: Look at tests/ for usage patterns
Performance Tips
Memory Optimization
- Use packed weights (when implemented):
  from bitlinear.packing import pack_ternary_base3
  packed, shape = pack_ternary_base3(W_ternary)
- Batch processing: larger batches are more efficient
- Mixed precision: combine with torch.amp for activation quantization
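The idea behind base-3 packing is that 3^5 = 243 <= 256, so five ternary values fit in a single byte (about 1.6 bits per weight). A runnable sketch of that layout - illustrative only, since the actual bitlinear.packing format may differ:

```python
import torch

def pack_base3(T: torch.Tensor):
    """Pack ternary {-1, 0, +1} values five per byte (3**5 = 243 <= 256)."""
    flat = T.flatten().to(torch.int64) + 1          # map {-1,0,1} -> {0,1,2}
    pad = (-flat.numel()) % 5                        # pad to a multiple of 5
    flat = torch.cat([flat, flat.new_zeros(pad)])
    weights = torch.tensor([1, 3, 9, 27, 81])
    packed = (flat.view(-1, 5) * weights).sum(dim=1).to(torch.uint8)
    return packed, T.shape

def unpack_base3(packed: torch.Tensor, shape) -> torch.Tensor:
    weights = torch.tensor([1, 3, 9, 27, 81])
    digits = (packed.to(torch.int64).unsqueeze(1) // weights) % 3
    n = torch.Size(shape).numel()
    return (digits.flatten()[:n] - 1).view(shape).to(torch.float32)

T = torch.randint(-1, 2, (64, 64)).to(torch.float32)
packed, shape = pack_base3(T)
# 4096 fp32 values (16 KiB) become ceil(4096 / 5) = 820 bytes
```

The round trip is exact: unpack_base3(packed, shape) reproduces T.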
Speed Optimization
- Use CUDA: Build with CUDA support for GPU acceleration
- Larger layers: BitLinear benefits increase with layer size
- Profile: Use PyTorch profiler to find bottlenecks
import torch.profiler as profiler
with profiler.profile() as prof:
    output = model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Resources
- Paper: https://jmlr.org/papers/volume26/24-2050/24-2050.pdf
- BitNet: https://arxiv.org/abs/2310.11453
- PyTorch Quantization: https://pytorch.org/docs/stable/quantization.html
Happy coding! 🚀