# Quick Start Guide

Get up and running with BitLinear in minutes.

## Installation

### Prerequisites

- Python >= 3.8
- PyTorch >= 2.0.0
- (Optional) CUDA toolkit for GPU acceleration
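You can verify the prerequisites before installing; a quick illustrative check:

```python
# Illustrative prerequisite check (adjust the thresholds to your needs).
import sys
import torch

assert sys.version_info >= (3, 8), "Python >= 3.8 required"

# torch.__version__ looks like "2.1.0" or "2.1.0+cu118"
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 0), "PyTorch >= 2.0.0 required"

print(f"Python {sys.version_info.major}.{sys.version_info.minor}, "
      f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
```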
### Install from Source

```bash
# Clone the repository
git clone https://github.com/yourusername/bitlinear.git
cd bitlinear

# Install in development mode (CPU-only)
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"
```

### Install with CUDA Support

```bash
# Set CUDA_HOME if not already set
export CUDA_HOME=/usr/local/cuda  # Linux/macOS
# On Windows (cmd): set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8

# Install
pip install -e .
```
## Basic Usage

### Simple Example

```python
import torch
from bitlinear import BitLinear

# Create a BitLinear layer (same interface as nn.Linear)
layer = BitLinear(in_features=512, out_features=1024, bias=True)

# Forward pass
x = torch.randn(32, 128, 512)  # [batch, seq_len, features]
output = layer(x)              # [32, 128, 1024]

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
```
### Convert Existing Model

```python
import torch
import torch.nn as nn
from bitlinear import BitLinear

# Start with a standard Linear layer
linear = nn.Linear(512, 1024)
# ... possibly pre-trained ...

# Convert to BitLinear
bitlinear = BitLinear.from_linear(linear)

# Use as a drop-in replacement
x = torch.randn(16, 512)
output = bitlinear(x)
```
### Multi-Component Ternary Layer

For better approximation quality, use several ternary components:

```python
import torch
from bitlinear import MultiTernaryLinear

# k=4 means 4 ternary components (better approximation, 4x compute)
layer = MultiTernaryLinear(
    in_features=512,
    out_features=1024,
    k=4,  # Number of ternary components
    bias=True,
)

x = torch.randn(32, 512)
output = layer(x)
```
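What extra components buy you can be seen with a standalone sketch. The greedy residual fit below is illustrative only, not necessarily the algorithm `MultiTernaryLinear` uses internally; `ternarize` and `multi_ternary_approx` are hypothetical helpers:

```python
import torch

def ternarize(w, threshold_factor=0.75):
    # One ternary component: T in {-1, 0, +1} plus a scale alpha (illustrative).
    delta = threshold_factor * w.abs().mean()
    t = torch.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    mask = t != 0
    alpha = w[mask].abs().mean() if mask.any() else torch.tensor(0.0)
    return alpha, t

def multi_ternary_approx(w, k=4):
    # Greedy residual fit: w is approximated by sum_i alpha_i * T_i.
    approx = torch.zeros_like(w)
    for _ in range(k):
        alpha, t = ternarize(w - approx)
        approx = approx + alpha * t
    return approx

torch.manual_seed(0)
w = torch.randn(64, 64)
for k in (1, 2, 4):
    err = float((w - multi_ternary_approx(w, k)).norm() / w.norm())
    print(f"k={k}: relative error {err:.3f}")
```

The relative error shrinks as `k` grows, which is the trade-off the `k` parameter exposes: better approximation for proportionally more compute.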
### Convert Entire Model

```python
import torch
import torch.nn as nn
from bitlinear import convert_linear_to_bitlinear

# Original model with nn.Linear layers
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.Softmax(dim=-1),
)

# Convert all Linear layers to BitLinear
model_bitlinear = convert_linear_to_bitlinear(model, inplace=False)

# Use as normal
x = torch.randn(16, 512)
output = model_bitlinear(x)
```
## In a Transformer

Replace the attention projection layers:

```python
import torch.nn as nn
from bitlinear import BitLinear

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        # Replace nn.Linear with BitLinear
        self.q_proj = BitLinear(d_model, d_model)
        self.k_proj = BitLinear(d_model, d_model)
        self.v_proj = BitLinear(d_model, d_model)
        self.out_proj = BitLinear(d_model, d_model)
        # Keep other components unchanged
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # Standard Transformer forward pass
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # ... attention computation ...
```
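The elided attention computation could look like the following self-contained sketch. It uses `nn.Linear` so it runs without `bitlinear` installed; swap the four projections for `BitLinear` once the package is available. `F.scaled_dot_product_attention` requires PyTorch >= 2.0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniAttention(nn.Module):
    # Sketch of the elided attention step; replace nn.Linear with BitLinear.
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.nhead = nhead
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, s, d = x.shape
        # Project and split into heads: [batch, nhead, seq, head_dim]
        q = self.q_proj(x).view(b, s, self.nhead, -1).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.nhead, -1).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.nhead, -1).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        # Merge heads back and apply the output projection
        attn = attn.transpose(1, 2).reshape(b, s, d)
        return self.out_proj(attn)

x = torch.randn(2, 16, 512)
y = MiniAttention()(x)
print(y.shape)  # torch.Size([2, 16, 512])
```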
## Memory Savings Example

```python
import torch
import torch.nn as nn
from bitlinear import BitLinear

def count_params(model):
    return sum(p.numel() for p in model.parameters())

def estimate_memory_mb(model):
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / (1024 ** 2)

# Standard Linear
linear = nn.Linear(2048, 2048)
print(f"Linear parameters: {count_params(linear):,}")
print(f"Linear memory: {estimate_memory_mb(linear):.2f} MB")

# BitLinear
bitlinear = BitLinear(2048, 2048)
print(f"BitLinear parameters: {count_params(bitlinear):,}")
print(f"BitLinear memory: {estimate_memory_mb(bitlinear):.2f} MB")

# Savings
savings = (estimate_memory_mb(linear) - estimate_memory_mb(bitlinear)) / estimate_memory_mb(linear) * 100
print(f"Memory savings: {savings:.1f}%")
```
## Training with BitLinear

### Fine-tuning a Pre-trained Model

```python
import torch
import torch.nn as nn
import torch.optim as optim
from bitlinear import convert_linear_to_bitlinear

# Load a pre-trained model (placeholder: substitute your own model class)
model = YourModel.from_pretrained('model_name')

# Convert to BitLinear
model = convert_linear_to_bitlinear(model, inplace=True)

# Fine-tune with a standard PyTorch training loop
# (num_epochs and dataloader are assumed to be defined elsewhere)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for batch in dataloader:
        x, y = batch

        # Forward pass
        output = model(x)
        loss = criterion(output, y)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
### Quantization-Aware Training (QAT)

Train with quantization in the loop from scratch:

```python
import torch.nn as nn
import torch.optim as optim
from bitlinear import BitLinear

# Model with BitLinear from the start
model = nn.Sequential(
    BitLinear(784, 512),
    nn.ReLU(),
    BitLinear(512, 256),
    nn.ReLU(),
    BitLinear(256, 10),
)

# Standard training loop.
# Gradients flow through quantization via the straight-through estimator.
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# ... train as usual ...
```
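The straight-through estimator mentioned above can be sketched in a few lines. This is an illustrative stand-alone version, not necessarily BitLinear's internal implementation:

```python
import torch

class TernaryQuantSTE(torch.autograd.Function):
    # Quantize to {-1, 0, +1} in forward; pass gradients straight through in backward.

    @staticmethod
    def forward(ctx, w):
        delta = 0.75 * w.abs().mean()  # illustrative threshold
        q = torch.zeros_like(w)
        q[w > delta] = 1.0
        q[w < -delta] = -1.0
        return q

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat the quantizer as identity for the gradient
        return grad_output

w = torch.randn(4, 4, requires_grad=True)
q = TernaryQuantSTE.apply(w)
q.sum().backward()
print(torch.equal(w.grad, torch.ones_like(w)))  # True: gradient passed straight through
```

Without the straight-through trick, the ternary rounding has zero gradient almost everywhere and the full-precision weights would never update.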
## Testing

Run the test suite:

```bash
# Install test dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_layers.py -v

# Run with coverage
pytest tests/ -v --cov=bitlinear --cov-report=html

# Skip slow tests
pytest tests/ -m "not slow"

# Skip CUDA tests (if no GPU is available)
pytest tests/ -m "not cuda"
```
## Examples

Run the included examples:

```bash
# Basic usage
python examples/basic_usage.py

# Transformer example
python examples/transformer_example.py
```
## Troubleshooting

### Import Error

If you get `ModuleNotFoundError: No module named 'bitlinear'`:

```bash
# Make sure the package is installed
pip install -e .

# Or add it to PYTHONPATH
export PYTHONPATH=/path/to/BitLinear:$PYTHONPATH
```
### CUDA Build Failures

If the CUDA build fails:

1. **Check CUDA_HOME:**

   ```bash
   echo $CUDA_HOME  # Should point to the CUDA installation
   ```

2. **Check the PyTorch CUDA version:**

   ```python
   import torch
   print(torch.version.cuda)
   ```

3. **Match CUDA versions:** the CUDA version PyTorch was built with and the system CUDA toolkit should match.

4. **Fall back to CPU:**

   ```bash
   # Build the CPU-only version
   unset CUDA_HOME
   pip install -e .
   ```
### Tests Failing

All tests are currently marked with `pytest.skip()` because the implementation is not yet complete. This is expected!

To implement:

1. Follow `IMPLEMENTATION_GUIDE.md`
2. Start with `bitlinear/quantization.py`
3. Remove the `pytest.skip()` calls as you implement each function
4. Tests should pass as you complete the implementation
## Next Steps

1. **Read the Implementation Guide:** `IMPLEMENTATION_GUIDE.md`
2. **Explore the Project Structure:** `PROJECT_STRUCTURE.md`
3. **Start Implementing:**
   - Begin with `bitlinear/quantization.py`
   - Move to `bitlinear/functional.py`
   - Then `bitlinear/layers.py`
4. **Test as You Go:** run the tests after implementing each component
5. **Try Examples:** test with `examples/transformer_example.py`

## Getting Help

- **Documentation:** check the docstrings in each module
- **Issues:** open an issue on GitHub
- **Examples:** see the `examples/` directory
- **Tests:** look at `tests/` for usage patterns
## Performance Tips

### Memory Optimization

1. **Use packed weights** (once implemented):

   ```python
   from bitlinear.packing import pack_ternary_base3

   packed, shape = pack_ternary_base3(W_ternary)
   ```

2. **Batch processing:** larger batches are more efficient.
3. **Mixed precision:** combine with `torch.amp` for activation quantization.
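Since `pack_ternary_base3` is not implemented yet, here is an illustrative sketch of the base-3 idea it refers to: five ternary digits fit in one byte, because 3^5 = 243 <= 256. `unpack_ternary_base3` is a hypothetical helper added only for the round-trip check.

```python
import torch

def pack_ternary_base3(t):
    # Pack a ternary tensor into bytes, 5 trits per byte (illustrative sketch).
    flat = t.flatten().to(torch.int64) + 1        # map {-1, 0, 1} -> {0, 1, 2}
    pad = (-flat.numel()) % 5                      # pad to a multiple of 5
    flat = torch.cat([flat, flat.new_zeros(pad)])
    groups = flat.view(-1, 5)
    weights = torch.tensor([1, 3, 9, 27, 81])      # base-3 place values
    packed = (groups * weights).sum(dim=1).to(torch.uint8)  # max 2*121 = 242
    return packed, t.shape

def unpack_ternary_base3(packed, shape):
    # Recover the base-3 digits and undo the {-1, 0, 1} -> {0, 1, 2} shift.
    vals = packed.to(torch.int64)
    digits = []
    for _ in range(5):
        digits.append(vals % 3)
        vals = vals // 3
    flat = torch.stack(digits, dim=1).flatten() - 1
    return flat[: shape.numel()].view(shape).to(torch.float32)

t = torch.randint(-1, 2, (8, 8)).float()
packed, shape = pack_ternary_base3(t)
restored = unpack_ternary_base3(packed, shape)
print(torch.equal(t, restored))                    # round-trip check
print(f"{t.numel()} trits packed into {packed.numel()} bytes")
```

At ~1.6 bits per weight this is close to the log2(3) ≈ 1.585-bit information-theoretic floor for ternary values, versus 32 bits for a float32 weight.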
### Speed Optimization

1. **Use CUDA:** build with CUDA support for GPU acceleration.
2. **Larger layers:** BitLinear's benefits increase with layer size.
3. **Profile:** use the PyTorch profiler to find bottlenecks:

   ```python
   import torch.profiler as profiler

   with profiler.profile() as prof:
       output = model(x)

   # Sort by "cpu_time_total" instead when running without a GPU
   print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
   ```
## Resources

- **Paper:** https://jmlr.org/papers/volume26/24-2050/24-2050.pdf
- **BitNet:** https://arxiv.org/abs/2310.11453
- **PyTorch Quantization:** https://pytorch.org/docs/stable/quantization.html

Happy coding! 🚀