# Quick Start Guide

Get up and running with BitLinear in minutes.

## Installation

### Prerequisites

- Python >= 3.8
- PyTorch >= 2.0.0
- (Optional) CUDA toolkit for GPU acceleration
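
To confirm your environment meets these requirements:

```python
import torch

print(torch.__version__)          # should be >= 2.0.0
print(torch.cuda.is_available())  # True if a CUDA-enabled PyTorch build sees a GPU
```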

### Install from Source

```bash
# Clone the repository
git clone https://github.com/yourusername/bitlinear.git
cd bitlinear

# Install in development mode (CPU-only)
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"
```

### Install with CUDA Support

```bash
# Set CUDA_HOME if not already set
export CUDA_HOME=/usr/local/cuda  # Linux/macOS
# On Windows (cmd): set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8

# Install
pip install -e .
```

## Basic Usage

### Simple Example

```python
import torch
from bitlinear import BitLinear

# Create a BitLinear layer (same interface as nn.Linear)
layer = BitLinear(in_features=512, out_features=1024, bias=True)

# Forward pass
x = torch.randn(32, 128, 512)  # [batch, seq_len, features]
output = layer(x)  # [32, 128, 1024]

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
```

### Convert Existing Model

```python
import torch
import torch.nn as nn
from bitlinear import BitLinear

# Start with a standard Linear layer
linear = nn.Linear(512, 1024)
# ... possibly pre-trained ...

# Convert to BitLinear
bitlinear = BitLinear.from_linear(linear)

# Use as a drop-in replacement
x = torch.randn(16, 512)
output = bitlinear(x)
```

### Multi-Component Ternary Layer

For better approximation quality:

```python
import torch
from bitlinear import MultiTernaryLinear

# k=4 means 4 ternary components (better approximation, 4x compute)
layer = MultiTernaryLinear(
    in_features=512,
    out_features=1024,
    k=4,  # Number of ternary components
    bias=True
)

x = torch.randn(32, 512)
output = layer(x)
```
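
Conceptually, a k-component layer approximates the full-precision weight matrix as a scaled sum of ternary matrices, so larger k trades extra compute for fidelity. A sketch of the idea (not necessarily the exact decomposition `MultiTernaryLinear` uses):

$$W \approx \sum_{i=1}^{k} \alpha_i \, T_i, \qquad T_i \in \{-1, 0, +1\}^{\text{out} \times \text{in}}, \quad \alpha_i > 0$$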

### Convert Entire Model

```python
import torch
import torch.nn as nn
from bitlinear import convert_linear_to_bitlinear

# Original model with nn.Linear layers
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.Softmax(dim=-1)
)

# Convert all Linear layers to BitLinear
model_bitlinear = convert_linear_to_bitlinear(model, inplace=False)

# Use as normal
x = torch.randn(16, 512)
output = model_bitlinear(x)
```

## In a Transformer

Replace attention projection layers:

```python
import torch.nn as nn
from bitlinear import BitLinear

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()

        # Replace nn.Linear with BitLinear
        self.q_proj = BitLinear(d_model, d_model)
        self.k_proj = BitLinear(d_model, d_model)
        self.v_proj = BitLinear(d_model, d_model)
        self.out_proj = BitLinear(d_model, d_model)

        # Keep other components unchanged
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # Standard Transformer forward pass
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # ... attention computation ...
```
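
For the elided attention step, PyTorch 2.0's built-in scaled dot-product attention works unchanged with BitLinear projections. A sketch that, for brevity, skips splitting into `nhead` heads:

```python
import torch.nn.functional as F

# Inside forward(), after computing q, k, v:
attn = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T / sqrt(d)) V
out = self.out_proj(self.dropout(attn))
```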

## Memory Savings Example

```python
import torch
import torch.nn as nn
from bitlinear import BitLinear

def count_params(model):
    return sum(p.numel() for p in model.parameters())

def estimate_memory_mb(model):
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / (1024 ** 2)

# Standard Linear
linear = nn.Linear(2048, 2048)
print(f"Linear parameters: {count_params(linear):,}")
print(f"Linear memory: {estimate_memory_mb(linear):.2f} MB")

# BitLinear
bitlinear = BitLinear(2048, 2048)
print(f"BitLinear parameters: {count_params(bitlinear):,}")
print(f"BitLinear memory: {estimate_memory_mb(bitlinear):.2f} MB")

# Savings. Note: the reported figure depends on how BitLinear stores its
# weights; a layer keeping fp32 master weights for training only shows
# savings once the weights are packed.
savings = (estimate_memory_mb(linear) - estimate_memory_mb(bitlinear)) / estimate_memory_mb(linear) * 100
print(f"Memory savings: {savings:.1f}%")
```

## Training with BitLinear

### Fine-tuning a Pre-trained Model

```python
import torch
import torch.nn as nn
import torch.optim as optim
from bitlinear import convert_linear_to_bitlinear

# Load a pre-trained model (YourModel is a placeholder for your own class)
model = YourModel.from_pretrained('model_name')

# Convert to BitLinear
model = convert_linear_to_bitlinear(model, inplace=True)

# Fine-tune with a standard PyTorch training loop
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# num_epochs and dataloader are assumed to be defined elsewhere
for epoch in range(num_epochs):
    for batch in dataloader:
        x, y = batch

        # Forward pass
        output = model(x)
        loss = criterion(output, y)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

### Quantization-Aware Training (QAT)

Train with quantization from scratch:

```python
import torch.nn as nn
import torch.optim as optim
from bitlinear import BitLinear

# Model with BitLinear from the start
model = nn.Sequential(
    BitLinear(784, 512),
    nn.ReLU(),
    BitLinear(512, 256),
    nn.ReLU(),
    BitLinear(256, 10),
)

# Standard training loop.
# Gradients will flow through quantization (straight-through estimator).
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# ... train as usual ...
```
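
The straight-through estimator is the standard trick that makes this work: the forward pass sees quantized values, while the backward pass treats quantization as the identity. A minimal sketch of the pattern (not necessarily the library's exact implementation):

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Forward: x.round(). Backward: gradient flows through x alone,
    # because the correction term is detached from the autograd graph.
    return x + (x.round() - x).detach()
```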

## Testing

Run the test suite:

```bash
# Install test dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_layers.py -v

# Run with coverage
pytest tests/ -v --cov=bitlinear --cov-report=html

# Skip slow tests
pytest tests/ -m "not slow"

# Skip CUDA tests (if no GPU available)
pytest tests/ -m "not cuda"
```

## Examples

Run included examples:

```bash
# Basic usage
python examples/basic_usage.py

# Transformer example
python examples/transformer_example.py
```

## Troubleshooting

### Import Error

If you get `ModuleNotFoundError: No module named 'bitlinear'`:

```bash
# Make sure you installed the package
pip install -e .

# Or add it to PYTHONPATH
export PYTHONPATH=/path/to/BitLinear:$PYTHONPATH
```

### CUDA Build Failures

If CUDA build fails:

1. **Check CUDA_HOME:**

   ```bash
   echo $CUDA_HOME  # Should point to your CUDA installation
   ```

2. **Check PyTorch's CUDA version:**

   ```python
   import torch
   print(torch.version.cuda)
   ```

3. **Match CUDA versions:** the CUDA version PyTorch was built against and the system CUDA toolkit should match (see the snippet after this list)

4. **Fall back to CPU:**

   ```bash
   # Build the CPU-only version
   unset CUDA_HOME
   pip install -e .
   ```
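
To compare the two versions quickly (assumes `nvcc` is on your PATH):

```python
import subprocess
import torch

# The CUDA version PyTorch was built against
print("PyTorch CUDA:", torch.version.cuda)
# The system toolkit version; the two should agree
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```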

### Tests Failing

All tests are currently marked with `pytest.skip()` because the implementation is not yet complete. This is expected!

To implement:
1. Follow `IMPLEMENTATION_GUIDE.md`
2. Start with `bitlinear/quantization.py`
3. Remove `pytest.skip()` as you implement each function
4. Tests should pass as you complete implementation

## Next Steps

1. **Read the Implementation Guide:** `IMPLEMENTATION_GUIDE.md`
2. **Explore the Project Structure:** `PROJECT_STRUCTURE.md`
3. **Start Implementing:**
   - Begin with `bitlinear/quantization.py` (a quantizer sketch follows this list)
   - Move to `bitlinear/functional.py`
   - Then `bitlinear/layers.py`
4. **Test as You Go:** Run tests after implementing each component
5. **Try Examples:** Test with `examples/transformer_example.py`
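
If you want a concrete starting point for the quantizer, the absmean ternary recipe popularized by the BitNet line of work looks roughly like this (a sketch only; match it to whatever signature `bitlinear/quantization.py` specifies):

```python
import torch

def ternary_quantize(w: torch.Tensor):
    # Scale by the mean absolute value, then round into {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-8)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale
```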

## Getting Help

- **Documentation:** Check docstrings in each module
- **Issues:** Open an issue on GitHub
- **Examples:** See `examples/` directory
- **Tests:** Look at `tests/` for usage patterns

## Performance Tips

### Memory Optimization

1. **Use packed weights** (when implemented; see the sketch after this list):

   ```python
   from bitlinear.packing import pack_ternary_base3
   packed, shape = pack_ternary_base3(W_ternary)
   ```

2. **Batch processing:** larger batches amortize per-layer overhead and are generally more efficient

3. **Mixed precision:** combine with `torch.amp` autocast so activations run in lower precision
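
For intuition about packing: base-3 encoding can store five ternary digits per byte, since 3^5 = 243 fits in a uint8. A hypothetical sketch of the encoding side, independent of the library's actual `pack_ternary_base3`:

```python
import torch

def pack_ternary(t: torch.Tensor) -> torch.Tensor:
    # t holds values in {-1, 0, 1}; map them to base-3 digits {0, 1, 2}.
    digits = (t.flatten() + 1).to(torch.int64)
    pad = (-digits.numel()) % 5                          # pad to a multiple of 5
    digits = torch.cat([digits, digits.new_zeros(pad)])
    place = torch.tensor([81, 27, 9, 3, 1])              # base-3 place values
    return (digits.view(-1, 5) * place).sum(dim=1).to(torch.uint8)
```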

### Speed Optimization

1. **Use CUDA:** Build with CUDA support for GPU acceleration
2. **Larger layers:** BitLinear benefits increase with layer size
3. **Profile:** Use PyTorch profiler to find bottlenecks

```python
import torch.profiler as profiler

with profiler.profile() as prof:
    output = model(x)

# Use sort_by="cpu_time_total" instead on CPU-only runs
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

## Resources

- **Paper:** https://jmlr.org/papers/volume26/24-2050/24-2050.pdf
- **BitNet:** https://arxiv.org/abs/2310.11453
- **PyTorch Quantization:** https://pytorch.org/docs/stable/quantization.html

Happy coding! 🚀