# Implementation Guide

This document provides a roadmap for implementing the BitLinear functionality, following the structure defined in the project skeleton. It is also meant to show how the same process can be replicated for other operations.

## Implementation Order

### Phase 1: Python Baseline (Correctness First)

Start here to establish correctness before optimizing.

#### 1.1 Quantization (`bitlinear/quantization.py`)

Order of implementation:

1. `absmax_scale()` - Simple max computation
2. `ternary_quantize()` - Threshold-based quantization to {-1, 0, +1}
3. `weight_to_ternary()` - Combines the above
4. Test thoroughly with `tests/test_quantization.py`

**Key considerations:**

- Threshold selection (try 0.33 * scale or 0.5 * scale)
- Per-channel vs. global scaling trade-offs
- Numerical stability (avoid division by zero)

#### 1.2 Functional Operations (`bitlinear/functional.py`)

Order of implementation:

1. `bitlinear_python()` - Core ternary matmul

   ```python
   # Pseudocode:
   output = torch.matmul(x, W_ternary.T)
   output = output * gamma.unsqueeze(0)
   if bias is not None:
       output = output + bias
   return output
   ```

2. `greedy_ternary_decomposition()` - Iterative residual quantization

   ```python
   # Pseudocode:
   residual = W.clone()
   for i in range(k):
       W_t, gamma = weight_to_ternary(residual)
       # store W_t and gamma
       residual = residual - gamma * W_t
   ```

3. `multi_ternary_linear_python()` - Sum of k ternary operations
4. Test with `tests/test_functional.py`

#### 1.3 Layer Modules (`bitlinear/layers.py`)

Order of implementation:

1. `BitLinear.__init__()` and `reset_parameters()`
   - Initialize dense weights using kaiming_uniform
   - Quantize to ternary using `weight_to_ternary()`
   - Store as buffers or parameters
2. `BitLinear.forward()` - Call `bitlinear_python()`
3. `BitLinear.from_linear()` - Conversion utility
4. `MultiTernaryLinear` - Similar structure
5. `convert_linear_to_bitlinear()` - Recursive module conversion
6. Test with `tests/test_layers.py`
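The Phase 1 steps above can be sketched end to end as follows. The function names come from the skeleton; the internals are one reasonable interpretation, not the definitive implementation. In particular, gamma is taken here as the mean magnitude of the retained weights (a common choice for ternary quantization), and the scale is global rather than per-channel:

```python
import torch

def absmax_scale(w: torch.Tensor) -> torch.Tensor:
    # Global scale: the largest absolute weight magnitude.
    # Clamp to avoid division-by-zero issues downstream.
    return w.abs().max().clamp(min=1e-8)

def ternary_quantize(w: torch.Tensor, scale: torch.Tensor,
                     threshold_factor: float = 0.5) -> torch.Tensor:
    # Map weights to {-1, 0, +1}: zero out small entries, keep signs of the rest.
    threshold = threshold_factor * scale
    return torch.where(w.abs() > threshold, torch.sign(w), torch.zeros_like(w))

def weight_to_ternary(w: torch.Tensor):
    scale = absmax_scale(w)
    w_t = ternary_quantize(w, scale)
    # gamma rescales the ternary matrix to approximate the original weights;
    # the mean magnitude of the kept entries is one reasonable choice.
    mask = w_t != 0
    gamma = w[mask].abs().mean() if mask.any() else scale
    return w_t, gamma

def bitlinear_python(x, w_ternary, gamma, bias=None):
    # Core ternary matmul: dense matmul here, to be specialized later.
    out = torch.matmul(x, w_ternary.T) * gamma
    return out if bias is None else out + bias
```

With a scalar gamma the broadcast is trivial; a per-channel gamma vector would need the `.unsqueeze(0)` shown in the pseudocode above.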
**Testing strategy:**

- Compare output shapes with nn.Linear
- Verify ternary weight values
- Test conversion from pre-trained weights
- Validate in the Transformer example

### Phase 2: Memory Optimization

#### 2.1 Base-3 Packing (`bitlinear/packing.py`)

Implement packing for memory efficiency:

1. `pack_ternary_base3()` - 5 values per byte
2. `unpack_ternary_base3()` - Reverse operation
3. Verify the roundtrip: pack → unpack == identity

**Packing scheme:**

```
Map: -1 → 0, 0 → 1, +1 → 2 (base-3 digits)
Pack 5 digits per byte: d0 + d1*3 + d2*9 + d3*27 + d4*81
```

### Phase 3: C++ Extensions (Optional but Recommended)

#### 3.1 CPU Implementation (`bitlinear/cpp/bitlinear.cpp`)

1. Implement `bitlinear_cpu_forward()`
   - Basic matrix multiplication with ternary weights
   - Exploit the ternary structure (skip multiplications)
2. Implement `multi_ternary_cpu_forward()`
3. Test integration with Python

**Optimization opportunities (later):**

- AVX/AVX512 vectorization
- OpenMP parallelization
- Cache-efficient tiling

#### 3.2 CUDA Kernels (`bitlinear/cpp/bitlinear_kernel.cu`)

Only after the CPU version works!

1. Basic kernel without optimization
   - One thread per output element
   - Simple accumulation
2. Optimized kernel:
   - Shared memory tiling
   - Warp-level reductions
   - Memory coalescing
   - Exploit the ternary structure (conditional accumulation)
3. Advanced (optional):
   - Tensor Core utilization
   - Mixed precision
   - Fused kernels (activation quantization + matmul)

**Performance targets:**

- Should be faster than PyTorch's F.linear for large matrices
- Aim for a 2-5x speedup from the ternary optimization

### Phase 4: Training Support

#### 4.1 Quantization-Aware Training (QAT)

Modify layers to support gradient flow:

1. Straight-through estimator for ternary quantization
2. Learnable scaling factors (gamma)
3. Fine-tuning pre-trained models
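A common way to implement step 1 is a custom `torch.autograd.Function` that quantizes on the forward pass and passes gradients through unchanged on the backward pass. This is a minimal sketch; the class name `TernarySTE` and the absmax-based threshold are illustrative choices, not part of the skeleton:

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Ternary quantization with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, w, threshold_factor=0.5):
        # Same thresholding idea as ternary_quantize(): zero out small weights.
        threshold = threshold_factor * w.abs().max()
        return torch.where(w.abs() > threshold, torch.sign(w), torch.zeros_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pretend quantization was the identity.
        # Second return value is the (non-tensor) threshold_factor argument.
        return grad_output, None

# Usage inside a QAT forward pass:
w = torch.randn(4, 8, requires_grad=True)
w_t = TernarySTE.apply(w)
loss = w_t.sum()
loss.backward()
# w.grad now equals the gradient w.r.t. w_t (all ones here),
# so the dense weights keep receiving updates despite the hard quantization.
```

A learnable gamma (step 2) can then be a regular `nn.Parameter` multiplied outside the quantization function, so it trains with ordinary gradients.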
#### 4.2 Initialization Strategies

Experiment with initialization for ternary weights:

- Standard kaiming_uniform, then quantize
- Specialized initialization for ternary weights
- Better threshold selection

## Testing Strategy

### Unit Tests

Run frequently during development:

```bash
pytest tests/test_quantization.py -v
pytest tests/test_functional.py -v
pytest tests/test_layers.py -v
```

### Integration Tests

Test full pipelines:

1. Dense model → quantization → inference
2. Transformer with BitLinear layers
3. Save/load model checkpoints

### Numerical Correctness

Compare with a dense reference:

```python
import torch
import torch.nn as nn

from bitlinear import BitLinear

# Create the same layer in dense and ternary form
linear = nn.Linear(512, 512)
bitlinear = BitLinear.from_linear(linear)

x = torch.randn(32, 512)
out_dense = linear(x)
out_ternary = bitlinear(x)

# Should be similar (not identical, due to quantization)
error = torch.norm(out_dense - out_ternary) / torch.norm(out_dense)
print(f"Relative error: {error:.4f}")  # Expect ~0.1-0.3
```

## Common Pitfalls

### Quantization

- **Pitfall:** Wrong threshold → too many zeros, or too few
- **Solution:** Start with 0.5 * scale, tune empirically

### Shape Handling

- **Pitfall:** Broadcasting errors with gamma
- **Solution:** Use `.unsqueeze()` carefully, test various input shapes

### CUDA Compilation

- **Pitfall:** CUDA version mismatches
- **Solution:** Match PyTorch's CUDA version; use a CPU-only build first

### Gradients

- **Pitfall:** No gradient flow through ternary quantization
- **Solution:** Implement a straight-through estimator for QAT

## Performance Benchmarks

Create benchmarks to track progress:

```python
import time

import torch
import torch.nn as nn

from bitlinear import BitLinear

def benchmark(layer, x, n_runs=100):
    # Warmup
    for _ in range(10):
        _ = layer(x)
    # Synchronize around the timed region: CUDA kernels launch asynchronously.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        _ = layer(x)
    torch.cuda.synchronize()
    end = time.time()
    return (end - start) / n_runs

# Compare
linear = nn.Linear(2048, 2048).cuda()
bitlinear = BitLinear(2048, 2048).cuda()
x = torch.randn(128,
                2048).cuda()

time_linear = benchmark(linear, x)
time_bitlinear = benchmark(bitlinear, x)

print(f"nn.Linear: {time_linear*1000:.2f} ms")
print(f"BitLinear: {time_bitlinear*1000:.2f} ms")
print(f"Speedup: {time_linear/time_bitlinear:.2f}x")
```

## Next Steps After Skeleton

1. **Implement Phase 1** (Python baseline)
   - Start with `absmax_scale()` and `ternary_quantize()`
   - Test each function as you go
   - Don't move to the next phase until tests pass
2. **Validate with Examples**
   - Run `examples/basic_usage.py`
   - Run `examples/transformer_example.py`
   - Check output similarity and memory savings
3. **Optimize if Needed**
   - Profile to find bottlenecks
   - Implement C++/CUDA only after the Python version works
   - Measure performance improvements
4. **Documentation**
   - Add docstring details from the implementation
   - Create API documentation
   - Write usage tutorials

## Resources

### Papers

- BitNet: https://arxiv.org/abs/2310.11453
- Ternary Neural Networks: https://jmlr.org/papers/volume26/24-2050/24-2050.pdf

### PyTorch Resources

- Custom Extensions: https://pytorch.org/tutorials/advanced/cpp_extension.html
- CUDA Programming: https://pytorch.org/tutorials/advanced/custom_ops.html

### Quantization

- QAT Guide: https://pytorch.org/docs/stable/quantization.html
- Straight-through Estimator: Bengio et al., 2013

## Questions to Consider

As you implement, think about:

1. **Memory vs. Speed:** Packed weights save memory but need unpacking
2. **Training vs. Inference:** Different requirements for gradients
3. **Compatibility:** Should work with existing PyTorch features (DDP, AMP, etc.)
4. **Extensibility:** Easy to add new quantization schemes?

Good luck with the implementation! Start with correctness, then optimize.