BitLinear / read /IMPLEMENTATION_GUIDE.md

krisaujla

Upload folder using huggingface_hub

fd8c8b9 verified 5 days ago

preview code

raw

history blame contribute delete

8.01 kB

Implementation Guide

This document provides a roadmap for implementing the BitLinear functionality, following the structure defined in the project skeleton. This is here to give insight on how one can replicate this process to different operations.

Implementation Order

Phase 1: Python Baseline (Correctness First)

Start here to establish correctness before optimizing.

1.1 Quantization (`bitlinear/quantization.py`)

Order of implementation:

absmax_scale() - Simple max computation
ternary_quantize() - Threshold-based quantization to {-1, 0, +1}
weight_to_ternary() - Combines the above
Test thoroughly with tests/test_quantization.py

Key considerations:

Threshold selection (try 0.33 * scale or 0.5 * scale)
Per-channel vs. global scaling trade-offs
Numerical stability (avoid division by zero)

1.2 Functional Operations (`bitlinear/functional.py`)

Order of implementation:

bitlinear_python() - Core ternary matmul

# Pseudocode:
output = torch.matmul(x, W_ternary.T)
output = output * gamma.unsqueeze(0)
if bias is not None:
    output = output + bias
return output

greedy_ternary_decomposition() - Iterative residual quantization

# Pseudocode:
residual = W.clone()
for i in range(k):
    W_t, gamma = weight_to_ternary(residual)
    store W_t and gamma
    residual = residual - gamma * W_t

multi_ternary_linear_python() - Sum of k ternary operations
Test with tests/test_functional.py

1.3 Layer Modules (`bitlinear/layers.py`)

Order of implementation:

BitLinear.__init__() and reset_parameters()
- Initialize dense weights using kaiming_uniform
- Quantize to ternary using weight_to_ternary()
- Store as buffers or parameters
BitLinear.forward() - Call bitlinear_python()
BitLinear.from_linear() - Conversion utility
MultiTernaryLinear - Similar structure
convert_linear_to_bitlinear() - Recursive module conversion
Test with tests/test_layers.py

Testing strategy:

Compare output shapes with nn.Linear
Verify ternary weight values
Test conversion from pre-trained weights
Validate in Transformer example

Phase 2: Memory Optimization

2.1 Base-3 Packing (`bitlinear/packing.py`)

Implement packing for memory efficiency:

pack_ternary_base3() - 5 values per byte
unpack_ternary_base3() - Reverse operation
Verify roundtrip: pack → unpack == identity

Packing scheme:

Map: -1 → 0, 0 → 1, +1 → 2 (base-3 digits)
Pack 5 digits per byte: d0 + d1*3 + d2*9 + d3*27 + d4*81

Phase 3: C++ Extensions (Optional but Recommended)

3.1 CPU Implementation (`bitlinear/cpp/bitlinear.cpp`)

Implement bitlinear_cpu_forward()
- Basic matrix multiplication with ternary weights
- Exploit ternary structure (skip multiplications)
Implement multi_ternary_cpu_forward()
Test integration with Python

Optimization opportunities (later):

AVX/AVX512 vectorization
OpenMP parallelization
Cache-efficient tiling

3.2 CUDA Kernels (`bitlinear/cpp/bitlinear_kernel.cu`)

Only after CPU version works!

Basic kernel without optimization
- Thread per output element
- Simple accumulation
Optimized kernel:
- Shared memory tiling
- Warp-level reductions
- Memory coalescing
- Exploit ternary (conditional accumulation)
Advanced (optional):
- Tensor Core utilization
- Mixed precision
- Fused kernels (activation quantization + matmul)

Performance targets:

Should be faster than PyTorch's F.linear for large matrices
Aim for 2-5x speedup from ternary optimization

Phase 4: Training Support

4.1 Quantization-Aware Training (QAT)

Modify layers to support gradient flow:

Straight-through estimator for ternary quantization
Learnable scaling factors (gamma)
Fine-tuning pre-trained models

4.2 Initialization Strategies

Experiment with initialization for ternary weights:

Standard kaiming_uniform then quantize
Specialized initialization for ternary
Better threshold selection

Testing Strategy

Unit Tests

Run frequently during development:

pytest tests/test_quantization.py -v
pytest tests/test_functional.py -v
pytest tests/test_layers.py -v

Integration Tests

Test full pipelines:

Dense model → quantization → inference
Transformer with BitLinear layers
Save/load model checkpoints

Numerical Correctness

Compare with reference:

# Create same layer in dense and ternary
linear = nn.Linear(512, 512)
bitlinear = BitLinear.from_linear(linear)

x = torch.randn(32, 512)
out_dense = linear(x)
out_ternary = bitlinear(x)

# Should be similar (not identical due to quantization)
error = torch.norm(out_dense - out_ternary) / torch.norm(out_dense)
print(f"Relative error: {error:.4f}")  # Expect ~0.1-0.3

Common Pitfalls

Quantization

Pitfall: Wrong threshold → too many zeros or not enough
Solution: Start with 0.5 * scale, tune empirically

Shape Handling

Pitfall: Broadcasting errors with gamma
Solution: Use .unsqueeze() carefully, test various input shapes

CUDA Compilation

Pitfall: CUDA version mismatches
Solution: Match PyTorch's CUDA version, use CPU-only build first

Gradients

Pitfall: No gradient flow through ternary quantization
Solution: Implement straight-through estimator for QAT

Performance Benchmarks

Create benchmarks to track progress:

import time
import torch
from bitlinear import BitLinear

def benchmark(layer, x, n_runs=100):
    # Warmup
    for _ in range(10):
        _ = layer(x)
    
    # Benchmark
    start = time.time()
    for _ in range(n_runs):
        _ = layer(x)
    end = time.time()
    
    return (end - start) / n_runs

# Compare
linear = nn.Linear(2048, 2048).cuda()
bitlinear = BitLinear(2048, 2048).cuda()
x = torch.randn(128, 2048).cuda()

time_linear = benchmark(linear, x)
time_bitlinear = benchmark(bitlinear, x)

print(f"nn.Linear: {time_linear*1000:.2f} ms")
print(f"BitLinear: {time_bitlinear*1000:.2f} ms")
print(f"Speedup: {time_linear/time_bitlinear:.2f}x")

Next Steps After Skeleton

Implement Phase 1 (Python baseline)
- Start with absmax_scale() and ternary_quantize()
- Test each function as you go
- Don't move to next phase until tests pass
Validate with Examples
- Run examples/basic_usage.py
- Run examples/transformer_example.py
- Check output similarity and memory savings
Optimize if Needed
- Profile to find bottlenecks
- Implement C++/CUDA only after Python works
- Measure performance improvements
Documentation
- Add docstring details from implementation
- Create API documentation
- Write usage tutorials

Resources

Papers

BitNet: https://arxiv.org/abs/2310.11453
Ternary Neural Networks: https://jmlr.org/papers/volume26/24-2050/24-2050.pdf

PyTorch Resources

Custom Extensions: https://pytorch.org/tutorials/advanced/cpp_extension.html
CUDA Programming: https://pytorch.org/tutorials/advanced/custom_ops.html

Quantization

QAT Guide: https://pytorch.org/docs/stable/quantization.html
Straight-through Estimator: Bengio et al., 2013

Questions to Consider

As you implement, think about:

Memory vs. Speed: Packed weights save memory but need unpacking
Training vs. Inference: Different requirements for gradients
Compatibility: Should work with existing PyTorch features (DDP, AMP, etc.)
Extensibility: Easy to add new quantization schemes?

Good luck with implementation! Start with correctness, then optimize.

Implementation Guide

Implementation Order

Phase 1: Python Baseline (Correctness First)

1.1 Quantization (bitlinear/quantization.py)

1.2 Functional Operations (bitlinear/functional.py)

1.3 Layer Modules (bitlinear/layers.py)

Phase 2: Memory Optimization

2.1 Base-3 Packing (bitlinear/packing.py)

Phase 3: C++ Extensions (Optional but Recommended)

3.1 CPU Implementation (bitlinear/cpp/bitlinear.cpp)

3.2 CUDA Kernels (bitlinear/cpp/bitlinear_kernel.cu)

Phase 4: Training Support

4.1 Quantization-Aware Training (QAT)

4.2 Initialization Strategies

Testing Strategy

Unit Tests

Integration Tests

Numerical Correctness

Common Pitfalls

Quantization

Shape Handling

CUDA Compilation

Gradients

Performance Benchmarks

Next Steps After Skeleton

Resources

Papers

PyTorch Resources

Quantization

Questions to Consider

1.1 Quantization (`bitlinear/quantization.py`)

1.2 Functional Operations (`bitlinear/functional.py`)

1.3 Layer Modules (`bitlinear/layers.py`)

2.1 Base-3 Packing (`bitlinear/packing.py`)

3.1 CPU Implementation (`bitlinear/cpp/bitlinear.cpp`)

3.2 CUDA Kernels (`bitlinear/cpp/bitlinear_kernel.cu`)