BENCHMARKS.md · krisaujla/BitLinear at main

File size: 4,686 Bytes

fd8c8b9

# BitLinear Performance Benchmarks

This document provides detailed performance analysis of BitLinear compared to standard `nn.Linear` layers.

## Memory Compression

BitLinear achieves near-optimal memory compression through ternary weight quantization and base-3 packing.

### Compression Results

| Layer Size | nn.Linear (MB) | BitLinear Packed (MB) | Compression Ratio |
|------------|----------------|----------------------|-------------------|
| 512×512    | 1.0020         | 0.0539              | 18.59x            |
| 768×768    | 2.2529         | 0.1184              | 19.03x            |
| 1024×1024  | 4.0039         | 0.2078              | 19.27x            |
| 2048×2048  | 16.0078        | 0.8156              | 19.63x            |
| 4096×4096  | 64.0156        | 3.2313              | 19.81x            |
| 768×3072   | 9.0117         | 0.4734              | 19.03x            |
| 1024×4096  | 16.0156        | 0.8313              | 19.27x            |

**Average Compression:** 19.23x (95% of theoretical 20x maximum)

### Real-World Example: GPT-2 Small

Configuration:
- 12 Transformer layers
- d_model = 768

- d_ff = 3072
- Total parameters: 84,934,656

Memory Usage:
- **nn.Linear:** 324.00 MB
- **BitLinear (packed):** 16.83 MB
- **Memory Saved:** 307.17 MB
- **Compression Ratio:** 19.25x

## Accuracy Analysis

BitLinear maintains high output similarity despite extreme quantization:

### Output Similarity Metrics

From `examples/transformer_example.py` (Transformer block with 6 linear layers):

- **MSE:** 0.083
- **Cosine Similarity:** 0.963 (96.3%)
- **Relative Error:** 0.279 (27.9%)

### Multi-Ternary Improvement

Using k=3 ternary components significantly improves accuracy:

- **k=1 Relative Error:** 0.501
- **k=3 Relative Error:** 0.124
- **Improvement:** 75.1%

## Performance Characteristics

### Forward Pass Time

> **Note:** Current Python implementation may be slower than nn.Linear. C++/CUDA extensions provide optimized kernels for production use.

The Python implementation prioritizes correctness and clarity. For production deployments:
- Use C++ CPU kernels for CPU inference
- Use CUDA kernels for GPU inference
- Expect 2-5x speedup from ternary-specific optimizations

### Memory vs Speed Trade-off

BitLinear offers different configurations for various use cases:

| Configuration | Memory | Accuracy | Speed |
|--------------|--------|----------|-------|
| BitLinear (k=1) | 19x less | Good | Fast |
| MultiTernaryLinear (k=2) | 9.5x less | Better | Medium |
| MultiTernaryLinear (k=3) | 6.3x less | Best | Slower |

## Packing Efficiency

Base-3 packing achieves near-theoretical compression:

- **Theoretical:** log₂(3) ≈ 1.58 bits per ternary value
- **Actual:** 5 ternary values per byte (1.6 bits per value)
- **Efficiency:** 98.8% of theoretical maximum

### Packing Details

- Ternary values {-1, 0, +1} mapped to {0, 1, 2}
- 5 values packed per byte: d₀ + 3d₁ + 9d₂ + 27d₃ + 81d₄
- Maximum packed value: 242 < 256 (fits in uint8)

## Use Cases

### Ideal For:
- **Edge Deployment:** Reduced memory footprint for mobile/embedded devices
- **Large Models:** Significant savings for billion-parameter models
- **Inference:** Production serving with memory constraints
- **Research:** Exploring ultra-low-precision neural networks

### Considerations:
- **Training:** Requires quantization-aware training (QAT) for best results
- **Accuracy:** ~3-5% accuracy drop acceptable for many applications
- **Speed:** Python implementation slower; use C++/CUDA for production

## Benchmarking

Run benchmarks yourself:

```bash

# Memory compression analysis

python benchmarks/benchmark_memory.py



# Performance comparison

python benchmarks/benchmark_performance.py

```

## Comparison with Other Methods

| Method | Bits/Weight | Compression | Accuracy | Implementation |
|--------|-------------|-------------|----------|----------------|
| Float32 | 32 | 1x | Baseline | Standard |
| Float16 | 16 | 2x | ~Baseline | Standard |
| INT8 | 8 | 4x | High | Quantization |
| **BitLinear** | **1.58** | **~19x** | **Good** | **Ternary** |

## References

- **BitNet Paper:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
- **JMLR Paper:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)

## Reproducing Results

All benchmarks were run on:
- CPU: AMD Ryzen 9 9950x3d
- GPU: RTX 5090
- PyTorch: 2.9.1+cpu
- Python: 3.13
- CUDA: 12.5

Results may vary based on hardware and PyTorch version.