# BitLinear Project Structure

Complete directory tree and file descriptions.

```
BitLinear/
│
├── README.md                  # Project overview and quick start
├── LICENSE                    # MIT License
├── setup.py                   # Build system with torch.utils.cpp_extension
├── pyproject.toml             # Tool configurations (pytest, black, mypy)
├── requirements.txt           # Core dependencies
├── requirements-dev.txt       # Development dependencies
├── .gitignore                 # Git ignore rules
├── IMPLEMENTATION_GUIDE.md    # Step-by-step implementation roadmap
│
├── bitlinear/                 # Main package
│   ├── __init__.py            # Package exports
│   ├── layers.py              # BitLinear and MultiTernaryLinear modules
│   ├── functional.py          # Core functional implementations
│   ├── quantization.py        # Ternary quantization utilities
│   ├── packing.py             # Base-3 packing for memory efficiency
│   │
│   └── cpp/                   # C++/CUDA extensions
│       ├── bitlinear.cpp      # PyBind11 bindings and CPU implementation
│       └── bitlinear_kernel.cu  # CUDA kernel implementations
│
├── tests/                     # Test suite
│   ├── __init__.py
│   ├── test_functional.py     # Tests for functional API
│   ├── test_layers.py         # Tests for layer modules
│   └── test_quantization.py   # Tests for quantization and packing
│
└── examples/                  # Usage examples
    ├── basic_usage.py         # Simple usage demonstration
    └── transformer_example.py # Transformer integration example
```

## File Descriptions

### Root Level

- **README.md**: Project overview, installation instructions, quick start guide, and citations
- **LICENSE**: MIT License for open-source distribution
- **setup.py**: Build configuration using PyTorch's cpp_extension; handles CPU/CUDA builds
- **pyproject.toml**: Configuration for pytest, black, mypy, and coverage
- **requirements.txt**: Core runtime dependencies (torch, numpy)
- **requirements-dev.txt**: Development tools (pytest, black, flake8, mypy)
- **.gitignore**: Ignores Python caches, build artifacts, CUDA objects
- **IMPLEMENTATION_GUIDE.md**: Detailed implementation roadmap with phases and best practices

### bitlinear/ (Main Package)

#### Python Modules
- **`__init__.py`**: Package initialization; exports main classes and functions
- **`layers.py`**: nn.Module implementations
  - `BitLinear`: Drop-in replacement for nn.Linear with ternary weights
  - `MultiTernaryLinear`: Sum of k ternary components
  - `convert_linear_to_bitlinear()`: Recursive model conversion utility
- **`functional.py`**: Core functional implementations
  - `bitlinear_python()`: Pure PyTorch ternary matmul with scaling
  - `greedy_ternary_decomposition()`: Iterative residual quantization
  - `multi_ternary_linear_python()`: Multi-component forward pass
  - `activation_quant()`: Activation quantization for full BitNet
- **`quantization.py`**: Quantization utilities
  - `absmax_scale()`: Compute absmax scaling factors
  - `ternary_quantize()`: Quantize to {-1, 0, +1}
  - `weight_to_ternary()`: Full quantization pipeline
  - `quantize_activations_absmax()`: 8-bit activation quantization
  - `dequantize_scale()`: Reverse quantization
- **`packing.py`**: Memory optimization
  - `pack_ternary_base3()`: Pack 5 ternary values per byte
  - `unpack_ternary_base3()`: Unpack base-3 encoded weights
  - `compute_compression_ratio()`: Calculate compression statistics
  - `estimate_memory_savings()`: Memory estimation utilities

#### C++/CUDA Extensions

- **`cpp/bitlinear.cpp`**: C++ interface
  - PyBind11 module definition
  - CPU implementations: `bitlinear_cpu_forward()`, `multi_ternary_cpu_forward()`
  - Device dispatcher (routes to CPU or CUDA)
  - Packing utilities in C++
- **`cpp/bitlinear_kernel.cu`**: CUDA kernels
  - `bitlinear_forward_kernel()`: Optimized ternary matmul kernel
  - `multi_ternary_forward_kernel()`: Fused multi-component kernel
  - Kernel launchers with error handling
  - TODO: Tensor Core optimization

### tests/

Comprehensive test suite using pytest:

- **`test_functional.py`**: Tests for functional API
  - Shape correctness
  - Numerical correctness vs.
nn.Linear
  - Greedy decomposition quality
  - Multi-ternary equivalence
- **`test_layers.py`**: Tests for layer modules
  - Initialization and parameter counts
  - Forward pass shapes
  - Compatibility with nn.Linear
  - Conversion utilities
  - Gradient flow (QAT)
  - Integration with Transformer blocks
- **`test_quantization.py`**: Tests for quantization
  - Absmax scaling (global and per-channel)
  - Ternary quantization values and thresholds
  - Reconstruction quality
  - Base-3 packing roundtrip
  - Compression ratios
  - Memory estimation

### examples/

Demonstration scripts:

- **`basic_usage.py`**: Minimal example showing the basic API
  - Creating BitLinear layers
  - Forward pass
  - Conversion from nn.Linear
- **`transformer_example.py`**: Realistic Transformer example
  - Complete Transformer block implementation
  - Conversion to BitLinear
  - Output comparison
  - Memory savings calculation

## Key Design Patterns

### 1. Progressive Enhancement

- Python baseline → C++ CPU → CUDA GPU
- Each layer is fully functional before the next is added

### 2. Drop-in Compatibility

- Same interface as nn.Linear
- Same initialization arguments
- Same forward signature
- Works with existing PyTorch features

### 3. Modular Testing

- Unit tests for each component
- Integration tests for full pipelines
- Performance benchmarks kept separate

### 4. Extensive Documentation

- Docstrings explain the mathematical operations
- TODO comments mark implementation points
- References to papers for the algorithms
- Type hints for clarity

## Build Targets

### CPU-only (Development)

```bash
pip install -e .
```

### With CUDA (Production)

```bash
CUDA_HOME=/usr/local/cuda pip install -e .
```

### Testing

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## What's NOT Implemented Yet

All files are **stubs with TODOs**:

- ✅ Structure is complete
- ✅ Interfaces are defined
- ✅ Documentation is written
- ❌ Logic is NOT implemented (by design)
- ❌ Tests will skip/fail until implementation

## Next Steps

Follow IMPLEMENTATION_GUIDE.md:
1. Start with `quantization.py` (`absmax_scale`, `ternary_quantize`)
2. Move to `functional.py` (`bitlinear_python`)
3. Implement `layers.py` (the `BitLinear` module)
4. Test with the examples
5. Add C++/CUDA if needed

## Design Philosophy

**Correctness > Speed > Memory**

1. First make it work (Python)
2. Then make it fast (C++/CUDA)
3. Then make it efficient (packing)

Every component is:

- Well-documented
- Testable
- Modular
- Extensible
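## Sketches of the Core Ideas

Since the stubs are not yet implemented, here are hedged sketches of the central algorithms, in roadmap order. First, ternary weight quantization: the function names `absmax_scale` and `ternary_quantize` match the stubs above, but the specific scaling rule shown (mean absolute value, as used in BitNet b1.58-style quantization) is an assumption; the project may choose plain absmax instead.

```python
import torch

def absmax_scale(w: torch.Tensor) -> torch.Tensor:
    """One plausible per-tensor scaling factor: the mean absolute value,
    clamped away from zero so division is safe."""
    return w.abs().mean().clamp(min=1e-8)

def ternary_quantize(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1}: divide by the scale, round,
    and clip to the ternary range."""
    return (w / scale).round().clamp(-1, 1)

# Dequantization is just q * scale, so reconstruction error is |w - q*scale|.
```

Multiplying the ternary tensor back by the scale gives the dequantized approximation, which is what the tests' "reconstruction quality" checks would compare against the original weights.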
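`greedy_ternary_decomposition` is described only as "iterative residual quantization". One plausible reading (an assumption, not necessarily the project's exact algorithm): at each of k steps, quantize the current residual to a ternary mask, fit the least-squares-optimal scale for that mask, subtract the scaled component, and repeat.

```python
import torch

def greedy_ternary_decomposition(w: torch.Tensor, k: int = 2):
    """Approximate w as sum_i alpha_i * T_i with each T_i in {-1, 0, +1},
    greedily quantizing the residual at every step."""
    residual = w.clone()
    alphas, ternaries = [], []
    for _ in range(k):
        scale = residual.abs().mean().clamp(min=1e-8)
        t = (residual / scale).round().clamp(-1, 1)
        # Least-squares optimal scale for this fixed mask: <r, t> / <t, t>.
        alpha = (residual * t).sum() / (t * t).sum().clamp(min=1.0)
        residual = residual - alpha * t
        alphas.append(alpha)
        ternaries.append(t)
    return alphas, ternaries
```

Because each alpha is a least-squares projection, the residual norm is non-increasing in k, which is what a "greedy decomposition quality" test could assert.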
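The `BitLinear` layer itself must be a drop-in replacement for nn.Linear that still supports gradient flow for QAT. A minimal sketch using a straight-through estimator (the STE is a standard QAT trick; whether this project uses exactly this formulation is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch: same constructor and forward signature as nn.Linear,
    but the forward pass uses ternarized weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().mean().clamp(min=1e-8)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: forward sees the quantized weights,
        # backward treats quantization as identity so grads reach self.weight.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w, self.bias)
```

Subclassing nn.Linear keeps initialization arguments, parameter names, and the forward signature identical, which is what makes `convert_linear_to_bitlinear()`-style recursive swapping straightforward.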
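Finally, packing: `pack_ternary_base3` stores 5 ternary values per byte, which works because 3^5 = 243 ≤ 256. A NumPy sketch of the roundtrip (function names match the stubs; the exact byte layout, here little-endian base-3 digits, is an assumption):

```python
import numpy as np

_POWERS = np.array([1, 3, 9, 27, 81], dtype=np.uint16)

def pack_ternary_base3(trits):
    """Pack a flat array of {-1, 0, +1} into uint8, 5 trits per byte.
    Returns (packed, pad) where pad is the number of padding trits added."""
    t = np.asarray(trits, dtype=np.int8) + 1          # map to {0, 1, 2}
    pad = (-len(t)) % 5
    t = np.concatenate([t, np.zeros(pad, dtype=np.int8)])
    groups = t.reshape(-1, 5).astype(np.uint16)
    return (groups * _POWERS).sum(axis=1).astype(np.uint8), pad

def unpack_ternary_base3(packed, pad):
    """Inverse of pack_ternary_base3: recover base-3 digits, then shift
    back to {-1, 0, +1} and drop the padding."""
    digits = packed.astype(np.uint16)[:, None] // _POWERS % 3
    trits = digits.astype(np.int8).reshape(-1) - 1
    return trits[:len(trits) - pad] if pad else trits
```

Relative to fp32 weights this is a 20x size reduction (5 values per byte vs. 20 bytes), which is the kind of figure `compute_compression_ratio()` would report.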