tags:
- ternary
- 1.58-bit
- phi-4
- experimental
library_name: safetensors
pipeline_tag: text-generation
language:
- en
---

# Phi-4-BitNet-1.58b

BitNet 1.58-bit ternary quantization of Microsoft's Phi-4 14B model.

## Overview

This is an **experimental** BitNet 1.58-bit quantization of the Phi-4 model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}), packed four values per byte.

**This is research/experimental work. Quality and performance have not been formally benchmarked.**

## Specifications

| Property | Value |
|----------|-------|
| Base Model | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
| Architecture | Phi-3 (Phi3ForCausalLM) |
| Parameters | 14B |
| Quantization | BitNet 1.58-bit ternary |
| Bits per Weight | ~1.58 |
| Group Size | 64 |
| Original Size | 29.32 GB (BF16) |
| Quantized Size | 4.58 GB (SafeTensors) |
| GGUF Size | 5.57 GB (TQ2_0) |
| Compression | ~6.4x |
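As a sanity check on the sizes above, the on-disk footprint can be estimated from the packing scheme (a back-of-the-envelope estimate; the remaining gap to 4.58 GB is assumed to come from tensors left unquantized and file metadata):

```python
params = 14e9          # Phi-4 parameter count
packed_bits = 2.0      # 4 ternary values per byte => 2 bits of storage per value
scale_bits = 32 / 64   # one F32 scale per 64-weight group => 0.5 bits per value
est_gb = params * (packed_bits + scale_bits) / 8 / 1e9
print(f"{est_gb:.2f} GB")  # ~4.4 GB, close to the 4.58 GB on disk
```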

## Formats

| Format | File | Description |
|--------|------|-------------|
| SafeTensors | `model-*.safetensors` | Sharded quantized weights + scales |
| GGUF | `phi4-tq2.gguf` | llama.cpp compatible |
## Quantization Method

### Algorithm

1. Reshape weights into groups of 64
2. Compute per-group scale: `scale = mean(|weights|)`
3. Normalize and round to the nearest ternary value: `q = round(w / scale)`, clamped to {-1, 0, +1}
4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
5. Pack 4 values per byte: `v0 + v1*3 + v2*9 + v3*27`
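The five steps above can be sketched in NumPy (an illustration of the algorithm, not the actual Rust implementation; the epsilon guard against all-zero groups is an assumption):

```python
import numpy as np

def quantize_bitnet(weights, group_size=64):
    """Group-wise absmean ternary quantization with base-3 packing (illustrative)."""
    w = weights.reshape(-1, group_size)
    scales = np.mean(np.abs(w), axis=1, keepdims=True)          # per-group absmean scale
    scales = np.maximum(scales, 1e-8)                           # guard all-zero groups
    q = np.clip(np.round(w / scales), -1, 1).astype(np.int8)    # ternary {-1, 0, +1}
    u = (q + 1).astype(np.uint8).reshape(-1, 4)                 # map to {0, 1, 2}
    packed = u[:, 0] + u[:, 1] * 3 + u[:, 2] * 9 + u[:, 3] * 27  # 4 values per byte
    return packed.astype(np.uint8), scales.ravel()
```

The packed bytes fall in 0–80, so each byte encodes exactly four ternary digits.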
### Tooling

- **Quantization**: custom Rust tool built on [Candle](https://github.com/huggingface/candle)
- **GGUF Conversion**: [llama.cpp](https://github.com/ggerganov/llama.cpp) `convert_hf_to_gguf.py`
### Hardware Used

- GPU: NVIDIA RTX 5080 (16GB VRAM)
- Quantization time: ~100 seconds
- Memory: streaming mode with CPU fallback for large tensors

## Usage

### With Ollama/llama.cpp

```bash
# Ollama (after importing the GGUF as a local model)
ollama run phi4-bitnet

# llama.cpp
./llama-cli -m phi4-tq2.gguf -p "Your prompt here"
```

### Unpacking Weights (Python)

```python
def unpack_ternary(packed_byte):
    """Unpack 4 ternary values from one byte."""
    values = []
    val = packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
        val //= 3
    return values
```
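For whole tensors, the per-byte helper above can be vectorized with NumPy and combined with the per-group scales (a sketch under the packing scheme described earlier; the exact tensor and scale names inside the shards are not specified here):

```python
import numpy as np

def unpack_ternary_array(packed):
    """Vectorized unpack: uint8 array -> int8 ternary values (4 per byte)."""
    p = packed.astype(np.int16)
    digits = np.stack([(p // d) % 3 for d in (1, 3, 9, 27)], axis=-1)
    return (digits - 1).reshape(-1).astype(np.int8)  # {0,1,2} -> {-1,0,+1}

def dequantize(packed, scales, group_size=64):
    """Apply per-group absmean scales to the unpacked ternary values."""
    tern = unpack_ternary_array(packed).reshape(-1, group_size)
    return (tern * scales.reshape(-1, 1)).astype(np.float32)
```

For example, the byte 21 decodes to `[-1, 0, +1, -1]`.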

## Limitations

- **Quality not benchmarked** - output quality may degrade significantly versus the original model
- **Requires custom runtime** - standard transformers does not support ternary weights
- **Experimental** - not intended for production use without evaluation
- GGUF keeps embeddings/lm_head at F16, hence larger than the SafeTensors shards

## License

MIT License (inherited from [microsoft/phi-4](https://huggingface.co/microsoft/phi-4))

## Citation

```bibtex
@misc{phi4-bitnet-2025,
  title={Phi-4-BitNet-1.58b: Experimental BitNet Quantization of Phi-4},
  author={Tzervas},
  year={2025},
  url={https://huggingface.co/tzervas/phi-4-bitnet-1.58b}
}
```