tzervas committed
Commit 8b58fec · verified · 1 Parent(s): 54ac4d0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +54 -212
README.md CHANGED
@@ -7,259 +7,101 @@ tags:
  - ternary
  - 1.58-bit
  - phi-4
- - code-generation
- - compressed
  library_name: safetensors
  pipeline_tag: text-generation
  language:
  - en
- - de
- - fr
- - es
- - pt
- - it
- - nl
- - pl
- - ru
- - ja
- - ko
- - zh
- model-index:
- - name: phi-4-bitnet-1.58b
-   results: []
  ---

  # Phi-4-BitNet-1.58b

- <div align="center">

- **Production-Grade BitNet 1.58-bit Ternary Quantization**

- [![Model Size](https://img.shields.io/badge/Size-4.58GB-blue)](.)
- [![Compression](https://img.shields.io/badge/Compression-6.4x-green)](.)
- [![Format](https://img.shields.io/badge/Format-SafeTensors%20%7C%20GGUF-orange)](.)

- </div>

- ## Executive Summary
-
- This repository contains a BitNet 1.58-bit ternary quantization of Microsoft's Phi-4 14B parameter model. The quantization achieves **6.4x compression** while maintaining model architecture compatibility, reducing storage requirements from 29.32 GB to 4.58 GB.
-
- **Key Benefits for Enterprise Deployment:**
- - **84% storage reduction** - Significant infrastructure cost savings
- - **Reduced memory bandwidth** - Faster inference on memory-bound systems
- - **Ternary weights** - Enable specialized hardware acceleration (BitNet kernels)
- - **GGUF format included** - Direct compatibility with llama.cpp and Ollama
-
- ---
-
- ## Model Specifications
-
- ### Quantization Summary

  | Property | Value |
  |----------|-------|
- | **Base Model** | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
- | **Architecture** | Phi-3 (Phi3ForCausalLM) |
- | **Parameters** | 14B |
- | **Quantization Method** | BitNet 1.58-bit ({-1, 0, +1}) |
- | **Bits per Weight** | 1.58 |
- | **Group Size** | 64 |
- | **Original Size** | 29.32 GB (BF16) |
- | **Quantized Size** | 4.58 GB (SafeTensors) / 5.57 GB (GGUF TQ2_0) |
- | **Compression Ratio** | 6.40x |
-
- ### Available Formats
-
- | Format | File | Size | Use Case |
- |--------|------|------|----------|
- | SafeTensors (Sharded) | `model-*.safetensors` | 4.58 GB | Custom BitNet inference engines |
- | GGUF TQ2_0 | `phi4-tq2.gguf` | 5.57 GB | llama.cpp, Ollama, LM Studio |
-
- ---
-
- ## Technical Implementation
-
- ### Quantization Algorithm
-
- The BitNet 1.58-bit quantization implements the following process:
-
- ```
- 1. Group Partitioning: W_grouped = reshape(W, [out_features, in_features/group_size, group_size])
- 2. Scale Computation: scale[g] = mean(|W_grouped[:, g, :]|) + eps
- 3. Normalization: W_norm = W_grouped / scale
- 4. Ternary Quantization: Q = clip(round(W_norm), -1, +1)
- 5. Value Mapping: Q_stored = Q + 1 → {-1,0,+1} → {0,1,2}
- 6. Bit Packing: 4 ternary values per byte using base-3 encoding
- ```
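Steps 1-5 of the process above can be sketched in NumPy. This is illustrative only — the actual quantizer is the custom Rust tool described below, and the function and variable names here are assumptions, not the tool's API:

```python
import numpy as np

def bitnet_quantize(w, group_size=64, eps=1e-6):
    """Sketch of grouped absmean ternary quantization (steps 1-5 above)."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)  # 1. group partitioning
    scale = np.abs(g).mean(axis=-1, keepdims=True) + eps                # 2. absmean scale per group
    q = np.clip(np.round(g / scale), -1, 1).astype(np.int8)            # 3-4. normalize, ternary round
    return q + 1, scale.squeeze(-1)                                     # 5. map {-1,0,+1} -> {0,1,2}

# One 4-weight group as a toy example:
q, s = bitnet_quantize(np.array([[0.9, -0.05, -1.1, 0.4]], dtype=np.float32), group_size=4)
```

Note that the clip in step 4 saturates outliers (here -1.1 normalizes to about -1.8 and rounds to -2, which clips to -1).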
-
- ### Ternary Encoding Scheme
-
- ```
- Packing Formula: packed_byte = v0 + v1×3 + v2×9 + v3×27
-
- Example:
-   Values: [-1, 0, +1, -1] → [0, 1, 2, 0]
-   Packed: 0 + 1×3 + 2×9 + 0×27 = 21
-
- Valid Range: 0-80 (fits in a single byte)
- Storage Efficiency: log₂(3⁴) ≈ 6.34 bits for 4 values ≈ 1.58 bits/weight
- ```
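The packing formula can be written as a small helper. `pack_ternary` is a hypothetical name for illustration, not a function from the repository:

```python
def pack_ternary(values):
    """Pack 4 ternary values {-1, 0, +1} into one byte using base-3 encoding."""
    assert len(values) == 4
    byte = 0
    for i, v in enumerate(values):
        byte += (v + 1) * 3 ** i  # map {-1,0,+1} -> {0,1,2}, then weight by 3^i
    return byte

pack_ternary([-1, 0, 1, -1])  # the worked example above: 0 + 1*3 + 2*9 + 0*27 = 21
```

The maximum packed value is 2 + 2×3 + 2×9 + 2×27 = 80, matching the valid range stated above.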
-
- ### File Structure
-
- ```
- phi-4-bitnet-1.58b/
- ├── model-00001-of-00006.safetensors   # Quantized weights (packed U8) + scales (F32)
- ├── model-00002-of-00006.safetensors
- ├── model-00003-of-00006.safetensors
- ├── model-00004-of-00006.safetensors
- ├── model-00005-of-00006.safetensors
- ├── model-00006-of-00006.safetensors
- ├── model.safetensors.index.json       # Shard index
- ├── phi4-tq2.gguf                      # GGUF ternary format
- ├── config.json                        # Model configuration
- ├── tokenizer.json                     # HuggingFace tokenizer
- ├── tokenizer_config.json
- └── README.md
- ```
-
- ---
-
- ## Quantization Tooling & Methodology
-
- ### Tools Used
-
- | Component | Tool/Library | Version | Purpose |
- |-----------|--------------|---------|---------|
- | **Quantization Engine** | Custom Rust implementation | - | BitNet 1.58-bit quantization |
- | **Tensor Library** | [Candle](https://github.com/huggingface/candle) | 0.8+ | GPU-accelerated tensor operations |
- | **GGUF Conversion** | [llama.cpp](https://github.com/ggerganov/llama.cpp) | Latest | TQ2_0 ternary GGUF format |
- | **Serialization** | SafeTensors | 0.4+ | Efficient weight storage |
-
- ### Quantization Process
-
- ```bash
- # BitNet quantization (custom Rust tool)
- ./quantize \
-   --model microsoft/phi-4 \
-   --method bitnet \
-   --group-size 64 \
-   --output ./phi4-bitnet-1.58b \
-   --format safetensors \
-   --streaming enabled
-
- # GGUF conversion (llama.cpp)
- python convert_hf_to_gguf.py ./phi4-bitnet-1.58b \
-   --outtype tq2_0 \
-   --outfile phi4-tq2.gguf
- ```
-
- ### Hardware Requirements
-
- **Quantization was performed on:**
- - **GPU**: NVIDIA RTX 5080 (16GB VRAM)
- - **Quantization Time**: ~100 seconds
- - **Processing Rate**: 2.4 tensors/second
- - **Peak Memory**: <8GB VRAM (streaming mode with CPU fallback for large tensors)
-
- **Memory Management:**
- - Tensors >3GB automatically fall back to CPU processing
- - Streaming quantization processes one shard at a time
- - Enables quantization of models larger than GPU memory
-
- ---
 
  ## Usage

- ### With llama.cpp / Ollama
-
  ```bash
- # Using Ollama
- ollama create phi4-bitnet -f Modelfile
- ollama run phi4-bitnet
-
- # Using llama.cpp directly
- ./llama-cli -m phi4-tq2.gguf -p "Write a Python function to sort a list:"
  ```

- ### Custom BitNet Inference
-
  ```python
- from safetensors import safe_open
- import numpy as np
-
  def unpack_ternary(packed_byte):
-     """Unpack 4 ternary values from a byte."""
      values = []
      val = packed_byte
      for _ in range(4):
          values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
          val //= 3
      return values
-
- # Load quantized weights
- with safe_open("model-00001-of-00006.safetensors", framework="np") as f:
-     packed_weights = f.get_tensor("model.layers.0.self_attn.q_proj.weight")
-     scales = f.get_tensor("model.layers.0.self_attn.q_proj.weight.scales")
-
- # Dequantize: unpack and multiply by scales
- # ... implementation specific to inference engine
  ```
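One way the dequantization step could look, assuming row-major packing and one F32 scale per group — the exact shard layout is not documented here, so treat this as a sketch rather than the repository's method:

```python
import numpy as np

def dequantize(packed, scales, group_size=64):
    """Unpack base-3 packed bytes to ternary values and rescale per group (layout assumed)."""
    flat = packed.astype(np.int64).ravel()
    digits = np.stack([(flat // 3**i) % 3 for i in range(4)], axis=-1)  # 4 base-3 digits per byte
    ternary = (digits - 1).astype(np.float32).reshape(-1, group_size)   # {0,1,2} -> {-1,0,+1}
    return ternary * scales.reshape(-1, 1)                              # per-group rescale

# Toy check: byte 21 encodes [-1, 0, +1, -1]; with a scale of 0.5 for a 4-value group:
w = dequantize(np.array([21], dtype=np.uint8), np.array([0.5], dtype=np.float32), group_size=4)
```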

- ---
-
- ## Performance Considerations
-
- ### Memory Bandwidth Analysis
-
- | Format | Memory Footprint | Relative Bandwidth |
- |--------|------------------|--------------------|
- | BF16 (Original) | 29.32 GB | 1.0x |
- | BitNet 1.58b (This) | 4.58 GB | **6.4x reduction** |
- | GGUF TQ2_0 | 5.57 GB | 5.3x reduction |
-
- ### Inference Optimization
-
- For optimal performance:
- 1. **BitNet Kernels**: Use specialized ternary matrix multiplication (no floating-point multiply)
- 2. **Memory Mapping**: GGUF format supports mmap for efficient loading
- 3. **Batch Processing**: Amortize dequantization overhead across the batch dimension
-
- ---
-
- ## Limitations & Considerations

- 1. **Quality Degradation**: Extreme quantization may impact model quality for complex tasks
- 2. **Custom Runtime Required**: The standard transformers library doesn't support ternary inference
- 3. **Experimental**: Designed for research and edge deployment experimentation
- 4. **Evaluation Pending**: Formal benchmark comparisons are in progress
-
- ---
 
  ## License

- This quantized model inherits the license from the base [Microsoft Phi-4 model](https://huggingface.co/microsoft/phi-4) (MIT License).
-
- ---

  ## Citation

  ```bibtex
  @misc{phi4-bitnet-2025,
-   title={Phi-4-BitNet-1.58b: Production BitNet Quantization of Phi-4},
    author={Tzervas},
    year={2025},
-   url={https://huggingface.co/tzervas/phi-4-bitnet-1.58b},
-   note={BitNet 1.58-bit ternary quantization achieving 6.4x compression}
  }
  ```
-
- ---
-
- ## Acknowledgments
-
- - **Microsoft Research** for the Phi-4 base model
- - **BitNet** paper authors for the 1.58-bit quantization methodology
- - **Hugging Face Candle** for the Rust ML framework
- - **llama.cpp** for the GGUF format and inference engine
 
  - ternary
  - 1.58-bit
  - phi-4
+ - experimental
  library_name: safetensors
  pipeline_tag: text-generation
  language:
  - en
  ---

  # Phi-4-BitNet-1.58b

+ BitNet 1.58-bit ternary quantization of Microsoft's Phi-4 14B model.

+ ## Overview

+ This is an **experimental** BitNet 1.58-bit quantization of the Phi-4 model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}) packed 4 values per byte.

+ **This is research/experimental work. Quality and performance have not been formally benchmarked.**

+ ## Specifications

  | Property | Value |
  |----------|-------|
+ | Base Model | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
+ | Architecture | Phi-3 (Phi3ForCausalLM) |
+ | Parameters | 14B |
+ | Quantization | BitNet 1.58-bit ternary |
+ | Bits per Weight | ~1.58 |
+ | Group Size | 64 |
+ | Original Size | 29.32 GB (BF16) |
+ | Quantized Size | 4.58 GB (SafeTensors) |
+ | GGUF Size | 5.57 GB (TQ2_0) |
+ | Compression | ~6.4x |
+
+ ## Formats
+
+ | Format | File | Description |
+ |--------|------|-------------|
+ | SafeTensors | `model-*.safetensors` | Sharded quantized weights + scales |
+ | GGUF | `phi4-tq2.gguf` | llama.cpp compatible |
+
+ ## Quantization Method
+
+ ### Algorithm
+ 1. Reshape weights into groups of 64
+ 2. Compute per-group scale: `scale = mean(|weights|)`
+ 3. Normalize and round to nearest ternary: `q = round(w / scale)` clamped to {-1, 0, +1}
+ 4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
+ 5. Pack 4 values per byte: `v0 + v1*3 + v2*9 + v3*27`
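The five steps can be sketched for a single group in plain Python — a compact, standalone illustration with assumed names, not the Rust tool's implementation:

```python
def quantize_group(weights, eps=1e-6):
    """Sketch of steps 1-5 for one group of weights (illustrative only)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps      # step 2: absmean scale
    q = [max(-1, min(1, round(w / scale))) for w in weights]       # step 3: ternary round + clamp
    u = [v + 1 for v in q]                                         # step 4: {-1,0,+1} -> {0,1,2}
    packed = [sum(u[i + j] * 3**j for j in range(4))               # step 5: 4 values per byte
              for i in range(0, len(u), 4)]
    return packed, scale

packed, scale = quantize_group([0.9, -0.05, -1.1, 0.4])
```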
+
+ ### Tooling
+ - **Quantization**: Custom Rust tool using [Candle](https://github.com/huggingface/candle)
+ - **GGUF Conversion**: [llama.cpp](https://github.com/ggerganov/llama.cpp) `convert_hf_to_gguf.py`
+
+ ### Hardware Used
+ - GPU: NVIDIA RTX 5080 (16GB VRAM)
+ - Quantization time: ~100 seconds
+ - Memory: Streaming mode with CPU fallback for large tensors

  ## Usage

+ ### With Ollama/llama.cpp

  ```bash
+ # llama.cpp
+ ./llama-cli -m phi4-tq2.gguf -p "Your prompt here"
  ```

+ ### Unpacking Weights (Python)

  ```python
  def unpack_ternary(packed_byte):
+     """Unpack 4 ternary values from byte."""
      values = []
      val = packed_byte
      for _ in range(4):
          values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
          val //= 3
      return values
  ```
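A standalone round-trip of the decoder, multiplying by a hypothetical per-group scale of 0.25 (the decoder is repeated here so the snippet runs on its own):

```python
def unpack_ternary(packed_byte):
    """Decode 4 ternary values from one base-3 packed byte (same logic as above)."""
    values, val = [], packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0,1,2} -> {-1,0,+1}
        val //= 3
    return values

# Byte 21 encodes [-1, 0, +1, -1]; multiply by the group's F32 scale to dequantize:
weights = [t * 0.25 for t in unpack_ternary(21)]
```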

+ ## Limitations

+ - **Quality not benchmarked** - May show significant degradation vs the original model
+ - **Requires custom runtime** - The standard transformers library doesn't support ternary weights
+ - **Experimental** - Not intended for production use without evaluation
+ - GGUF keeps embeddings/lm_head at F16, hence larger than the SafeTensors export
 

  ## License

+ MIT License (inherited from [microsoft/phi-4](https://huggingface.co/microsoft/phi-4))

  ## Citation

  ```bibtex
  @misc{phi4-bitnet-2025,
+   title={Phi-4-BitNet-1.58b: Experimental BitNet Quantization of Phi-4},
    author={Tzervas},
    year={2025},
+   url={https://huggingface.co/tzervas/phi-4-bitnet-1.58b}
  }
  ```