tzervas committed
Commit 8b58fec · verified · 1 Parent(s): 54ac4d0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +54 -212
README.md CHANGED
@@ -7,259 +7,101 @@ tags:
  - ternary
  - 1.58-bit
  - phi-4
- - code-generation
- - compressed
  library_name: safetensors
  pipeline_tag: text-generation
  language:
  - en
- - de
- - fr
- - es
- - pt
- - it
- - nl
- - pl
- - ru
- - ja
- - ko
- - zh
- model-index:
- - name: phi-4-bitnet-1.58b
-   results: []
  ---

  # Phi-4-BitNet-1.58b

- <div align="center">

- **Production-Grade BitNet 1.58-bit Ternary Quantization**

- [![Model Size](https://img.shields.io/badge/Size-4.58GB-blue)](.)
- [![Compression](https://img.shields.io/badge/Compression-6.4x-green)](.)
- [![Format](https://img.shields.io/badge/Format-SafeTensors%20%7C%20GGUF-orange)](.)

- </div>

- ## Executive Summary
-
- This repository contains a BitNet 1.58-bit ternary quantization of Microsoft's Phi-4 14B parameter model. The quantization achieves **6.4x compression** while maintaining model architecture compatibility, reducing storage requirements from 29.32 GB to 4.58 GB.
-
- **Key Benefits for Enterprise Deployment:**
- - **84% storage reduction** - Significant infrastructure cost savings
- - **Reduced memory bandwidth** - Faster inference on memory-bound systems
- - **Ternary weights** - Enable specialized hardware acceleration (BitNet kernels)
- - **GGUF format included** - Direct compatibility with llama.cpp and Ollama
-
- ---
-
- ## Model Specifications
-
- ### Quantization Summary

  | Property | Value |
  |----------|-------|
- | **Base Model** | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
- | **Architecture** | Phi-3 (Phi3ForCausalLM) |
- | **Parameters** | 14B |
- | **Quantization Method** | BitNet 1.58-bit ({-1, 0, +1}) |
- | **Bits per Weight** | 1.58 |
- | **Group Size** | 64 |
- | **Original Size** | 29.32 GB (BF16) |
- | **Quantized Size** | 4.58 GB (SafeTensors) / 5.57 GB (GGUF TQ2_0) |
- | **Compression Ratio** | 6.40x |
-
- ### Available Formats
-
- | Format | File | Size | Use Case |
- |--------|------|------|----------|
- | SafeTensors (Sharded) | `model-*.safetensors` | 4.58 GB | Custom BitNet inference engines |
- | GGUF TQ2_0 | `phi4-tq2.gguf` | 5.57 GB | llama.cpp, Ollama, LM Studio |
-
- ---
-
- ## Technical Implementation
-
- ### Quantization Algorithm
-
- The BitNet 1.58-bit quantization implements the following process:
-
- ```
- 1. Group Partitioning: W_grouped = reshape(W, [out_features, in_features/group_size, group_size])
- 2. Scale Computation: scale[g] = mean(|W_grouped[:, g, :]|) + eps
- 3. Normalization: W_norm = W_grouped / scale
- 4. Ternary Quantization: Q = clip(round(W_norm), -1, +1)
- 5. Value Mapping: Q_stored = Q + 1 → {-1,0,+1} → {0,1,2}
- 6. Bit Packing: 4 ternary values per byte using base-3 encoding
- ```
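Steps 1-5 of the process above can be sketched in NumPy. This is illustrative only — the actual quantizer is the custom Rust tool described below, and the function and variable names here are assumptions, not the tool's API:

```python
import numpy as np

def bitnet_quantize(w, group_size=64, eps=1e-6):
    """Sketch of grouped absmean ternary quantization (steps 1-5 above)."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)  # 1. group partitioning
    scale = np.abs(g).mean(axis=-1, keepdims=True) + eps                # 2. absmean scale per group
    q = np.clip(np.round(g / scale), -1, 1).astype(np.int8)            # 3-4. normalize, ternary round
    return q + 1, scale.squeeze(-1)                                     # 5. map {-1,0,+1} -> {0,1,2}

# One 4-weight group as a toy example:
q, s = bitnet_quantize(np.array([[0.9, -0.05, -1.1, 0.4]], dtype=np.float32), group_size=4)
```

Note that the clip in step 4 saturates outliers (here -1.1 normalizes to about -1.8 and rounds to -2, which clips to -1).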
-
- ### Ternary Encoding Scheme
-
- ```
- Packing Formula: packed_byte = v0 + v1×3 + v2×9 + v3×27
-
- Example:
-   Values: [-1, 0, +1, -1] → [0, 1, 2, 0]
-   Packed: 0 + 1×3 + 2×9 + 0×27 = 21
-
- Valid Range: 0-80 (fits in a single byte)
- Storage Efficiency: log₂(3⁴) ≈ 6.34 bits for 4 values ≈ 1.58 bits/weight
- ```
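The packing formula can be written as a small helper. `pack_ternary` is a hypothetical name for illustration, not a function from the repository:

```python
def pack_ternary(values):
    """Pack 4 ternary values {-1, 0, +1} into one byte using base-3 encoding."""
    assert len(values) == 4
    byte = 0
    for i, v in enumerate(values):
        byte += (v + 1) * 3 ** i  # map {-1,0,+1} -> {0,1,2}, then weight by 3^i
    return byte

pack_ternary([-1, 0, 1, -1])  # the worked example above: 0 + 1*3 + 2*9 + 0*27 = 21
```

The maximum packed value is 2 + 2×3 + 2×9 + 2×27 = 80, matching the valid range stated above.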
-
- ### File Structure
-
- ```
- phi-4-bitnet-1.58b/
- ├── model-00001-of-00006.safetensors   # Quantized weights (packed U8) + scales (F32)
- ├── model-00002-of-00006.safetensors
- ├── model-00003-of-00006.safetensors
- ├── model-00004-of-00006.safetensors
- ├── model-00005-of-00006.safetensors
- ├── model-00006-of-00006.safetensors
- ├── model.safetensors.index.json       # Shard index
- ├── phi4-tq2.gguf                      # GGUF ternary format
- ├── config.json                        # Model configuration
- ├── tokenizer.json                     # HuggingFace tokenizer
- ├── tokenizer_config.json
- └── README.md
- ```
-
- ---
-
- ## Quantization Tooling & Methodology
-
- ### Tools Used
-
- | Component | Tool/Library | Version | Purpose |
- |-----------|--------------|---------|---------|
- | **Quantization Engine** | Custom Rust implementation | - | BitNet 1.58-bit quantization |
- | **Tensor Library** | [Candle](https://github.com/huggingface/candle) | 0.8+ | GPU-accelerated tensor operations |
- | **GGUF Conversion** | [llama.cpp](https://github.com/ggerganov/llama.cpp) | Latest | TQ2_0 ternary GGUF format |
- | **Serialization** | SafeTensors | 0.4+ | Efficient weight storage |
-
- ### Quantization Process
-
- ```bash
- # BitNet quantization (custom Rust tool)
- ./quantize \
-   --model microsoft/phi-4 \
-   --method bitnet \
-   --group-size 64 \
-   --output ./phi4-bitnet-1.58b \
-   --format safetensors \
-   --streaming enabled
-
- # GGUF conversion (llama.cpp)
- python convert_hf_to_gguf.py ./phi4-bitnet-1.58b \
-   --outtype tq2_0 \
-   --outfile phi4-tq2.gguf
- ```
-
- ### Hardware Requirements
-
- **Quantization was performed on:**
- - **GPU**: NVIDIA RTX 5080 (16GB VRAM)
- - **Quantization Time**: ~100 seconds
- - **Processing Rate**: 2.4 tensors/second
- - **Peak Memory**: <8GB VRAM (streaming mode with CPU fallback for large tensors)
-
- **Memory Management:**
- - Tensors >3GB automatically fall back to CPU processing
- - Streaming quantization processes one shard at a time
- - Enables quantization of models larger than GPU memory
-
- ---
 
  ## Usage

- ### With llama.cpp / Ollama
-
  ```bash
- # Using Ollama
- ollama create phi4-bitnet -f Modelfile
- ollama run phi4-bitnet
-
- # Using llama.cpp directly
- ./llama-cli -m phi4-tq2.gguf -p "Write a Python function to sort a list:"
  ```

- ### Custom BitNet Inference
-
  ```python
- from safetensors import safe_open
- import numpy as np
-
  def unpack_ternary(packed_byte):
-     """Unpack 4 ternary values from a byte."""
      values = []
      val = packed_byte
      for _ in range(4):
          values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
          val //= 3
      return values
-
- # Load quantized weights
- with safe_open("model-00001-of-00006.safetensors", framework="np") as f:
-     packed_weights = f.get_tensor("model.layers.0.self_attn.q_proj.weight")
-     scales = f.get_tensor("model.layers.0.self_attn.q_proj.weight.scales")
-
- # Dequantize: unpack and multiply by scales
- # ... implementation specific to inference engine
  ```
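One way the dequantization step could look, assuming row-major packing and one F32 scale per group — the exact shard layout is not documented here, so treat this as a sketch rather than the repository's method:

```python
import numpy as np

def dequantize(packed, scales, group_size=64):
    """Unpack base-3 packed bytes to ternary values and rescale per group (layout assumed)."""
    flat = packed.astype(np.int64).ravel()
    digits = np.stack([(flat // 3**i) % 3 for i in range(4)], axis=-1)  # 4 base-3 digits per byte
    ternary = (digits - 1).astype(np.float32).reshape(-1, group_size)   # {0,1,2} -> {-1,0,+1}
    return ternary * scales.reshape(-1, 1)                              # per-group rescale

# Toy check: byte 21 encodes [-1, 0, +1, -1]; with a scale of 0.5 for a 4-value group:
w = dequantize(np.array([21], dtype=np.uint8), np.array([0.5], dtype=np.float32), group_size=4)
```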

- ---
-
- ## Performance Considerations
-
- ### Memory Bandwidth Analysis
-
- | Format | Memory Footprint | Relative Bandwidth |
- |--------|------------------|--------------------|
- | BF16 (Original) | 29.32 GB | 1.0x |
- | BitNet 1.58b (This) | 4.58 GB | **6.4x reduction** |
- | GGUF TQ2_0 | 5.57 GB | 5.3x reduction |
-
- ### Inference Optimization
-
- For optimal performance:
- 1. **BitNet Kernels**: Use specialized ternary matrix multiplication (no floating-point multiply)
- 2. **Memory Mapping**: GGUF format supports mmap for efficient loading
- 3. **Batch Processing**: Amortize dequantization overhead across the batch dimension
-
- ---
-
- ## Limitations & Considerations

- 1. **Quality Degradation**: Extreme quantization may impact model quality for complex tasks
- 2. **Custom Runtime Required**: The standard transformers library doesn't support ternary inference
- 3. **Experimental**: Designed for research and edge deployment experimentation
- 4. **Evaluation Pending**: Formal benchmark comparisons are in progress
-
- ---
 
  ## License

- This quantized model inherits the license from the base [Microsoft Phi-4 model](https://huggingface.co/microsoft/phi-4) (MIT License).
-
- ---

  ## Citation

  ```bibtex
  @misc{phi4-bitnet-2025,
-   title={Phi-4-BitNet-1.58b: Production BitNet Quantization of Phi-4},
    author={Tzervas},
    year={2025},
-   url={https://huggingface.co/tzervas/phi-4-bitnet-1.58b},
-   note={BitNet 1.58-bit ternary quantization achieving 6.4x compression}
  }
  ```
-
- ---
-
- ## Acknowledgments
-
- - **Microsoft Research** for the Phi-4 base model
- - **BitNet** paper authors for the 1.58-bit quantization methodology
- - **Hugging Face Candle** for the Rust ML framework
- - **llama.cpp** for the GGUF format and inference engine
 
  - ternary
  - 1.58-bit
  - phi-4
+ - experimental
  library_name: safetensors
  pipeline_tag: text-generation
  language:
  - en
  ---

  # Phi-4-BitNet-1.58b

+ BitNet 1.58-bit ternary quantization of Microsoft's Phi-4 14B model.

+ ## Overview

+ This is an **experimental** BitNet 1.58-bit quantization of the Phi-4 model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}) packed 4 values per byte.

+ **This is research/experimental work. Quality and performance have not been formally benchmarked.**

+ ## Specifications

  | Property | Value |
  |----------|-------|
+ | Base Model | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
+ | Architecture | Phi-3 (Phi3ForCausalLM) |
+ | Parameters | 14B |
+ | Quantization | BitNet 1.58-bit ternary |
+ | Bits per Weight | ~1.58 |
+ | Group Size | 64 |
+ | Original Size | 29.32 GB (BF16) |
+ | Quantized Size | 4.58 GB (SafeTensors) |
+ | GGUF Size | 5.57 GB (TQ2_0) |
+ | Compression | ~6.4x |
+
+ ## Formats
+
+ | Format | File | Description |
+ |--------|------|-------------|
+ | SafeTensors | `model-*.safetensors` | Sharded quantized weights + scales |
+ | GGUF | `phi4-tq2.gguf` | llama.cpp compatible |
+
+ ## Quantization Method
+
+ ### Algorithm
+ 1. Reshape weights into groups of 64
+ 2. Compute per-group scale: `scale = mean(|weights|)`
+ 3. Normalize and round to nearest ternary: `q = round(w / scale)` clamped to {-1, 0, +1}
+ 4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
+ 5. Pack 4 values per byte: `v0 + v1*3 + v2*9 + v3*27`
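The five steps can be sketched for a single group in plain Python — a compact, standalone illustration with assumed names, not the Rust tool's implementation:

```python
def quantize_group(weights, eps=1e-6):
    """Sketch of steps 1-5 for one group of weights (illustrative only)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps      # step 2: absmean scale
    q = [max(-1, min(1, round(w / scale))) for w in weights]       # step 3: ternary round + clamp
    u = [v + 1 for v in q]                                         # step 4: {-1,0,+1} -> {0,1,2}
    packed = [sum(u[i + j] * 3**j for j in range(4))               # step 5: 4 values per byte
              for i in range(0, len(u), 4)]
    return packed, scale

packed, scale = quantize_group([0.9, -0.05, -1.1, 0.4])
```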
+
+ ### Tooling
+ - **Quantization**: Custom Rust tool using [Candle](https://github.com/huggingface/candle)
+ - **GGUF Conversion**: [llama.cpp](https://github.com/ggerganov/llama.cpp) `convert_hf_to_gguf.py`
+
+ ### Hardware Used
+ - GPU: NVIDIA RTX 5080 (16GB VRAM)
+ - Quantization time: ~100 seconds
+ - Memory: Streaming mode with CPU fallback for large tensors

  ## Usage

+ ### With Ollama/llama.cpp

  ```bash
+ # llama.cpp
+ ./llama-cli -m phi4-tq2.gguf -p "Your prompt here"
  ```

+ ### Unpacking Weights (Python)

  ```python
  def unpack_ternary(packed_byte):
+     """Unpack 4 ternary values from byte."""
      values = []
      val = packed_byte
      for _ in range(4):
          values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
          val //= 3
      return values
  ```
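A standalone round-trip of the decoder, multiplying by a hypothetical per-group scale of 0.25 (the decoder is repeated here so the snippet runs on its own):

```python
def unpack_ternary(packed_byte):
    """Decode 4 ternary values from one base-3 packed byte (same logic as above)."""
    values, val = [], packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0,1,2} -> {-1,0,+1}
        val //= 3
    return values

# Byte 21 encodes [-1, 0, +1, -1]; multiply by the group's F32 scale to dequantize:
weights = [t * 0.25 for t in unpack_ternary(21)]
```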

+ ## Limitations

+ - **Quality not benchmarked** - May show significant degradation vs the original model
+ - **Requires custom runtime** - The standard transformers library doesn't support ternary weights
+ - **Experimental** - Not intended for production use without evaluation
+ - GGUF keeps embeddings/lm_head at F16, hence larger than the SafeTensors export
 

  ## License

+ MIT License (inherited from [microsoft/phi-4](https://huggingface.co/microsoft/phi-4))

  ## Citation

  ```bibtex
  @misc{phi4-bitnet-2025,
+   title={Phi-4-BitNet-1.58b: Experimental BitNet Quantization of Phi-4},
    author={Tzervas},
    year={2025},
+   url={https://huggingface.co/tzervas/phi-4-bitnet-1.58b}
  }
  ```