SamMikaelson committed
Commit 8b71211 · verified · 1 Parent(s): ec872d4

Add README.md

Files changed (1)
  1. README.md +129 -57
README.md CHANGED
@@ -10,14 +10,15 @@ tags:
  - mbq
  - deepseek
  - vision-language
  base_model: deepseek-ai/DeepSeek-OCR
  ---

- # DeepSeek-OCR MBQ Quantized Model

- This is a quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) using **MBQ (Mixed-precision post-training quantization)**.

- **Ready-to-use standalone model** with `model.safetensors` - no special loading code required!

  ## Model Details

@@ -25,96 +26,156 @@ This is a quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co
  - **Quantization Method**: MBQ (Mixed-precision Quantization)
  - **Weight Precision**: 4-bit (mixed with 8-bit for sensitive layers)
  - **Activation Precision**: 8-bit
- - **Preserve Ratio**: 15.0% of layers kept at 8-bit
- - **Format**: SafeTensors (bfloat16 dequantized for compatibility)

  ## Quantization Statistics

  | Metric | Value |
  |--------|-------|
- | Original Size | 6362.54 MB |
- | Quantized Size | 2223.75 MB |
- | **SafeTensors Size** | **3352.13 MB** |
- | **Size Reduction** | **4138.79 MB (65.05%)** |
- | **Compression Ratio** | **2.86x** |
- | Quantized Layers | 2342 |

- ## Usage

- ### Standard Loading (Recommended)
  ```python
- from transformers import AutoModel, AutoTokenizer

- # Load model directly - just like any other HF model!
- model = AutoModel.from_pretrained(
-     "SamMikaelson/deepseek-ocr-mbq-w4bit",
-     trust_remote_code=True,
-     torch_dtype="auto"
- )

  tokenizer = AutoTokenizer.from_pretrained(
      "SamMikaelson/deepseek-ocr-mbq-w4bit",
      trust_remote_code=True
  )

- # Use the model normally
- # model.eval()
- # outputs = model(inputs)
  ```

- ### Access Quantization Metadata
  ```python
  import torch

- # Load quantization info (optional - for analysis)
- quantized_info = torch.load("quantized_weights.pt", map_location="cpu")

- print(f"Compression ratio: {quantized_info['metadata']['stats']['compression_ratio']:.2f}x")
- print(f"Size reduction: {quantized_info['metadata']['stats']['size_reduction_percent']:.2f}%")
- print(f"Bit allocation: {quantized_info['metadata']['bit_allocation']}")
  ```

  ## Model Files

- - **model.safetensors**: Main model weights (dequantized to bfloat16 for compatibility)
- - **quantized_weights.pt**: Original quantized weights + metadata
  - **config.json**: Model configuration
- - **tokenizer files**: Tokenizer configuration
- - **quantization_report.json**: Detailed quantization statistics

- ## Quantization Configuration

- ```python
- {
-     'w_bit': 4,
-     'a_bit': 8,
-     'mixed_precision': True,
-     'sensitivity_metric': 'hessian',
-     'preserve_ratio': 0.15
- }
- ```

  ## MBQ Methodology

  MBQ (Mixed-precision post-training quantization) intelligently allocates different bit-widths to layers based on their sensitivity:

- 1. **Sensitivity Analysis**: Computes sensitivity scores using hessian metric
- 2. **Mixed Precision**: High-sensitivity layers (top 15.0%) → 8-bit, others → 4-bit
  3. **Symmetric Quantization**: Efficient quantization scheme for weights and activations
- 4. **Dequantization**: Weights stored as bfloat16 in safetensors for full compatibility

  ## Performance

- - **Memory Usage**: Reduced by 65.05%
- - **Model Size**: From 6362.54 MB to 3352.13 MB
- - **Compatibility**: Works with standard transformers library
- - **Inference**: Lower memory footprint, faster inference on resource-constrained devices
-
- ## Notes
-
- The model.safetensors file contains dequantized weights in bfloat16 format for maximum compatibility with the transformers library. While this is larger than the fully quantized version, it still achieves significant size reduction (65.05%) while maintaining ease of use.
-
- For the fully compressed quantized weights, see `quantized_weights.pt`.

  ## Citation

@@ -142,4 +203,15 @@ Original model:

  ## License

- Same as the base model: MIT License

  - mbq
  - deepseek
  - vision-language
+ - standalone
  base_model: deepseek-ai/DeepSeek-OCR
  ---

+ # DeepSeek-OCR MBQ Quantized Model (Standalone)

+ This is a **fully standalone** quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) using **MBQ (Mixed-precision post-training quantization)**.

+ **No need to download the original model** - all architecture files included!

  ## Model Details

  - **Quantization Method**: MBQ (Mixed-precision Quantization)
  - **Weight Precision**: 4-bit (mixed with 8-bit for sensitive layers)
  - **Activation Precision**: 8-bit
+ - **Format**: SafeTensors (int8 quantized with scales)
+ - **Standalone**: All architecture files included

  ## Quantization Statistics

  | Metric | Value |
  |--------|-------|
+ | Original Size | 6,672 MB (6.67 GB) |
+ | **Quantized Size** | **3,510 MB (3.51 GB)** |
+ | **Size Reduction** | **3,162 MB (47.4%)** |
+ | **Compression Ratio** | **1.90x** |
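As a quick cross-check, the figures in the table above are mutually consistent; a few lines of plain arithmetic on the two reported sizes (illustrative only, the authoritative numbers are in `quantization_report.json`):

```python
# Sanity-check the reported statistics from the two sizes alone.
original_mb = 6672
quantized_mb = 3510

reduction_mb = original_mb - quantized_mb           # 3162 MB
reduction_pct = 100 * reduction_mb / original_mb    # ~47.4%
compression_ratio = original_mb / quantized_mb      # ~1.90x

print(f"{reduction_mb} MB saved ({reduction_pct:.1f}%), {compression_ratio:.2f}x compression")
```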
 
 
+ ## Quick Start (Standalone - No Original Model Needed!)
+
+ ### Installation
+
+ ```bash
+ pip install torch transformers safetensors accelerate pillow
+ ```
+
+ ### Simple Loading (Recommended)

  ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel

+ # Device setup
+ device = "cuda" if torch.cuda.is_available() else "cpu"

+ # Load model and tokenizer directly - all files included!
  tokenizer = AutoTokenizer.from_pretrained(
      "SamMikaelson/deepseek-ocr-mbq-w4bit",
      trust_remote_code=True
  )

+ model = AutoModel.from_pretrained(
+     "SamMikaelson/deepseek-ocr-mbq-w4bit",
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16
+ )
+
+ # Load the quantized weights using the helper
+ from load_mbq_model import load_mbq_model
+ state_dict = load_mbq_model("./")  # Assumes files are in current directory
+
+ model.load_state_dict(state_dict)
+ model = model.to(device).eval()
+
+ print("✅ Model loaded successfully!")
  ```
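The `load_mbq_model.py` helper used above ships with the repository; its exact contents are not reproduced in this README. Assuming the int8-plus-`.scale` layout used in the manual example below, it presumably does something along these lines (a sketch under that assumption, not the actual script):

```python
import os

import torch
from safetensors.torch import load_file


def load_mbq_model(model_dir: str) -> dict:
    """Return a bfloat16 state_dict from an MBQ checkpoint directory.

    Sketch only: int8 tensors are multiplied by their matching '.scale'
    tensors; anything without a scale is passed through unchanged.
    """
    state_dict = load_file(os.path.join(model_dir, "model.safetensors"))

    scales = {k.replace(".scale", ""): v for k, v in state_dict.items() if ".scale" in k}
    weights = {k: v for k, v in state_dict.items() if ".scale" not in k}

    dequantized = {}
    for name, tensor in weights.items():
        if name in scales:
            dequantized[name] = (tensor.float() * scales[name]).to(torch.bfloat16)
        else:
            dequantized[name] = tensor
    return dequantized
```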

+ ### Manual Loading with Dequantization
+
  ```python
  import torch
+ from transformers import AutoTokenizer, AutoModel
+ from safetensors.torch import load_file

+ device = "cuda" if torch.cuda.is_available() else "cpu"

+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(
+     "SamMikaelson/deepseek-ocr-mbq-w4bit",
+     trust_remote_code=True
+ )
+
+ # Load quantized weights
+ state_dict = load_file("model.safetensors")
+
+ # Separate weights and scales
+ weights = {}
+ scales = {}
+
+ for name, param in state_dict.items():
+     if '.scale' in name:
+         scales[name.replace('.scale', '')] = param
+     else:
+         weights[name] = param
+
+ # Dequantize weights
+ dequantized_state_dict = {}
+ for name, param in weights.items():
+     if name in scales:
+         scale = scales[name]
+         dequantized = (param.float() * scale).to(torch.bfloat16)
+         dequantized_state_dict[name] = dequantized
+     else:
+         dequantized_state_dict[name] = param
+
+ # Load model architecture (included in this repo!)
+ model = AutoModel.from_pretrained(
+     "SamMikaelson/deepseek-ocr-mbq-w4bit",
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16
+ )
+
+ # Load the dequantized weights
+ model.load_state_dict(dequantized_state_dict)
+ model = model.to(device).eval()
+
+ print("✅ Model loaded successfully!")
  ```

  ## Model Files

+ ### Core Files
+ - **model.safetensors** (3.51 GB): Quantized model weights (int8 + scales; see the inspection sketch after this section)
+ - **load_mbq_model.py**: Helper script for loading
+
+ ### Architecture Files (from original model)
+ - **modeling_deepseekocr.py**: Main model architecture
+ - **modeling_deepseekv2.py**: DeepSeek V2 backbone
+ - **configuration_deepseek_v2.py**: Model configuration
+ - **deepencoder.py**: Vision encoder
+ - **conversation.py**: Conversation utilities
+ - **processor_config.json**: Processor configuration
+
+ ### Tokenizer & Config
+ - **tokenizer.json**: Tokenizer vocabulary
+ - **tokenizer_config.json**: Tokenizer configuration
  - **config.json**: Model configuration
+ - **special_tokens_map.json**: Special tokens
+
+ ### Metadata
+ - **quantization_metadata.json**: Quantization details
+ - **quantization_report.json**: Compression statistics
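To verify the storage format, you can list the tensors stored in `model.safetensors`; quantized weights should show up as int8 tensors paired with floating-point `.scale` tensors (a small sketch, assuming the file sits in the current directory):

```python
from safetensors import safe_open

# List every tensor with its dtype and shape.
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        print(f"{key}: dtype={tensor.dtype}, shape={tuple(tensor.shape)}")
```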

+ ## Advantages
+
+ ✅ **Standalone**: All files included, no need to download the original model
+ ✅ **Smaller Size**: 47% reduction in model size
+ ✅ **Easy Loading**: Simple `AutoModel.from_pretrained()` with `trust_remote_code=True`
+ ✅ **Compatible**: Works with the standard transformers library
+ ✅ **Preserved Quality**: Mixed precision maintains model performance

  ## MBQ Methodology

  MBQ (Mixed-precision post-training quantization) intelligently allocates different bit-widths to layers based on their sensitivity:

+ 1. **Sensitivity Analysis**: Computes sensitivity scores using a Hessian approximation
+ 2. **Mixed Precision**: High-sensitivity layers (top 15%) → 8-bit, others → 4-bit
  3. **Symmetric Quantization**: Efficient quantization scheme for weights and activations
+ 4. **Storage**: Weights stored as int8 with separate scale factors for true compression (see the sketch below)
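A minimal sketch of steps 2–4: per-tensor symmetric quantization with a stored scale, plus a simple threshold-based bit split. This is illustrative only; the actual MBQ pipeline (Hessian scoring, calibration, 4-bit packing) is not reproduced here, and the layer names and scores below are made up.

```python
import torch


def symmetric_quantize(weight: torch.Tensor, n_bits: int):
    """Symmetric quantization: one scale per tensor, signed integer codes."""
    qmax = 2 ** (n_bits - 1) - 1                     # 127 for 8-bit, 7 for 4-bit
    scale = weight.abs().max() / qmax
    codes = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return codes.to(torch.int8), scale               # 4-bit codes still ride in an int8 container


def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (codes.float() * scale).to(torch.bfloat16)


# Mixed precision: keep the ~15% most sensitive layers at 8-bit, the rest at 4-bit.
sensitivity = {"layer_a": 0.90, "layer_b": 0.10, "layer_c": 0.05}   # toy Hessian-style scores
n_keep = max(1, round(0.15 * len(sensitivity)))
keep_8bit = set(sorted(sensitivity, key=sensitivity.get, reverse=True)[:n_keep])
bits = {name: 8 if name in keep_8bit else 4 for name in sensitivity}

w = torch.randn(256, 256)
codes, scale = symmetric_quantize(w, bits["layer_b"])
w_approx = dequantize(codes, scale)                  # this is what loading does at runtime
```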

  ## Performance

+ - **Memory Usage**: Reduced by 47.4%
+ - **Model Size**: From 6.67 GB to 3.51 GB
+ - **Standalone**: No dependency on the original model repo ✅
+ - **Inference**: Lower memory footprint, faster loading

  ## Citation

  ## License

+ MIT License (same as the base model)
+
+ ## Troubleshooting
+
+ If you encounter issues loading the model:
+
+ 1. Ensure `trust_remote_code=True` is set
+ 2. Install required packages: `pip install -r requirements.txt`
+ 3. Check that you're using transformers >= 4.40.0 (see the version check below)
+ 4. Use the provided `load_mbq_model.py` helper script
+ 4. Use the provided `load_mbq_model.py` helper script
216
+
217
+ For questions or issues, please open an issue on the model repository.