---
language:
- en
- zh
license: mit
library_name: transformers
tags:
- ocr
- quantization
- mbq
- deepseek
- vision-language
- standalone
base_model: deepseek-ai/DeepSeek-OCR
---

# DeepSeek-OCR MBQ Quantized Model (Standalone)

This is a **fully standalone** quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) using **MBQ (mixed-precision post-training quantization)**.

✨ **No need to download the original model** - all architecture files are included!

## Model Details

- **Base Model**: deepseek-ai/DeepSeek-OCR
- **Quantization Method**: MBQ (mixed-precision post-training quantization)
- **Weight Precision**: 4-bit (mixed with 8-bit for sensitive layers)
- **Activation Precision**: 8-bit
- **Format**: SafeTensors (int8-quantized weights with separate scales)
- **Standalone**: All architecture files included ✅

## Quantization Statistics

| Metric | Value |
|--------|-------|
| Original Size | 6,672 MB (6.67 GB) |
| **Quantized Size** | **3,510 MB (3.51 GB)** |
| **Size Reduction** | **3,162 MB (47.4%)** |
| **Compression Ratio** | **1.90x** |

## Quick Start (Standalone - No Original Model Needed!)

### Installation

```bash
pip install torch transformers safetensors accelerate pillow
```

### Simple Loading (Recommended)

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer directly - all files are included!
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load the quantized weights using the helper script.
# Note: load_mbq_model.py and model.safetensors must be available locally
# (e.g., clone or download this repo into the current directory first).
from load_mbq_model import load_mbq_model

state_dict = load_mbq_model("./")
model.load_state_dict(state_dict)
model = model.to(device).eval()

print("✅ Model loaded successfully!")
```

### Manual Loading with Dequantization

```python
import torch
from transformers import AutoTokenizer, AutoModel
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True
)

# Load quantized weights
state_dict = load_file("model.safetensors")

# Separate weights and scales
weights = {}
scales = {}
for name, param in state_dict.items():
    if '.scale' in name:
        scales[name.replace('.scale', '')] = param
    else:
        weights[name] = param

# Dequantize weights
dequantized_state_dict = {}
for name, param in weights.items():
    if name in scales:
        scale = scales[name]
        dequantized = (param.float() * scale).to(torch.bfloat16)
        dequantized_state_dict[name] = dequantized
    else:
        dequantized_state_dict[name] = param

# Load the model architecture (included in this repo!)
model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load the dequantized weights
model.load_state_dict(dequantized_state_dict)
model = model.to(device).eval()

print("✅ Model loaded successfully!")
```
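### Running OCR (Example)

Once the model and tokenizer are loaded with either method above, you can run OCR on an image. The sketch below assumes the bundled remote code exposes the same `infer` helper as the original DeepSeek-OCR model card; argument names and defaults may differ slightly, and the prompt, image path, and output directory are placeholders to adapt.

```python
# Minimal OCR sketch - assumes the bundled remote code keeps the original
# DeepSeek-OCR `infer` helper; check modeling_deepseekocr.py if arguments differ.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = "your_image.jpg"   # placeholder: path to the page/image to OCR
output_path = "./ocr_output"    # placeholder: directory for saved results

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,      # resolution settings taken from the base model card
    image_size=640,
    crop_mode=True,
    save_results=True,
)
print(result)
```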
## Model Files

### Core Files

- **model.safetensors** (3.51 GB): Quantized model weights (int8 + scales)
- **load_mbq_model.py**: Helper script for loading

### Architecture Files (from the original model)

- **modeling_deepseekocr.py**: Main model architecture
- **modeling_deepseekv2.py**: DeepSeek V2 backbone
- **configuration_deepseek_v2.py**: Model configuration
- **deepencoder.py**: Vision encoder
- **conversation.py**: Conversation utilities
- **processor_config.json**: Processor configuration

### Tokenizer & Config

- **tokenizer.json**: Tokenizer vocabulary
- **tokenizer_config.json**: Tokenizer configuration
- **config.json**: Model configuration
- **special_tokens_map.json**: Special tokens

### Metadata

- **quantization_metadata.json**: Quantization details
- **quantization_report.json**: Compression statistics

## Advantages

- ✅ **Standalone**: All files included, no need to download the original model
- ✅ **Smaller Size**: 47% reduction in model size
- ✅ **Easy Loading**: Simple `AutoModel.from_pretrained()` with `trust_remote_code=True`
- ✅ **Compatible**: Works with the standard transformers library
- ✅ **Preserved Quality**: Mixed precision maintains model performance

## MBQ Methodology

MBQ (mixed-precision post-training quantization) allocates different bit-widths to layers based on their sensitivity:

1. **Sensitivity Analysis**: Computes per-layer sensitivity scores using a Hessian approximation
2. **Mixed Precision**: High-sensitivity layers (top 15%) → 8-bit, all others → 4-bit
3. **Symmetric Quantization**: Efficient symmetric scheme for weights and activations
4. **Storage**: Weights stored as int8 with separate scale factors for true compression (a minimal sketch of this scheme appears in the appendix at the end of this card)

## Performance

- **Memory Usage**: Reduced by 47.4%
- **Model Size**: From 6.67 GB to 3.51 GB
- **Standalone**: No dependency on the original model repo ✅
- **Inference**: Lower memory footprint, faster loading

## Citation

If you use this quantized model, please cite:

```bibtex
@misc{deepseek-ocr-mbq,
  author = {SamMikaelson},
  title = {DeepSeek-OCR MBQ Quantized Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SamMikaelson/deepseek-ocr-mbq-w4bit}}
}
```

Original model:

```bibtex
@misc{deepseek-ocr,
  title = {DeepSeek-OCR},
  author = {DeepSeek-AI},
  year = {2024},
  howpublished = {\url{https://huggingface.co/deepseek-ai/DeepSeek-OCR}}
}
```

## License

MIT License (same as the base model)

## Troubleshooting

If you encounter issues loading the model:

1. Ensure `trust_remote_code=True` is set
2. Install the required packages: `pip install -r requirements.txt`
3. Check that you are using transformers >= 4.40.0
4. Use the provided `load_mbq_model.py` helper script

For questions or issues, please open an issue on the model repository.
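## Appendix: Weight Quantization Sketch

For reference, the sketch below illustrates the symmetric int8 storage scheme described in the MBQ Methodology section (int8 values plus a separate scale, dequantized exactly as in the manual-loading code above). It is a minimal illustration, not the actual quantization script: per-output-channel scaling is an assumption, and the 4-bit path and activation quantization are omitted.

```python
import torch

def quantize_symmetric_int8(weight: torch.Tensor):
    """Symmetric int8 quantization of a 2-D weight tensor.

    Assumes one scale per output channel (row); the actual MBQ pipeline
    may use a different granularity.
    """
    # Choose the scale so the largest magnitude in each row maps to 127
    max_abs = weight.abs().amax(dim=1, keepdim=True)
    scale = max_abs.clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Mirrors the manual-loading code: int8 * scale -> bfloat16
    return (q.float() * scale).to(torch.bfloat16)

# Round-trip example on a random weight matrix
w = torch.randn(8, 16)
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs reconstruction error:", (w - w_hat.float()).abs().max().item())
```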