DeepSeek-OCR MBQ Quantized Model (Standalone)

This is a fully standalone quantized version of deepseek-ai/DeepSeek-OCR using MBQ (Mixed-precision post-training quantization).

No need to download the original model - all architecture files included!

Model Details

  • Base Model: deepseek-ai/DeepSeek-OCR
  • Quantization Method: MBQ (mixed-precision post-training quantization)
  • Weight Precision: 4-bit (mixed with 8-bit for sensitive layers)
  • Activation Precision: 8-bit
  • Format: SafeTensors (int8 quantized with scales)
  • Standalone: All architecture files included ✅

Quantization Statistics

Metric              Value
Original Size       6,672 MB (6.67 GB)
Quantized Size      3,510 MB (3.51 GB)
Size Reduction      3,162 MB (47.4%)
Compression Ratio   1.90x

Quick Start (Standalone - No Original Model Needed!)

Installation

pip install torch transformers safetensors accelerate pillow

Simple Loading (Recommended)

import torch
from transformers import AutoTokenizer, AutoModel

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer directly - all files included!
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True
)

model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load the quantized weights using the bundled helper.
# Note: this import requires load_mbq_model.py (and model.safetensors) to be
# present locally - download the repo first, e.g. via
# huggingface_hub.snapshot_download or git clone.
from load_mbq_model import load_mbq_model
state_dict = load_mbq_model("./")  # path to the downloaded repo directory

model.load_state_dict(state_dict)
model = model.to(device).eval()

print("✅ Model loaded successfully!")

Manual Loading with Dequantization

import torch
from transformers import AutoTokenizer, AutoModel
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True
)

# Load the quantized weights (model.safetensors must be on disk;
# see the download note after this example)
state_dict = load_file("model.safetensors")

# Separate weights and scales
weights = {}
scales = {}

for name, param in state_dict.items():
    if name.endswith('.scale'):
        # "layer.weight.scale" holds the scale factor for "layer.weight"
        scales[name[:-len('.scale')]] = param
    else:
        weights[name] = param

# Dequantize weights
dequantized_state_dict = {}
for name, param in weights.items():
    if name in scales:
        scale = scales[name]
        dequantized = (param.float() * scale).to(torch.bfloat16)
        dequantized_state_dict[name] = dequantized
    else:
        dequantized_state_dict[name] = param

# Load model architecture (included in this repo!)
model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load the quantized weights
model.load_state_dict(dequantized_state_dict)
model = model.to(device).eval()

print("✅ Model loaded successfully!")

Model Files

Core Files

  • model.safetensors (3.51 GB): Quantized model weights (int8 + scales)
  • load_mbq_model.py: Helper script for loading

Architecture Files (from original model)

  • modeling_deepseekocr.py: Main model architecture
  • modeling_deepseekv2.py: DeepSeek V2 backbone
  • configuration_deepseek_v2.py: Model configuration
  • deepencoder.py: Vision encoder
  • conversation.py: Conversation utilities
  • processor_config.json: Processor configuration

Tokenizer & Config

  • tokenizer.json: Tokenizer vocabulary
  • tokenizer_config.json: Tokenizer configuration
  • config.json: Model configuration
  • special_tokens_map.json: Special tokens

Metadata

  • quantization_metadata.json: Quantization details
  • quantization_report.json: Compression statistics

Advantages

  • Standalone: All files included, no need to download the original model
  • Smaller Size: 47% reduction in model size
  • Easy Loading: Simple AutoModel.from_pretrained() with trust_remote_code=True
  • Compatible: Works with the standard transformers library
  • Preserved Quality: Mixed precision maintains model performance

MBQ Methodology

MBQ (mixed-precision post-training quantization) allocates different bit-widths to layers based on their sensitivity (a sketch of the scheme follows the list):

  1. Sensitivity Analysis: Computes sensitivity scores using Hessian approximation
  2. Mixed Precision: High-sensitivity layers (top 15%) → 8-bit, others → 4-bit
  3. Symmetric Quantization: Efficient quantization scheme for weights and activations
  4. Storage: Weights stored as int8 with separate scale factors for true compression
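
A minimal sketch of this scheme, assuming per-tensor symmetric quantization and a gradient-based (diagonal-Hessian) sensitivity proxy; the actual MBQ implementation may use per-channel scales or a different sensitivity estimate:

import torch

def symmetric_quantize(w: torch.Tensor, bits: int):
    # Symmetric quantization: map [-max|w|, +max|w|] onto the signed int range.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    # Stored as int8 even for 4-bit layers, matching this repo's format
    return q.to(torch.int8), scale

def assign_bit_widths(sensitivity: dict, high_frac: float = 0.15) -> dict:
    # The top 15% most sensitive layers get 8-bit, the rest 4-bit.
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    cutoff = max(1, round(len(ranked) * high_frac))
    return {name: (8 if i < cutoff else 4) for i, name in enumerate(ranked)}

# Sensitivity scores would come from a Hessian approximation
# (e.g. mean squared gradient over calibration data); dummy values here.
sensitivity = {"layer1.weight": 0.9, "layer2.weight": 0.1, "layer3.weight": 0.05}
bit_widths = assign_bit_widths(sensitivity)
# {'layer1.weight': 8, 'layer2.weight': 4, 'layer3.weight': 4}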

Performance

  • Memory Usage: Reduced by 47.4%
  • Model Size: From 6.67 GB to 3.51 GB
  • Standalone: No dependency on original model repo ✅
  • Inference: Lower memory footprint, faster loading

Citation

If you use this quantized model, please cite:

@misc{deepseek-ocr-mbq,
  author = {SamMikaelson},
  title = {DeepSeek-OCR MBQ Quantized Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SamMikaelson/deepseek-ocr-mbq-w4bit}}
}

Original model:

@misc{deepseek-ocr,
  title={DeepSeek-OCR},
  author={DeepSeek-AI},
  year={2024},
  howpublished={\url{https://huggingface.co/deepseek-ai/DeepSeek-OCR}}
}

License

MIT License (same as the base model)

Troubleshooting

If you encounter issues loading the model:

  1. Ensure trust_remote_code=True is set
  2. Install required packages: pip install -r requirements.txt
  3. Check that you're using transformers >= 4.40.0
  4. Use the provided load_mbq_model.py helper script (downloaded locally; see the snippet below)
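
To get every file in this repo (weights, configs, and load_mbq_model.py) onto disk in one call, a minimal sketch using huggingface_hub:

from huggingface_hub import snapshot_download

# Downloads the full repo and returns the local directory path
local_dir = snapshot_download("SamMikaelson/deepseek-ocr-mbq-w4bit")
print(local_dir)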

For questions or issues, please open an issue on the model repository.
