DeepSeek-OCR MBQ Quantized Model (Standalone)

This is a fully standalone quantized version of deepseek-ai/DeepSeek-OCR using MBQ (Mixed-precision post-training quantization).

No need to download the original model - all architecture files included!

Model Details

  • Base Model: deepseek-ai/DeepSeek-OCR
  • Quantization Method: MBQ (mixed-precision post-training quantization)
  • Weight Precision: 4-bit (mixed with 8-bit for sensitive layers)
  • Activation Precision: 8-bit
  • Format: SafeTensors (int8 quantized with scales)
  • Standalone: All architecture files included ✅

Quantization Statistics

Metric              Value
Original Size       6,672 MB (6.67 GB)
Quantized Size      3,510 MB (3.51 GB)
Size Reduction      3,162 MB (47.4%)
Compression Ratio   1.90x

Quick Start (Standalone - No Original Model Needed!)

Installation

pip install torch transformers safetensors accelerate pillow

Simple Loading (Recommended)

import torch
from transformers import AutoTokenizer, AutoModel

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer directly - all files included!
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True
)

model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load the quantized weights using the bundled helper.
# Note: this import requires load_mbq_model.py (and model.safetensors) to be
# present locally - download the repo first, e.g. via
# huggingface_hub.snapshot_download or git clone.
from load_mbq_model import load_mbq_model
state_dict = load_mbq_model("./")  # path to the downloaded repo directory

model.load_state_dict(state_dict)
model = model.to(device).eval()

print("✅ Model loaded successfully!")

Manual Loading with Dequantization

import torch
from transformers import AutoTokenizer, AutoModel
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True
)

# Load the quantized weights (model.safetensors must be on disk;
# see the download note after this example)
state_dict = load_file("model.safetensors")

# Separate weights and scales
weights = {}
scales = {}

for name, param in state_dict.items():
    if name.endswith('.scale'):
        # "layer.weight.scale" holds the scale factor for "layer.weight"
        scales[name[:-len('.scale')]] = param
    else:
        weights[name] = param

# Dequantize weights
dequantized_state_dict = {}
for name, param in weights.items():
    if name in scales:
        scale = scales[name]
        dequantized = (param.float() * scale).to(torch.bfloat16)
        dequantized_state_dict[name] = dequantized
    else:
        dequantized_state_dict[name] = param

# Load model architecture (included in this repo!)
model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load the quantized weights
model.load_state_dict(dequantized_state_dict)
model = model.to(device).eval()

print("✅ Model loaded successfully!")

Model Files

Core Files

  • model.safetensors (3.51 GB): Quantized model weights (int8 + scales)
  • load_mbq_model.py: Helper script for loading

Architecture Files (from original model)

  • modeling_deepseekocr.py: Main model architecture
  • modeling_deepseekv2.py: DeepSeek V2 backbone
  • configuration_deepseek_v2.py: Model configuration
  • deepencoder.py: Vision encoder
  • conversation.py: Conversation utilities
  • processor_config.json: Processor configuration

Tokenizer & Config

  • tokenizer.json: Tokenizer vocabulary
  • tokenizer_config.json: Tokenizer configuration
  • config.json: Model configuration
  • special_tokens_map.json: Special tokens

Metadata

  • quantization_metadata.json: Quantization details
  • quantization_report.json: Compression statistics

Advantages

  • Standalone: All files included, no need to download the original model
  • Smaller Size: 47% reduction in model size
  • Easy Loading: Simple AutoModel.from_pretrained() with trust_remote_code=True
  • Compatible: Works with the standard transformers library
  • Preserved Quality: Mixed precision maintains model performance

MBQ Methodology

MBQ (mixed-precision post-training quantization) allocates different bit-widths to layers based on their sensitivity (a sketch of the scheme follows the list):

  1. Sensitivity Analysis: Computes sensitivity scores using Hessian approximation
  2. Mixed Precision: High-sensitivity layers (top 15%) → 8-bit, others → 4-bit
  3. Symmetric Quantization: Efficient quantization scheme for weights and activations
  4. Storage: Weights stored as int8 with separate scale factors for true compression
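
A minimal sketch of this scheme, assuming per-tensor symmetric quantization and a gradient-based (diagonal-Hessian) sensitivity proxy; the actual MBQ implementation may use per-channel scales or a different sensitivity estimate:

import torch

def symmetric_quantize(w: torch.Tensor, bits: int):
    # Symmetric quantization: map [-max|w|, +max|w|] onto the signed int range.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    # Stored as int8 even for 4-bit layers, matching this repo's format
    return q.to(torch.int8), scale

def assign_bit_widths(sensitivity: dict, high_frac: float = 0.15) -> dict:
    # The top 15% most sensitive layers get 8-bit, the rest 4-bit.
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    cutoff = max(1, round(len(ranked) * high_frac))
    return {name: (8 if i < cutoff else 4) for i, name in enumerate(ranked)}

# Sensitivity scores would come from a Hessian approximation
# (e.g. mean squared gradient over calibration data); dummy values here.
sensitivity = {"layer1.weight": 0.9, "layer2.weight": 0.1, "layer3.weight": 0.05}
bit_widths = assign_bit_widths(sensitivity)
# {'layer1.weight': 8, 'layer2.weight': 4, 'layer3.weight': 4}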

Performance

  • Memory Usage: Reduced by 47.4%
  • Model Size: From 6.67 GB to 3.51 GB
  • Standalone: No dependency on original model repo ✅
  • Inference: Lower memory footprint, faster loading

Citation

If you use this quantized model, please cite:

@misc{deepseek-ocr-mbq,
  author = {SamMikaelson},
  title = {DeepSeek-OCR MBQ Quantized Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SamMikaelson/deepseek-ocr-mbq-w4bit}}
}

Original model:

@misc{deepseek-ocr,
  title={DeepSeek-OCR},
  author={DeepSeek-AI},
  year={2024},
  howpublished={\url{https://huggingface.co/deepseek-ai/DeepSeek-OCR}}
}

License

MIT License (same as the base model)

Troubleshooting

If you encounter issues loading the model:

  1. Ensure trust_remote_code=True is set
  2. Install required packages: pip install -r requirements.txt
  3. Check that you're using transformers >= 4.40.0
  4. Use the provided load_mbq_model.py helper script (downloaded locally; see the snippet below)
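
To get every file in this repo (weights, configs, and load_mbq_model.py) onto disk in one call, a minimal sketch using huggingface_hub:

from huggingface_hub import snapshot_download

# Downloads the full repo and returns the local directory path
local_dir = snapshot_download("SamMikaelson/deepseek-ocr-mbq-w4bit")
print(local_dir)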

For questions or issues, please open an issue on the model repository.
