---
language:
- en
- zh
license: mit
library_name: transformers
tags:
- ocr
- quantization
- mbq
- deepseek
- vision-language
- standalone
base_model: deepseek-ai/DeepSeek-OCR
---
# DeepSeek-OCR MBQ Quantized Model (Standalone)
This is a **fully standalone** quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) using **MBQ (Mixed-precision post-training quantization)**.
**No need to download the original model** - all architecture files included!
## Model Details
- **Base Model**: deepseek-ai/DeepSeek-OCR
- **Quantization Method**: MBQ (mixed-precision post-training quantization)
- **Weight Precision**: 4-bit (mixed with 8-bit for sensitive layers)
- **Activation Precision**: 8-bit
- **Format**: SafeTensors (int8 quantized with scales)
- **Standalone**: All architecture files included ✅
## Quantization Statistics
| Metric | Value |
|--------|-------|
| Original Size | 6,672 MB (6.67 GB) |
| **Quantized Size** | **3,510 MB (3.51 GB)** |
| **Size Reduction** | **3,162 MB (47.4%)** |
| **Compression Ratio** | **1.90x** |
## Quick Start (Standalone - No Original Model Needed!)
### Installation
```bash
pip install torch transformers safetensors accelerate pillow
```
### Simple Loading (Recommended)
```python
import sys
import torch
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import snapshot_download

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the repo locally so the bundled helper script can be imported
repo_dir = snapshot_download("SamMikaelson/deepseek-ocr-mbq-w4bit")
sys.path.insert(0, repo_dir)

# Load tokenizer and model architecture directly - all files included!
tokenizer = AutoTokenizer.from_pretrained(repo_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_dir,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Load and dequantize the MBQ weights using the bundled helper
from load_mbq_model import load_mbq_model

state_dict = load_mbq_model(repo_dir)
model.load_state_dict(state_dict)
model = model.to(device).eval()
print("✅ Model loaded successfully!")
```
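Once loaded, inference should follow the original DeepSeek-OCR interface. A minimal usage sketch, assuming the remote code in this repo keeps the upstream model's `infer` helper and that `sample.png` is an image you supply:

```python
# Hypothetical usage - the prompt format and the `infer` signature follow the
# upstream DeepSeek-OCR remote code; adjust if this repo's version differs.
prompt = "<image>\nFree OCR. "
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="sample.png",  # your input image (assumption)
    output_path="./output",
)
print(result)
```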
### Manual Loading with Dequantization
```python
import torch
from transformers import AutoTokenizer, AutoModel
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
)

# Download and load the quantized weights
weights_path = hf_hub_download(
    "SamMikaelson/deepseek-ocr-mbq-w4bit", "model.safetensors"
)
state_dict = load_file(weights_path)

# Separate int8 weights from their scale factors
weights, scales = {}, {}
for name, param in state_dict.items():
    if ".scale" in name:
        scales[name.replace(".scale", "")] = param
    else:
        weights[name] = param

# Dequantize: int8 weight * scale, cast back to bfloat16
dequantized_state_dict = {}
for name, param in weights.items():
    if name in scales:
        dequantized_state_dict[name] = (param.float() * scales[name]).to(torch.bfloat16)
    else:
        dequantized_state_dict[name] = param

# Load the model architecture (included in this repo!)
model = AutoModel.from_pretrained(
    "SamMikaelson/deepseek-ocr-mbq-w4bit",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Replace the weights with the dequantized tensors
model.load_state_dict(dequantized_state_dict)
model = model.to(device).eval()
print("✅ Model loaded successfully!")
```
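After either loading path, a quick sanity check of the in-memory footprint. Note this is illustrative only: because both recipes dequantize to bfloat16, expect roughly the original model's runtime size rather than the 3.51 GB on-disk figure.

```python
# Sum parameter bytes of the loaded model (bf16 = 2 bytes per parameter)
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"In-memory parameter size: {total_bytes / 1024**3:.2f} GB")
```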
## Model Files
### Core Files
- **model.safetensors** (3.51 GB): Quantized model weights (int8 + scales)
- **load_mbq_model.py**: Helper script for loading
### Architecture Files (from original model)
- **modeling_deepseekocr.py**: Main model architecture
- **modeling_deepseekv2.py**: DeepSeek V2 backbone
- **configuration_deepseek_v2.py**: Model configuration
- **deepencoder.py**: Vision encoder
- **conversation.py**: Conversation utilities
- **processor_config.json**: Processor configuration
### Tokenizer & Config
- **tokenizer.json**: Tokenizer vocabulary
- **tokenizer_config.json**: Tokenizer configuration
- **config.json**: Model configuration
- **special_tokens_map.json**: Special tokens
### Metadata
- **quantization_metadata.json**: Quantization details
- **quantization_report.json**: Compression statistics
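To inspect the quantization details without loading the model, you can fetch and print the metadata file. The exact JSON schema isn't documented here, so treat this as a quick look rather than a stable API:

```python
import json
from huggingface_hub import hf_hub_download

# Download just the metadata file and pretty-print its contents
meta_path = hf_hub_download(
    "SamMikaelson/deepseek-ocr-mbq-w4bit", "quantization_metadata.json"
)
with open(meta_path) as f:
    print(json.dumps(json.load(f), indent=2))
```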
## Advantages
- **Standalone**: All files included; no need to download the original model
- **Smaller Size**: 47% reduction in on-disk model size
- **Easy Loading**: Simple `AutoModel.from_pretrained()` with `trust_remote_code=True`
- **Compatible**: Works with the standard transformers library
- **Preserved Quality**: Mixed precision maintains model performance
## MBQ Methodology
MBQ (Mixed-precision post-training quantization) intelligently allocates different bit-widths to layers based on their sensitivity:
1. **Sensitivity Analysis**: Computes sensitivity scores using Hessian approximation
2. **Mixed Precision**: High-sensitivity layers (top 15%) → 8-bit, others → 4-bit
3. **Symmetric Quantization**: Efficient quantization scheme for weights and activations
4. **Storage**: Weights stored as int8 with separate scale factors for true compression
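The storage format from step 4 can be illustrated with a per-channel symmetric int8 round trip. This is a minimal sketch of the scheme described above, not the exact code used to produce this checkpoint:

```python
import torch

def quantize_symmetric_int8(weight: torch.Tensor):
    """Per-output-channel symmetric int8 quantization: weight ≈ q * scale."""
    # Largest magnitude per output channel sets the scale
    max_abs = weight.abs().amax(dim=1, keepdim=True)
    scale = (max_abs / 127.0).clamp(min=1e-8)  # avoid division by zero
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

w = torch.randn(64, 128)
q, scale = quantize_symmetric_int8(w)
w_hat = q.float() * scale  # dequantize, as in the manual loading example above
print(f"max reconstruction error: {(w - w_hat).abs().max().item():.4f}")
```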
## Performance
- **On-Disk Size**: Reduced by 47.4% (from 6.67 GB to 3.51 GB)
- **Standalone**: No dependency on the original model repo ✅
- **Loading**: Smaller download and lower disk footprint mean faster first-time loading
- **Runtime Memory**: Note that the loading recipes above dequantize to bfloat16, so in-memory usage approaches the original model's
## Citation
If you use this quantized model, please cite:
```bibtex
@misc{deepseek-ocr-mbq,
  author       = {SamMikaelson},
  title        = {DeepSeek-OCR MBQ Quantized Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SamMikaelson/deepseek-ocr-mbq-w4bit}}
}
```
Original model:
```bibtex
@misc{deepseek-ocr,
  title        = {DeepSeek-OCR},
  author       = {DeepSeek-AI},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/deepseek-ai/DeepSeek-OCR}}
}
```
## License
MIT License (same as the base model)
## Troubleshooting
If you encounter issues loading the model:
1. Ensure `trust_remote_code=True` is set
2. Install required packages: `pip install -r requirements.txt`
3. Check that you're using transformers >= 4.40.0
4. Use the provided `load_mbq_model.py` helper script
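
A quick way to verify item 3 programmatically:

```python
from packaging import version
import transformers

# Fails loudly if the installed transformers is too old
assert version.parse(transformers.__version__) >= version.parse("4.40.0"), \
    f"transformers {transformers.__version__} is too old; need >= 4.40.0"
```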
For questions or issues, please open an issue on the model repository.