SamMikaelson committed
Commit 8b71211 · verified · 1 Parent(s): ec872d4

Add README.md

Files changed (1)
  1. README.md +129 -57
README.md CHANGED
@@ -10,14 +10,15 @@ tags:
  - mbq
  - deepseek
  - vision-language
  base_model: deepseek-ai/DeepSeek-OCR
  ---

- # DeepSeek-OCR MBQ Quantized Model

- This is a quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) using **MBQ (Mixed-precision post-training quantization)**.

- **Ready-to-use standalone model** with `model.safetensors` - no special loading code required!

  ## Model Details

@@ -25,96 +26,156 @@ This is a quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co
  - **Quantization Method**: MBQ (Mixed-precision Quantization)
  - **Weight Precision**: 4-bit (mixed with 8-bit for sensitive layers)
  - **Activation Precision**: 8-bit
- - **Preserve Ratio**: 15.0% of layers kept at 8-bit
- - **Format**: SafeTensors (bfloat16 dequantized for compatibility)

  ## Quantization Statistics

  | Metric | Value |
  |--------|-------|
- | Original Size | 6362.54 MB |
- | Quantized Size | 2223.75 MB |
- | **SafeTensors Size** | **3352.13 MB** |
- | **Size Reduction** | **4138.79 MB (65.05%)** |
- | **Compression Ratio** | **2.86x** |
- | Quantized Layers | 2342 |

- ## Usage

- ### Standard Loading (Recommended)
  ```python
- from transformers import AutoModel, AutoTokenizer

- # Load model directly - just like any other HF model!
- model = AutoModel.from_pretrained(
-     "SamMikaelson/deepseek-ocr-mbq-w4bit",
-     trust_remote_code=True,
-     torch_dtype="auto"
- )

  tokenizer = AutoTokenizer.from_pretrained(
      "SamMikaelson/deepseek-ocr-mbq-w4bit",
      trust_remote_code=True
  )

- # Use the model normally
- # model.eval()
- # outputs = model(inputs)
  ```

- ### Access Quantization Metadata
  ```python
  import torch

- # Load quantization info (optional - for analysis)
- quantized_info = torch.load("quantized_weights.pt", map_location="cpu")

- print(f"Compression ratio: {quantized_info['metadata']['stats']['compression_ratio']:.2f}x")
- print(f"Size reduction: {quantized_info['metadata']['stats']['size_reduction_percent']:.2f}%")
- print(f"Bit allocation: {quantized_info['metadata']['bit_allocation']}")
  ```

  ## Model Files

- - **model.safetensors**: Main model weights (dequantized to bfloat16 for compatibility)
- - **quantized_weights.pt**: Original quantized weights + metadata
  - **config.json**: Model configuration
- - **tokenizer files**: Tokenizer configuration
- - **quantization_report.json**: Detailed quantization statistics

- ## Quantization Configuration

- ```python
- {
-     'w_bit': 4,
-     'a_bit': 8,
-     'mixed_precision': True,
-     'sensitivity_metric': 'hessian',
-     'preserve_ratio': 0.15
- }
- ```

  ## MBQ Methodology

  MBQ (Mixed-precision post-training quantization) intelligently allocates different bit-widths to layers based on their sensitivity:

- 1. **Sensitivity Analysis**: Computes sensitivity scores using hessian metric
- 2. **Mixed Precision**: High-sensitivity layers (top 15.0%) → 8-bit, others → 4-bit
  3. **Symmetric Quantization**: Efficient quantization scheme for weights and activations
- 4. **Dequantization**: Weights stored as bfloat16 in safetensors for full compatibility

  ## Performance

- - **Memory Usage**: Reduced by 65.05%
- - **Model Size**: From 6362.54 MB to 3352.13 MB
- - **Compatibility**: Works with standard transformers library
- - **Inference**: Lower memory footprint, faster inference on resource-constrained devices
-
- ## Notes
-
- The model.safetensors file contains dequantized weights in bfloat16 format for maximum compatibility with the transformers library. While this is larger than the fully quantized version, it still achieves significant size reduction (65.05%) while maintaining ease of use.
-
- For the fully compressed quantized weights, see `quantized_weights.pt`.

  ## Citation

@@ -142,4 +203,15 @@ Original model:

  ## License

- Same as the base model: MIT License

  - mbq
  - deepseek
  - vision-language
+ - standalone
  base_model: deepseek-ai/DeepSeek-OCR
  ---

+ # DeepSeek-OCR MBQ Quantized Model (Standalone)

+ This is a **fully standalone** quantized version of [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) using **MBQ (Mixed-precision post-training quantization)**.

+ **No need to download the original model** - all architecture files included!

  ## Model Details

  - **Quantization Method**: MBQ (Mixed-precision Quantization)
  - **Weight Precision**: 4-bit (mixed with 8-bit for sensitive layers)
  - **Activation Precision**: 8-bit
+ - **Format**: SafeTensors (int8 quantized with scales)
+ - **Standalone**: All architecture files included

  ## Quantization Statistics

  | Metric | Value |
  |--------|-------|
+ | Original Size | 6,672 MB (6.67 GB) |
+ | **Quantized Size** | **3,510 MB (3.51 GB)** |
+ | **Size Reduction** | **3,162 MB (47.4%)** |
+ | **Compression Ratio** | **1.90x** |
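As a quick cross-check, the figures in the table above are mutually consistent; a few lines of plain arithmetic on the two reported sizes (illustrative only, the authoritative numbers are in `quantization_report.json`):

```python
# Sanity-check the reported statistics from the two sizes alone.
original_mb = 6672
quantized_mb = 3510

reduction_mb = original_mb - quantized_mb           # 3162 MB
reduction_pct = 100 * reduction_mb / original_mb    # ~47.4%
compression_ratio = original_mb / quantized_mb      # ~1.90x

print(f"{reduction_mb} MB saved ({reduction_pct:.1f}%), {compression_ratio:.2f}x compression")
```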
 
 
+ ## Quick Start (Standalone - No Original Model Needed!)
+
+ ### Installation
+
+ ```bash
+ pip install torch transformers safetensors accelerate pillow
+ ```
+
+ ### Simple Loading (Recommended)

  ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel

+ # Device setup
+ device = "cuda" if torch.cuda.is_available() else "cpu"

+ # Load model and tokenizer directly - all files included!
  tokenizer = AutoTokenizer.from_pretrained(
      "SamMikaelson/deepseek-ocr-mbq-w4bit",
      trust_remote_code=True
  )

+ model = AutoModel.from_pretrained(
+     "SamMikaelson/deepseek-ocr-mbq-w4bit",
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16
+ )
+
+ # Load the quantized weights using the helper
+ from load_mbq_model import load_mbq_model
+ state_dict = load_mbq_model("./")  # Assumes files are in current directory
+
+ model.load_state_dict(state_dict)
+ model = model.to(device).eval()
+
+ print("✅ Model loaded successfully!")
  ```
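The `load_mbq_model.py` helper used above ships with the repository; its exact contents are not reproduced in this README. Assuming the int8-plus-`.scale` layout used in the manual example below, it presumably does something along these lines (a sketch under that assumption, not the actual script):

```python
import os

import torch
from safetensors.torch import load_file


def load_mbq_model(model_dir: str) -> dict:
    """Return a bfloat16 state_dict from an MBQ checkpoint directory.

    Sketch only: int8 tensors are multiplied by their matching '.scale'
    tensors; anything without a scale is passed through unchanged.
    """
    state_dict = load_file(os.path.join(model_dir, "model.safetensors"))

    scales = {k.replace(".scale", ""): v for k, v in state_dict.items() if ".scale" in k}
    weights = {k: v for k, v in state_dict.items() if ".scale" not in k}

    dequantized = {}
    for name, tensor in weights.items():
        if name in scales:
            dequantized[name] = (tensor.float() * scales[name]).to(torch.bfloat16)
        else:
            dequantized[name] = tensor
    return dequantized
```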

+ ### Manual Loading with Dequantization
+
  ```python
  import torch
+ from transformers import AutoTokenizer, AutoModel
+ from safetensors.torch import load_file

+ device = "cuda" if torch.cuda.is_available() else "cpu"

+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(
+     "SamMikaelson/deepseek-ocr-mbq-w4bit",
+     trust_remote_code=True
+ )
+
+ # Load quantized weights
+ state_dict = load_file("model.safetensors")
+
+ # Separate weights and scales
+ weights = {}
+ scales = {}
+
+ for name, param in state_dict.items():
+     if '.scale' in name:
+         scales[name.replace('.scale', '')] = param
+     else:
+         weights[name] = param
+
+ # Dequantize weights
+ dequantized_state_dict = {}
+ for name, param in weights.items():
+     if name in scales:
+         scale = scales[name]
+         dequantized = (param.float() * scale).to(torch.bfloat16)
+         dequantized_state_dict[name] = dequantized
+     else:
+         dequantized_state_dict[name] = param
+
+ # Load model architecture (included in this repo!)
+ model = AutoModel.from_pretrained(
+     "SamMikaelson/deepseek-ocr-mbq-w4bit",
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16
+ )
+
+ # Load the dequantized weights
+ model.load_state_dict(dequantized_state_dict)
+ model = model.to(device).eval()
+
+ print("✅ Model loaded successfully!")
  ```

  ## Model Files

+ ### Core Files
+ - **model.safetensors** (3.51 GB): Quantized model weights (int8 + scales; see the inspection sketch after this section)
+ - **load_mbq_model.py**: Helper script for loading
+
+ ### Architecture Files (from original model)
+ - **modeling_deepseekocr.py**: Main model architecture
+ - **modeling_deepseekv2.py**: DeepSeek V2 backbone
+ - **configuration_deepseek_v2.py**: Model configuration
+ - **deepencoder.py**: Vision encoder
+ - **conversation.py**: Conversation utilities
+ - **processor_config.json**: Processor configuration
+
+ ### Tokenizer & Config
+ - **tokenizer.json**: Tokenizer vocabulary
+ - **tokenizer_config.json**: Tokenizer configuration
  - **config.json**: Model configuration
+ - **special_tokens_map.json**: Special tokens
+
+ ### Metadata
+ - **quantization_metadata.json**: Quantization details
+ - **quantization_report.json**: Compression statistics
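To verify the storage format, you can list the tensors stored in `model.safetensors`; quantized weights should show up as int8 tensors paired with floating-point `.scale` tensors (a small sketch, assuming the file sits in the current directory):

```python
from safetensors import safe_open

# List every tensor with its dtype and shape.
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        print(f"{key}: dtype={tensor.dtype}, shape={tuple(tensor.shape)}")
```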

+ ## Advantages
+
+ ✅ **Standalone**: All files included, no need to download the original model
+ ✅ **Smaller Size**: 47% reduction in model size
+ ✅ **Easy Loading**: Simple `AutoModel.from_pretrained()` with `trust_remote_code=True`
+ ✅ **Compatible**: Works with the standard transformers library
+ ✅ **Preserved Quality**: Mixed precision maintains model performance

  ## MBQ Methodology

  MBQ (Mixed-precision post-training quantization) intelligently allocates different bit-widths to layers based on their sensitivity:

+ 1. **Sensitivity Analysis**: Computes sensitivity scores using a Hessian approximation
+ 2. **Mixed Precision**: High-sensitivity layers (top 15%) → 8-bit, others → 4-bit
  3. **Symmetric Quantization**: Efficient quantization scheme for weights and activations
+ 4. **Storage**: Weights stored as int8 with separate scale factors for true compression (see the sketch below)
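A minimal sketch of steps 2–4: per-tensor symmetric quantization with a stored scale, plus a simple threshold-based bit split. This is illustrative only; the actual MBQ pipeline (Hessian scoring, calibration, 4-bit packing) is not reproduced here, and the layer names and scores below are made up.

```python
import torch


def symmetric_quantize(weight: torch.Tensor, n_bits: int):
    """Symmetric quantization: one scale per tensor, signed integer codes."""
    qmax = 2 ** (n_bits - 1) - 1                     # 127 for 8-bit, 7 for 4-bit
    scale = weight.abs().max() / qmax
    codes = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return codes.to(torch.int8), scale               # 4-bit codes still ride in an int8 container


def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (codes.float() * scale).to(torch.bfloat16)


# Mixed precision: keep the ~15% most sensitive layers at 8-bit, the rest at 4-bit.
sensitivity = {"layer_a": 0.90, "layer_b": 0.10, "layer_c": 0.05}   # toy Hessian-style scores
n_keep = max(1, round(0.15 * len(sensitivity)))
keep_8bit = set(sorted(sensitivity, key=sensitivity.get, reverse=True)[:n_keep])
bits = {name: 8 if name in keep_8bit else 4 for name in sensitivity}

w = torch.randn(256, 256)
codes, scale = symmetric_quantize(w, bits["layer_b"])
w_approx = dequantize(codes, scale)                  # this is what loading does at runtime
```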

  ## Performance

+ - **Memory Usage**: Reduced by 47.4%
+ - **Model Size**: From 6.67 GB to 3.51 GB
+ - **Standalone**: No dependency on the original model repo ✅
+ - **Inference**: Lower memory footprint, faster loading

  ## Citation

  ## License

+ MIT License (same as the base model)
+
+ ## Troubleshooting
+
+ If you encounter issues loading the model:
+
+ 1. Ensure `trust_remote_code=True` is set
+ 2. Install required packages: `pip install -r requirements.txt`
+ 3. Check that you're using transformers >= 4.40.0 (see the version check below)
+ 4. Use the provided `load_mbq_model.py` helper script
+ 4. Use the provided `load_mbq_model.py` helper script
216
+
217
+ For questions or issues, please open an issue on the model repository.