lvyufeng committed
Commit 785205e · verified · 1 Parent(s): 4b90c4f

update readme to show the combined MoE version
Files changed (1): README.md (+50 -0)
README.md CHANGED
@@ -49,6 +49,49 @@ The official version of DeepSeek-OCR has limited the transformers version to 4.4

  Feel free to opt for various attention implementations such as Flash Attention or SDPA to leverage the latest optimizations in transformers for a performance boost.

+ ## Combined MoE
+
+ In Transformer-based Mixture-of-Experts (MoE) models, the conventional approach relies on an MoE gating module to select experts, after which the hidden states are processed expert by expert in iterative loops. These loops introduce host-bound comparisons that can significantly slow down token generation, especially on Ascend hardware.
+
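For contrast, here is a minimal sketch of the conventional looped routing described above. It is illustrative only (not code from this repository); `gate`, `experts`, `n_experts`, and the top-k shapes are generic assumptions:

```python
import torch

def looped_moe_forward(hidden_states, gate, experts, n_experts):
    # hidden_states: [tokens, hidden]; gate returns top-k expert ids and weights per token
    selected_experts, routing_weights = gate(hidden_states)  # both [tokens, k]
    output = torch.zeros_like(hidden_states)
    for expert_id in range(n_experts):
        # Host-side comparison to find the tokens routed to this expert;
        # this data-dependent control flow is what stalls accelerators.
        token_idx, k_idx = torch.where(selected_experts == expert_id)
        if token_idx.numel() == 0:
            continue
        expert_out = experts[expert_id](hidden_states[token_idx])
        output.index_add_(0, token_idx, expert_out * routing_weights[token_idx, k_idx].unsqueeze(-1))
    return output
```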
+ To address this, we consolidate each MoE layer's routed experts into three stacked weight matrices (gate_proj, up_proj, and down_proj), so that all experts can be evaluated with batched matrix multiplications instead of a per-expert loop. This design is particularly suitable for smaller MoE models that can be fully loaded into memory. Below is the key implementation:
+
+ ```python
+ # Combine the weights of all routed experts before inference.
+ # Each stacked matrix has shape [n_routed_experts, in_features, out_features].
+ for layer in self.model.layers:
+     if isinstance(layer.mlp, DeepseekV2MoE):
+         moe_layer = layer.mlp
+         # combine experts
+         moe_layer.w1 = nn.Parameter(torch.stack([moe_layer.experts[i].gate_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
+         moe_layer.w2 = nn.Parameter(torch.stack([moe_layer.experts[i].down_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
+         moe_layer.w3 = nn.Parameter(torch.stack([moe_layer.experts[i].up_proj.weight.T for i in range(moe_layer.config.n_routed_experts)]), requires_grad=False)
+
+ # Patched forward method for DeepseekV2MoE: evaluates all experts with batched
+ # matmuls and weights the results by the dense router scores.
+ def new_forward_for_moe(self, hidden_states):
+     batch_size, sequence_length, hidden_dim = hidden_states.shape
+     selected_experts, routing_weights = self.gate(hidden_states)
+     # scatter the top-k routing weights into a dense [tokens, n_routed_experts] matrix
+     router_scores = torch.zeros(size=(batch_size * sequence_length, self.config.n_routed_experts), device=hidden_states.device, dtype=hidden_states.dtype)
+     # we cast back to the input dtype
+     routing_weights = routing_weights.to(hidden_states.dtype)
+     router_scores = torch.scatter_add(router_scores, -1, selected_experts, routing_weights)
+     hidden_states = hidden_states.view(-1, hidden_dim)
+     if self.config.n_shared_experts is not None:
+         shared_expert_output = self.shared_experts(hidden_states)
+
+     # [tokens, hidden] x [n_experts, hidden, intermediate] -> [n_experts, tokens, intermediate]
+     hidden_w1 = torch.matmul(hidden_states, self.w1)
+     hidden_w3 = torch.matmul(hidden_states, self.w3)
+     hidden_states = self.act(hidden_w1) * hidden_w3
+     # [n_experts, tokens, hidden], scaled per expert by its router score, then summed over experts
+     hidden_states = torch.bmm(hidden_states, self.w2) * torch.transpose(router_scores, 0, 1).unsqueeze(-1)
+     final_hidden_states = hidden_states.sum(dim=0, dtype=hidden_states.dtype)
+     if self.config.n_shared_experts is not None:
+         final_hidden_states = final_hidden_states + shared_expert_output
+     return final_hidden_states.view(batch_size, sequence_length, hidden_dim)
+ ```
+
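The snippet above defines `new_forward_for_moe` but does not show the step that attaches it; in this repository the whole procedure is wrapped by `model.combine_moe()` (used in the usage snippets below). Purely as an illustration, and assuming the decoder layers are reachable as `model.model.layers` and that `DeepseekV2MoE` is importable from the model's remote code, the binding could look like this:

```python
import types

# Hypothetical wiring; the released model wraps this logic inside `model.combine_moe()`.
for layer in model.model.layers:
    if isinstance(layer.mlp, DeepseekV2MoE):
        # bind the batched forward to this MoE module instance
        layer.mlp.forward = types.MethodType(new_forward_for_moe, layer.mlp)
```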
+ As a result, we achieve a 3–4x speedup in OCR text generation, which makes the combined-MoE path well suited to production deployments.
+
  ## MindSpore Usage
  Inference using Huggingface transformers on Ascend NPUs. Requirements tested on MindSpore2.7+ CANN8.2:
 
@@ -74,6 +117,9 @@ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
  model = AutoModel.from_pretrained(model_name, dtype=mindspore.float16, _attn_implementation='sdpa', trust_remote_code=True, use_safetensors=True, device_map='auto')
  model = model.eval()

+ # combine experts
+ model.combine_moe()
+
  # prompt = "<image>\nFree OCR. "
  prompt = "<image>\n<|grounding|>Convert the document to markdown. "
  image_file = 'your_image.jpg'

@@ -114,6 +160,10 @@ model_name = 'lvyufeng/DeepSeek-OCR'
  tokenizer = AutoTokenizer.from_pretrained(model_name, dtype=torch.bfloat16,trust_remote_code=True, device_map='auto')
  model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
  model = model.eval()
+
+ # combine experts
+ model.combine_moe()
+
  # prompt = "<image>\nFree OCR. "
  prompt = "<image>\n<|grounding|>Convert the document to markdown. "
  image_file = 'your_image.jpg'