Make compatible with newer transformers
Issue
The model fails to load with new Transformers versions due to removed classes:
```
ImportError: cannot import name 'LlamaFlashAttention2' from 'transformers.models.llama.modeling_llama'
```
Root Cause
In modeling_deepseekv2.py (lines 37-39), the code imports:
```python
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaFlashAttention2
)
```
These classes were removed in Transformers 4.47+ as part of the attention refactoring.
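A quick way to confirm this against your installed version (a minimal check, using the same import path the remote code relies on):

```python
import transformers
from transformers.models.llama import modeling_llama

# Prints False on versions where the per-backend attention classes were removed,
# which is exactly when the model's remote code fails to import.
print(transformers.__version__, hasattr(modeling_llama, "LlamaFlashAttention2"))
```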
Proposed Fix
Since DeepSeek-OCR uses MLA (Multi-head Latent Attention) by default (config.use_mla = True), the Llama attention classes are only used as fallbacks for MHA mode.
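If you want to check what your checkpoint actually selects before loading any weights, you can inspect the config; a minimal sketch, assuming `use_mla` is the flag name used by the remote config (it may be absent or set to `false`, as noted later in this thread):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
# None means the flag is not present in config.json at all.
print(getattr(config, "use_mla", None))
```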
Option 1: Remove MHA support (simplest)
- Remove the imports (lines 37-39)
- Update the `ATTENTION_CLASSES` dict (lines 1022-1029):
```python
ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
    # Removed mha_eager and mha_flash_attention_2
}
```
Option 2: Use DeepSeek attention for MHA mode (backward compatible)
Keep the same keys but map to DeepSeek classes:
```python
ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
    "mha_eager": DeepseekV2Attention,                    # Changed
    "mha_flash_attention_2": DeepseekV2FlashAttention2,  # Changed
}
```
Option 3: Conditional import (most flexible)
```python
try:
    from transformers.models.llama.modeling_llama import (
        LlamaAttention,
        LlamaFlashAttention2
    )
    HAS_LLAMA_ATTENTION = True
except ImportError:
    HAS_LLAMA_ATTENTION = False

ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
}

if HAS_LLAMA_ATTENTION:
    ATTENTION_CLASSES.update({
        "mha_eager": LlamaAttention,
        "mha_flash_attention_2": LlamaFlashAttention2
    })
else:
    ATTENTION_CLASSES.update({
        "mha_eager": DeepseekV2Attention,
        "mha_flash_attention_2": DeepseekV2FlashAttention2
    })
```
This works because DeepSeek-OCR uses MLA by default anyway.
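To make the reasoning explicit: the attention class is presumably looked up with a key that combines the MLA/MHA mode with the attention implementation. The helper below is purely illustrative (the function name and exact key construction are assumptions, not verified against modeling_deepseekv2.py):

```python
# Hypothetical illustration only; modeling_deepseekv2.py may build the key differently.
def pick_attention_key(use_mla: bool, attn_implementation: str) -> str:
    prefix = "mla_" if use_mla else "mha_"
    return prefix + attn_implementation

print(pick_attention_key(True, "flash_attention_2"))   # "mla_flash_attention_2" -> present in all options
print(pick_attention_key(False, "flash_attention_2"))  # "mha_flash_attention_2" -> missing under Option 1
```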
All the same issues are still there: nothing will open and nothing works. This is a useless app, please fix it.
@harpreetsahota Does it really use MLA by default? Over here it says `"use_mla": false`, and mapping `"mha_flash_attention_2"` to `DeepseekV2FlashAttention2` still does not work for me, but I am not sure if it is an unrelated issue.
@bigpappic @mingyi456 @laxmareddyp @yayoimizuha
Hey guys! Hope this post helps with the compatibility issues.
Post: https://huggingface.co/posts/prithivMLmods/374605520852651
Demo: https://huggingface.co/spaces/prithivMLmods/DeepSeek-OCR-experimental
```
transformers==4.57.1
torch
einops
addict
easydict
matplotlib
```
```python
import os
import torch
import requests
from transformers import AutoModel, AutoTokenizer
from typing import Iterable

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "prithivMLmods/DeepSeek-OCR-Latest-BF16.I64"  # (https://huggingface.co/deepseek-ai/DeepSeek-OCR)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
).to(device).eval()
```
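For reference, inference afterwards goes through the `model.infer(...)` helper exposed by the remote code, following the usage shown on the upstream DeepSeek-OCR model card; the paths and size settings below are placeholders, so adjust them to your setup:

```python
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = "your_image.jpg"   # placeholder input image
output_path = "outputs"         # placeholder output directory

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```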
> @harpreetsahota Does it really use MLA by default? Over here it says `"use_mla": false`, and mapping `"mha_flash_attention_2"` to `DeepseekV2FlashAttention2` still does not work for me, but I am not sure if it is an unrelated issue.
Does it default to `flash_attention_2`? The DeepSeek-OCR sample code on GitHub:

```python
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)
```
MLA and FA2 are independent of each other.
Adding a monkey patch may help, but other errors still show up... The best way, as GPT suggested to me, is to edit the remote code locally and change the model's config.json to use the local code:
```python
from transformers.models.llama import modeling_llama

# Shim: newer Transformers removed LlamaFlashAttention2, so alias it to
# LlamaAttention so that the model's remote code can still import it.
if not hasattr(modeling_llama, "LlamaFlashAttention2"):
    class LlamaFlashAttention2(modeling_llama.LlamaAttention):
        pass

    modeling_llama.LlamaFlashAttention2 = LlamaFlashAttention2
```
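If you go the monkey-patch route, the shim has to run before the remote modeling code is imported, i.e. before the `from_pretrained` call with `trust_remote_code=True`. A minimal sketch of the ordering (the repo id is the upstream one; swap in whichever checkpoint you use, and note the thread above reports other errors may still remain):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from transformers.models.llama import modeling_llama

# 1) Apply the shim shown above first.
if not hasattr(modeling_llama, "LlamaFlashAttention2"):
    class LlamaFlashAttention2(modeling_llama.LlamaAttention):
        pass

    modeling_llama.LlamaFlashAttention2 = LlamaFlashAttention2

# 2) Only then load the model; trust_remote_code imports modeling_deepseekv2.py
#    here, which can now resolve LlamaFlashAttention2.
model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().to(torch.bfloat16)
```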