Make compatible with newer transformers

#38
by harpreetsahota - opened

Issue

The model fails to load with new Transformers versions due to removed classes:

ImportError: cannot import name 'LlamaFlashAttention2' from 'transformers.models.llama.modeling_llama'

Root Cause

In modeling_deepseekv2.py (lines 37-39), the code imports:

from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaFlashAttention2
)

These classes were removed in Transformers 4.47+ as part of the attention refactoring.
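To confirm which case applies on a given install, one option is to probe the module for the two names before importing them. The sketch below is a minimal, generic helper. `has_symbol` and `report_llama_attention` are hypothetical names, not part of Transformers, and the code relies only on the standard library's `importlib`.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `symbol`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is missing or could not be imported.
        return False
    return hasattr(module, symbol)


def report_llama_attention() -> dict:
    """Check the two names that modeling_deepseekv2.py tries to import."""
    target = "transformers.models.llama.modeling_llama"
    return {
        name: has_symbol(target, name)
        for name in ("LlamaAttention", "LlamaFlashAttention2")
    }
```

Calling `report_llama_attention()` in the affected environment shows which of the two names is actually missing, which helps when deciding between the options below.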

Proposed Fix

Since DeepSeek-OCR uses MLA (Multi-head Latent Attention) by default (config.use_mla = True), the Llama attention classes are only used as fallbacks for MHA mode.

Option 1: Remove MHA support (simplest)

  1. Remove the imports (lines 37-39)
  2. Update ATTENTION_CLASSES dict (lines 1022-1029):
ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
    # Removed mha_eager and mha_flash_attention_2
}
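Option 1 is only safe if nothing ever asks for an `mha_*` key. The sketch below illustrates this, assuming the decoder layer builds the lookup key by prefixing `mla_` or `mha_` to the configured attention implementation depending on `config.use_mla`. That selection rule is an assumption to verify against the actual modeling code. The stand-in classes and both helper functions are hypothetical and exist only to make the example run.

```python
# Stand-ins for the real classes in modeling_deepseekv2.py (hypothetical).
class DeepseekV2Attention:
    pass


class DeepseekV2FlashAttention2:
    pass


# Option 1 mapping: the mha_* entries are gone.
ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
}


def select_attention_key(use_mla: bool, attn_implementation: str) -> str:
    # Assumed selection rule; check the real decoder layer before relying on it.
    prefix = "mla_" if use_mla else "mha_"
    return prefix + attn_implementation


def resolve_attention_class(use_mla: bool, attn_implementation: str):
    """Look up the attention class, failing with a readable message."""
    key = select_attention_key(use_mla, attn_implementation)
    try:
        return ATTENTION_CLASSES[key]
    except KeyError:
        raise ValueError(f"No attention class registered for {key!r}") from None
```

Under that assumption, a config with `use_mla` set to false would request `mha_eager` or `mha_flash_attention_2` and fail with Option 1, so it is worth checking the flag before removing those keys.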

Option 2: Use DeepSeek attention for MHA mode (backward compatible)

Keep the same keys but map to DeepSeek classes:

ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
    "mha_eager": DeepseekV2Attention,  # Changed
    "mha_flash_attention_2": DeepseekV2FlashAttention2,  # Changed
}

Option 3: Conditional import (most flexible)

try:
    from transformers.models.llama.modeling_llama import (
        LlamaAttention,
        LlamaFlashAttention2
    )
    HAS_LLAMA_ATTENTION = True
except ImportError:
    HAS_LLAMA_ATTENTION = False

ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
}

if HAS_LLAMA_ATTENTION:
    ATTENTION_CLASSES.update({
        "mha_eager": LlamaAttention,
        "mha_flash_attention_2": LlamaFlashAttention2
    })
else:
    ATTENTION_CLASSES.update({
        "mha_eager": DeepseekV2Attention,
        "mha_flash_attention_2": DeepseekV2FlashAttention2
    })

This works because DeepSeek-OCR uses MLA by default anyway.

All the same issues are still there. Nothing will open and nothing works. This is a useless app; fix it.

@harpreetsahota Does it really use MLA by default? Over here it says "use_mla": false, and mapping "mha_flash_attention_2" to DeepseekV2FlashAttention2 still does not work for me, but I am not sure if it is an unrelated issue.
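A quick way to settle the question for a local copy of the model is to search its config.json for every `use_mla` entry, since the flag may appear at the top level, inside a nested sub-config, or both. This is a minimal sketch: the helper names are hypothetical, and the file path is whatever local copy you have downloaded.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Tuple


def find_key(node: Any, key: str, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every occurrence of `key` in nested JSON."""
    if isinstance(node, dict):
        for name, value in node.items():
            child = f"{path}.{name}" if path else name
            if name == key:
                yield child, value
            # Keep descending: the same key may also appear deeper down.
            yield from find_key(value, key, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_key(value, key, f"{path}[{index}]")


def use_mla_settings(config_path: str) -> list:
    """List every `use_mla` entry found in a local config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return list(find_key(config, "use_mla"))
```

If any entry comes back as `False`, the MHA code path is the one to check when picking a fix.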

@bigpappic @mingyi456 @laxmareddyp @yayoimizuha
Hey guys! Hope this post helps with the compatibility issues.

Post: https://huggingface.co/posts/prithivMLmods/374605520852651
Demo: https://huggingface.co/spaces/prithivMLmods/DeepSeek-OCR-experimental

transformers==4.57.1
torch
einops
addict
easydict
matplotlib

import os
import torch
import requests
from transformers import AutoModel, AutoTokenizer
from typing import Iterable

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "prithivMLmods/DeepSeek-OCR-Latest-BF16.I64" # - (https://huggingface.co/deepseek-ai/DeepSeek-OCR)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
).to(device).eval()

Colab: https://huggingface.co/datasets/strangervisionhf/model-infer-test/blob/main/DeepSeek_OCR_Latest_BF16_I64.ipynb

Doesn't it use flash_attention_2 by default?

This is the DeepSeek-OCR sample code on GitHub:

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

MLA and FA2 are independent of each other.

Adding a monkey patch may help, but other errors then appear. The best approach GPT suggested to me is to edit the remote code and edit the model's config.json so that it uses the local code.

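from transformers.models.llama import modeling_llama

# Run this before loading the model so the remote code's import succeeds.
# The alias only restores the name; it behaves exactly like LlamaAttention.
if not hasattr(modeling_llama, "LlamaFlashAttention2"):
    class LlamaFlashAttention2(modeling_llama.LlamaAttention):
        pass
    modeling_llama.LlamaFlashAttention2 = LlamaFlashAttention2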
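The same aliasing pattern can be tried without Transformers installed by running it against a stand-in module. This is only a sketch: `alias_missing_class` and `fake_modeling_llama` are hypothetical names used for illustration.

```python
import types


def alias_missing_class(module, missing_name: str, base_name: str) -> bool:
    """Add `missing_name` to `module` as an empty subclass of `base_name`.

    Returns True if an alias was created, False if the name already existed.
    """
    if hasattr(module, missing_name):
        return False
    base = getattr(module, base_name)
    # Same effect as `class <missing_name>(base): pass`.
    alias = type(missing_name, (base,), {})
    setattr(module, missing_name, alias)
    return True


# Stand-in for transformers.models.llama.modeling_llama (hypothetical).
fake_modeling_llama = types.ModuleType("fake_modeling_llama")


class LlamaAttention:
    """Placeholder for the real attention class."""


fake_modeling_llama.LlamaAttention = LlamaAttention
created = alias_missing_class(
    fake_modeling_llama, "LlamaFlashAttention2", "LlamaAttention"
)
```

The alias only makes the import succeed. It inherits whatever `LlamaAttention` does in the installed version, so it does not bring back the behavior of the removed class, which may be why other errors still show up afterwards.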
