Make compatible with newer transformers

#38

by harpreetsahota - opened Oct 23, 2025

Oct 23, 2025

Issue

The model fails to load with new Transformers versions due to removed classes:

ImportError: cannot import name 'LlamaFlashAttention2' from 'transformers.models.llama.modeling_llama'

Root Cause

In modeling_deepseekv2.py (lines 37-39), the code imports:

from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaFlashAttention2
)

These classes were removed in Transformers 4.47+ as part of the attention refactoring.
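One way to confirm which of the imported names are missing in a given environment is to probe the module instead of importing from it. This is only a minimal sketch: the helper name `missing_names` is hypothetical, and it treats a module that cannot be imported at all as missing every requested name.

```python
import importlib


def missing_names(module_name, names):
    """Return the subset of `names` that `module_name` does not expose."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is unavailable, so every requested name is missing.
        return list(names)
    return [name for name in names if not hasattr(module, name)]
```

Calling `missing_names("transformers.models.llama.modeling_llama", ["LlamaAttention", "LlamaFlashAttention2"])` should list whichever of the two names the installed Transformers release no longer provides.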

Proposed Fix

Since DeepSeek-OCR uses MLA (Multi-head Latent Attention) by default (config.use_mla = True), the Llama attention classes are only used as fallbacks for MHA mode.

Option 1: Remove MHA support (simplest)

1. Remove the imports (lines 37-39).
2. Update the ATTENTION_CLASSES dict (lines 1022-1029):

ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
    # Removed mha_eager and mha_flash_attention_2
}

Option 2: Use DeepSeek attention for MHA mode (backward compatible)

Keep the same keys but map to DeepSeek classes:

ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
    "mha_eager": DeepseekV2Attention,  # Changed
    "mha_flash_attention_2": DeepseekV2FlashAttention2,  # Changed
}

Option 3: Conditional import (most flexible)

try:
    from transformers.models.llama.modeling_llama import (
        LlamaAttention,
        LlamaFlashAttention2
    )
    HAS_LLAMA_ATTENTION = True
except ImportError:
    HAS_LLAMA_ATTENTION = False

ATTENTION_CLASSES = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
}

if HAS_LLAMA_ATTENTION:
    ATTENTION_CLASSES.update({
        "mha_eager": LlamaAttention,
        "mha_flash_attention_2": LlamaFlashAttention2
    })
else:
    ATTENTION_CLASSES.update({
        "mha_eager": DeepseekV2Attention,
        "mha_flash_attention_2": DeepseekV2FlashAttention2
    })

This works because DeepSeek-OCR uses MLA by default anyway.
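Option 3 treats the two Llama classes as all-or-nothing. A variant that resolves each class on its own can be sketched as follows. The helpers `resolve_class` and `build_attention_classes` are hypothetical names, and the two DeepSeek classes are empty stand-ins so the sketch runs by itself; in modeling_deepseekv2.py they would be the real classes. Whether a Llama class that still exists in a newer release remains call-compatible with this model code is not something the sketch checks.

```python
import importlib


def resolve_class(module_name, class_name, fallback):
    """Return `module_name.class_name` if it can be found, else `fallback`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return fallback
    return getattr(module, class_name, fallback)


# Stand-ins for the real classes defined in modeling_deepseekv2.py.
class DeepseekV2Attention:
    pass


class DeepseekV2FlashAttention2:
    pass


def build_attention_classes(
    llama_module="transformers.models.llama.modeling_llama",
):
    """Build the registry, falling back per class when a Llama name is gone."""
    return {
        "eager": DeepseekV2Attention,
        "flash_attention_2": DeepseekV2FlashAttention2,
        "mla_eager": DeepseekV2Attention,
        "mla_flash_attention_2": DeepseekV2FlashAttention2,
        "mha_eager": resolve_class(
            llama_module, "LlamaAttention", DeepseekV2Attention
        ),
        "mha_flash_attention_2": resolve_class(
            llama_module, "LlamaFlashAttention2", DeepseekV2FlashAttention2
        ),
    }
```

The per-class lookup keeps the registry keys stable across Transformers versions, which is the same backward-compatibility goal as Option 2.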

bigpappic

Oct 23, 2025

All the same issues are still there. Nothing will open and nothing works. This is a useless app; fix it.

laxmareddyp

Oct 24, 2025 (edited Oct 24, 2025)

yayoimizuha

Oct 29, 2025

mingyi456

Oct 29, 2025

@harpreetsahota Does it really use MLA by default? Over here it says "use_mla": false, and mapping "mha_flash_attention_2" to DeepseekV2FlashAttention2 still does not work for me, but I am not sure if it is an unrelated issue.
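Why the value of use_mla matters here can be sketched with stand-in classes. The sketch assumes the decoder layer builds its lookup key by prefixing "mla_" or "mha_" to the configured attention implementation; that is inferred from the key names in ATTENTION_CLASSES and has not been verified against modeling_deepseekv2.py. The helper names are hypothetical.

```python
from types import SimpleNamespace


# Stand-ins for the real classes defined in modeling_deepseekv2.py.
class DeepseekV2Attention:
    pass


class DeepseekV2FlashAttention2:
    pass


def attention_key(config):
    """Assumed key scheme: 'mla_' or 'mha_' plus the attention implementation."""
    prefix = "mla_" if config.use_mla else "mha_"
    return prefix + config._attn_implementation


def pick_attention_class(config, attention_classes):
    key = attention_key(config)
    if key not in attention_classes:
        raise KeyError(f"no attention class registered for {key!r}")
    return attention_classes[key]


# Registry from Option 1 (no mha_* entries) and from Option 2 (mha_* remapped).
OPTION_1 = {
    "eager": DeepseekV2Attention,
    "flash_attention_2": DeepseekV2FlashAttention2,
    "mla_eager": DeepseekV2Attention,
    "mla_flash_attention_2": DeepseekV2FlashAttention2,
}
OPTION_2 = {
    **OPTION_1,
    "mha_eager": DeepseekV2Attention,
    "mha_flash_attention_2": DeepseekV2FlashAttention2,
}

# A config with "use_mla": false, as reported in this comment.
config = SimpleNamespace(use_mla=False, _attn_implementation="flash_attention_2")
```

Under that assumption, a config with use_mla set to false looks up an mha_* key, so the Option 1 registry would fail with a KeyError while Option 2 would resolve to DeepseekV2FlashAttention2. Whether that class then behaves correctly in MHA mode is a separate question.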

prithivMLmods

Nov 4, 2025 (edited Nov 4, 2025)

@bigpappic @mingyi456 @laxmareddyp @yayoimizuha
Hey guys! Hope this post helps with the compatibility issues.

Post: https://huggingface.co/posts/prithivMLmods/374605520852651
Demo: https://huggingface.co/spaces/prithivMLmods/DeepSeek-OCR-experimental

transformers==4.57.1
torch
einops
addict
easydict
matplotlib

import os
import torch
import requests
from transformers import AutoModel, AutoTokenizer
from typing import Iterable

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "prithivMLmods/DeepSeek-OCR-Latest-BF16.I64" # - (https://huggingface.co/deepseek-ai/DeepSeek-OCR)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
).to(device).eval()

Colab: https://huggingface.co/datasets/strangervisionhf/model-infer-test/blob/main/DeepSeek_OCR_Latest_BF16_I64.ipynb

dhclly

Feb 12

@harpreetsahota Does it really use MLA by default? Over here it says "use_mla": false, and mapping "mha_flash_attention_2" to DeepseekV2FlashAttention2 still does not work for me, but I am not sure if it is an unrelated issue.

Does it use flash_attention_2 by default?

The DeepSeek-OCR sample code on GitHub:

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

mingyi456

Feb 12

Does it use flash_attention_2 by default?

The DeepSeek-OCR sample code on GitHub:

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

MLA and FA2 are independent of each other.

dhclly

Feb 12 (edited Feb 12)

Adding a monkey patch may help, but other errors still show up afterwards. The best approach GPT suggested to me is to edit the remote code and change the model's config.json to use the local code.

from transformers.models.llama import modeling_llama

if not hasattr(modeling_llama, "LlamaFlashAttention2"):
    class LlamaFlashAttention2(modeling_llama.LlamaAttention):
        pass
    modeling_llama.LlamaFlashAttention2 = LlamaFlashAttention2
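The same patch can be written as a small reusable helper. This is a minimal sketch with a hypothetical function name, `ensure_alias`, demonstrated on a stand-in module so it runs without Transformers installed. As with the snippet above, the alias is an empty subclass: it only lets the import succeed and does not bring back the removed flash-attention implementation.

```python
import types


def ensure_alias(module, missing_name, base_name):
    """Add `missing_name` to `module` as an empty subclass of `base_name`.

    Returns True if the alias was added, False if the name already existed.
    """
    if hasattr(module, missing_name):
        return False
    base = getattr(module, base_name)
    setattr(module, missing_name, type(missing_name, (base,), {}))
    return True


# Stand-in for transformers.models.llama.modeling_llama in a newer release.
fake_llama = types.ModuleType("fake_modeling_llama")


class LlamaAttention:
    pass


fake_llama.LlamaAttention = LlamaAttention

patched = ensure_alias(fake_llama, "LlamaFlashAttention2", "LlamaAttention")
```

With the real module, the helper would have to run before the model's remote code is imported, since that is when the failing import statement executes.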
