---
datasets:
- NeelNanda/pile-10k
base_model:
- deepseek-ai/DeepSeek-R1



---

## Model Details

This is an INT2 model with group_size 64 and symmetric quantization of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), generated by the [intel/auto-round](https://github.com/intel/auto-round) algorithm. Some layers fall back to 4 or 16 bits. See the "Generate the model" section for details of the mixed-bit settings.
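
To make "symmetric quantization with group_size 64" concrete, here is a minimal sketch in plain Python. It uses simple round-to-nearest within each group, not the tuned rounding that AutoRound learns. The function names and the scale convention (the largest magnitude in a group maps to the most negative level) are assumptions made for illustration only.

```python
def quantize_group_sym(group, bits=2):
    """Symmetric round-to-nearest quantization of one weight group (illustrative)."""
    qmin = -(2 ** (bits - 1))      # -2 for int2
    qmax = 2 ** (bits - 1) - 1     # +1 for int2
    max_abs = max(abs(w) for w in group)
    # Assumption: one scale per group and no zero point (symmetric).
    scale = max_abs / abs(qmin) if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in group]
    return q, scale


def quantize_sym(weights, group_size=64, bits=2):
    """Split a flat weight list into groups and quantize each group separately."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group_sym(weights[start:start + group_size], bits))
    return groups


def dequantize(groups):
    """Map integer levels back to floats using each group's scale."""
    return [level * scale for q, scale in groups for level in q]


# Tiny example with group_size=4 so the groups are easy to read.
weights = [0.4, -0.8, 0.1, 0.0, 0.3, 0.3, -0.6, 0.05]
groups = quantize_sym(weights, group_size=4, bits=2)
restored = dequantize(groups)
for q, scale in groups:
    print("levels:", q, "scale:", scale)
```

With only four levels per weight, small values inside a group collapse to zero, which is one reason the sensitive layers in this model are kept at 4 or 16 bits.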

Please follow the license of the original model. This model can **NOT** run on other serving frameworks.

## Accuracy

For INT2-mixed, we evaluated on CUDA with overflow protection. The CPU version is expected to be more accurate.

|               | BF16   | INT2-mixed |
| ------------- | ------ | ---------- |
| mmlu          | 0.8514 | 0.8302     |
| arc_challenge | 0.6212 | 0.6084     |
| hellaswag     | 0.6935 | 0.6657     |
| winogrande    | 0.7932 | 0.7940     |
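
As a quick way to read the table, the following sketch computes how much of the BF16 score the INT2-mixed model retains on each task. The numbers are copied from the table above; the variable names are illustrative.

```python
# (BF16, INT2-mixed) scores copied from the accuracy table.
scores = {
    "mmlu": (0.8514, 0.8302),
    "arc_challenge": (0.6212, 0.6084),
    "hellaswag": (0.6935, 0.6657),
    "winogrande": (0.7932, 0.7940),
}

# Ratio of the quantized score to the BF16 score for each task.
retention = {task: int2 / bf16 for task, (bf16, int2) in scores.items()}
average_retention = sum(retention.values()) / len(retention)

for task, ratio in retention.items():
    print(f"{task:14s} {ratio:.2%} of BF16")
print(f"{'average':14s} {average_retention:.2%} of BF16")
```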

## How To Use

### INT2 Inference on CUDA (4x80G)

Please note that INT2 **may be slower** than INT4 on CUDA due to a kernel issue.


~~~python

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

import torch

quantized_model_dir = "OPEA/DeepSeek-R1-int2-mixed-sym-inc"

# Use device_map='auto' directly if you have enough GPUs; otherwise split the 61 layers manually
device_map = {"model.norm": 0, "lm_head": 0, "model.embed_tokens": 0}
for i in range(61):
    name = "model.layers." + str(i)
    if i < 15:
        device_map[name] = 0
    elif i < 30:
        device_map[name] = 1
    elif i < 45:
        device_map[name] = 2
    else:
        device_map[name] = 3

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
)

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",
    "如果你是人，你最想做什么“",
    "How many e in word deepseek",
    "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]

texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=500,  ##change this to align with the official usage
    num_return_sequences=1,
    do_sample=False  ##change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]

decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)

~~~
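
The hand-written `device_map` in the CUDA example can also be generated for a different number of GPUs. The helper below is a sketch, not part of auto-round or transformers: `build_device_map` is a hypothetical name, and it assumes the same module names (`model.layers.N`, `model.norm`, `lm_head`, `model.embed_tokens`) and the same 61 layers as the example. With the defaults it reproduces the 15/15/15/16 split used there.

```python
def build_device_map(num_layers=61, num_gpus=4):
    """Spread decoder layers evenly over GPUs; the last GPU takes the remainder."""
    # Embeddings, final norm and lm_head stay on GPU 0, as in the example.
    device_map = {"model.norm": 0, "lm_head": 0, "model.embed_tokens": 0}
    layers_per_gpu = max(1, num_layers // num_gpus)
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = min(i // layers_per_gpu, num_gpus - 1)
    return device_map


device_map = build_device_map()
print({gpu: sum(1 for name, d in device_map.items()
                if name.startswith("model.layers.") and d == gpu)
       for gpu in range(4)})
```

The resulting dictionary can be passed as `device_map=` to `from_pretrained` in the same way as the manual one.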
### INT2 Inference on CPU

Requirements  

~~~bash
pip install auto-round
pip uninstall intel-extension-for-pytorch
pip install intel-extension-for-transformers
~~~

**Inference will be quite slow if the CPU does not support AVX-512.**
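
One way to check for AVX-512 before loading the model is to look at the CPU flags. The sketch below assumes a Linux x86 machine where `/proc/cpuinfo` lists flags such as `avx512f` (the AVX-512 foundation flag); `has_avx512` is a hypothetical helper and other platforms need a different check.

```python
from pathlib import Path


def has_avx512(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx512f" in value.split():
            return True
    return False


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print("AVX-512 available:", has_avx512(cpuinfo.read_text()))
else:
    print("No /proc/cpuinfo on this system; check the CPU flags another way.")
```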



~~~python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig ##must import for auto-round format

#  https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules


transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules

import torch

quantized_model_dir = "OPEA/DeepSeek-R1-int2-mixed-sym-inc"


quantization_config = AutoRoundConfig(
    backend="cpu",
)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cpu",
    quantization_config=quantization_config
)



tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",
    "如果你是人，你最想做什么“",
    "How many e in word deepseek",
    "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?",
]

texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=512,  ##change this to align with the official usage
    num_return_sequences=1,
    do_sample=False  ##change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]

decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
    
"""
Prompt: 9.11和9.8哪个数字大
Generated: <think>
首先，我需要比较两个数字：9.11和9.8。

首先，比较整数部分。两个数字的整数部分都是9，所以整数部分相同。

接下来，比较小数部分。9.11的小数部分是0.11，而9.8的小数部分是0.8。

由于0.8大于0.11，因此9.8的小数部分更大。

综合整数部分和小数部分的结果，可以确定9.8大于9.11。
</think>

要比较两个数字 **9.11** 和 **9.8** 的大小，可以按照以下步骤进行：

1. **比较整数部分**：
   - 两个数字的整数部分都是 **9**，因此整数部分相同。

2. **比较小数部分**：
   - **9.11** 的小数部分是 **0.11**
   - **9.8** 的小数部分是 **0.8**（即 **0.80**）

3. **比较小数部分的大小**：
   - **0.80** 大于 **0.11**，因此 **9.8** 的小数部分更大。

**结论**：由于整数部分相同且 **9.8** 的小数部分更大，因此 **9.8** 大于 **9.11**。

\boxed{9.8}

--------------------------------------------------
Prompt: 如果你是人，你最想做什么“
Generated: <think>
嗯，如果我是人，我最想做什么呢？这个问题挺有意思的。首先，我需要理解“人”在这里指的是什么。可能是指拥有自主意识、情感和自由意志的人类。那么，作为一个人，我最想做的事情可能和人类通常追求的东西有关，比如幸福、成就、人际关系、自我实现等等。

首先，幸福可能是大多数人的追求。所以，如果我是人，我可能会追求让自己感到快乐和满足的事情。这可能包括从事自己喜欢的活动，比如艺术、音乐、运动，或者帮助他人，因为帮助他人也能带来满足感。

其次，自我实现也是一个重要的方面。根据马斯洛的需求层次理论，自我实现是最高层次的需求，指的是实现个人潜能、追求个人成长和高峰体验。所以，如果我是人，我可能会追求在某个领域达到卓越，比如成为科学家、艺术家、作家，或者在其他专业领域有所建树。

另外，人际关系也是人类生活的重要组成部分。建立和维护亲密的关系，如家庭、朋友和伴侣关系，可能会是重要的目标。作为人，我可能会努力培养这些关系，寻找爱和归属感。

还有，探索和好奇心也是人类的驱动力。如果我是人，可能会对世界充满好奇，想要探索不同的文化、科学、技术，或者旅行到不同的地方，体验不同的生活方式。

不过，这些想法可能受到个人价值观、文化背景和生活经历的影响。不同的人可能有不同的优先事项。例如，有些人可能更注重物质财富，而另一些人则更重视精神层面的满足。此外，个人的兴趣和才能也会影响他们的选择。比如，一个对技术感兴趣的人可能会追求科技创新，而一个热爱艺术的人可能会投身于创作。

另外，作为人，可能还会面临各种挑战和困难，比如经济压力、健康问题、社会压力等。因此，在追求目标的过程中，也需要应对这些挑战，这可能涉及到解决问题、适应变化和克服障碍。

还有，道德和伦理也是需要考虑的因素。作为人，可能会思考自己的行为对他人和社会的影响，努力做出符合道德的选择，比如环保、社会正义等。

总结一下，如果我是人，我最想做的事情可能包括追求幸福、自我实现、建立有意义的人际关系、探索世界、应对挑战，并遵循道德准则。当然，具体的目标和愿望可能会因人而异，取决于个人的价值观、兴趣和环境。
</think>

如果我是人，我最想做的事情将围绕以下几个核心方面展开：

1. **

--------------------------------------------------

Prompt: How many e in word deepseek
Generated: <think>
Okay, so I need to figure out how many times the letter 'e' appears in the word "deepseek". Let me start by writing down the word to visualize it better. The word is D, E, E, P, S, E, E, K. Let me count each letter one by one.

Starting with the first letter: D. That's not an 'e', so I can skip that. The next letter is E. That's one 'e'. The third letter is another E, so that's two. Then comes P, which isn't an 'e'. After P is S, also not an 'e'. Then the next letter is E again, making it three. The following letter is another E, bringing the count to four. Finally, the last letter is K, which isn't an 'e'.

Wait, let me check again to make sure I didn't miss any. D, E, E, P, S, E, E, K. So positions 2, 3, 6, and 7 are all 'e's. That's four in total. Hmm, but sometimes people might get confused if there are uppercase or lowercase letters, but since the word is written as "deepseek", all letters are lowercase, so that shouldn't be an issue.

I think that's it. The letter 'e' appears four times in "deepseek".
</think>

The letter 'e' appears 4 times in the word "deepseek".

**Step-by-Step Explanation:**

1. **Write out the word:** D, E, E, P, S, E, E, K.
2. **Identify each letter and count the 'e's:**
   - Position 1: D (not 'e')
   - Position 2: E (count = 1)
   - Position 3: E (count = 2)
   - Position 4: P (not 'e')
   - Position 5: S (not 'e')
   - Position 6: E (count = 3)
   - Position 7: E (count = 4)
   - Position 8: K (not 'e')
3. **Total count of 'e's:** 4

**Answer:** There are 4 'e's in the word "deepseek".

--------------------------------------------------
Prompt: There are ten birds in a tree. A hunter shoots one. How many are left in the tree?
Generated: <think>
Okay, so there's this problem here: "There are ten birds in a tree. A hunter shoots one. How many are left in the tree?" Hmm, at first glance, it seems straightforward, but I remember sometimes these kinds of questions have a trick to them. Let me think through this step by step.

Alright, starting with the basics. There are ten birds in the tree. If a hunter shoots one, the immediate thought is to subtract one from ten, which would leave nine birds. But wait, maybe there's more to it. I've heard similar riddles where the answer isn't just a simple subtraction. For example, sometimes when a bird is shot, the other birds might fly away because of the noise. So, if the hunter shoots one bird, the rest might get scared and leave the tree. In that case, there would be zero birds left. But the problem doesn't explicitly say that the other birds fly away. It just says the hunter shoots one. So, do we assume the others stay, or do we assume they flee?

Let me consider both possibilities. If we take the problem literally, the hunter shoots one bird, so that one is either dead or injured and falls out of the tree. The remaining birds would then be ten minus one, which is nine. But if the gunshot scares the other birds, they might all fly away immediately. In that scenario, even though the hunter only shot one, the rest are startled and leave, resulting in zero birds remaining.

Which interpretation is correct? The problem doesn't specify whether the other birds are scared by the gunshot. It's a bit ambiguous. In typical riddles like this, the answer is often zero because the noise would cause the other birds to fly off. But if we're being strictly mathematical and not considering the behavior of the birds, it would be nine. However, since this is presented as a riddle, it's more likely expecting the answer that considers the behavior of the birds, leading to zero remaining.

Wait, but let me check if there's another angle. Maybe the question is testing something else. For example, if the hunter shoots one bird, but misses, then all the birds might stay. But the problem says the hunter shoots one, which implies that the shot was successful. So the bird is hit. If the hunter hits one bird, that bird is

"""

~~~



### Generate the model

**1. Add metadata to the BF16 model** https://huggingface.co/opensourcerelease/DeepSeek-R1-bf16

~~~python
import safetensors
from safetensors.torch import save_file
 
for i in range(1, 164):
    idx_str = f"{i:05d}"  # zero-pad the shard index to five digits
    safetensors_path = f"model-{idx_str}-of-000163.safetensors"
    print(safetensors_path)
    tensors = dict()
    with safetensors.safe_open(safetensors_path, framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    save_file(tensors, safetensors_path, metadata={'format': 'pt'})
~~~
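
To confirm that the metadata was written, the header of each shard can be inspected without loading any tensors. This sketch relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header whose optional `__metadata__` key holds the metadata). The helper names are hypothetical, and the shard naming follows the script above.

```python
import json
import os
import struct


def read_safetensors_metadata(path):
    """Return the `__metadata__` dict from a safetensors header, or {} if absent."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})


def shard_names(num_shards=163):
    """File names of all shards, matching the naming used in the script above."""
    return [f"model-{i:05d}-of-{num_shards:06d}.safetensors"
            for i in range(1, num_shards + 1)]


# Report the metadata of every shard found in the current directory.
for name in shard_names():
    if os.path.exists(name):
        print(name, read_safetensors_metadata(name))
```

After step 1, every shard should report `{'format': 'pt'}`.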



**2. Remove `torch.no_grad`** from modeling_deepseek.py, since AutoRound needs gradients for tuning.

https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py


5 x 80 GB GPUs and 1.4T-1.6T of memory are required.

~~~python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

#  https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules


transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules

model_name = "opensourcerelease/DeepSeek-R1-bf16"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")

block = model.model.layers
device_map = {}

for n, m in block.named_modules():
    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63:
            device = "cuda:1"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 63 and int(
                n.split('.')[-2]) < 128:
            device = "cuda:2"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 128 and int(
                n.split('.')[-2]) < 192:
            device = "cuda:3"
        elif "experts" in n and ("shared_experts" not in n) and int(
                n.split('.')[-2]) >= 192:
            device = "cuda:4"
        else:
            device = "cuda:0"
        n = n[2:]

        device_map.update({n: device})

from auto_round import AutoRound

layer_config = {}
for n, m in model.named_modules():
    if not isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        continue
    if "experts" not in n:
        # Non-expert linear layers fall back to 4 bits
        layer_config[n] = {"bits": 4, "group_size": 128}
    if "experts" in n and "shared_experts" in n:
        # Shared experts also fall back to 4 bits
        layer_config[n] = {"bits": 4, "group_size": 128}
    # Handle the first 3 layers: keep them at 4 bits
    name_splits = n.split('.')
    if len(name_splits) >= 3 and int(name_splits[2]) < 3:
        layer_config[n] = {"bits": 4, "group_size": 128}
    # Keep down_proj of the last layer (index 60) at 16 bits
    if len(name_splits) >= 3 and int(name_splits[2]) == 60 and "down_proj" in n:
        layer_config[n] = {"bits": 16}


layer_config["lm_head"] = {"bits": 16}
autoround = AutoRound(model=model, tokenizer=tokenizer, device_map=device_map, bits=2, group_size=64,
                      iters=400, batch_size=4, seqlen=512, nsamples=512, enable_torch_compile=False,
                      layer_config=layer_config)
autoround.quantize()
autoround.save_quantized(format="auto_round", output_dir="tmp_autoround")

~~~
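
The mixed-bit rules in the script above can be summarized as a small pure function. This is a sketch for reading the configuration, not code from auto-round: `bits_for_layer` is a hypothetical helper, the example layer names are assumed to follow the `model.layers.N...` pattern, and the `isdigit` guard is an addition that the original script does not have.

```python
def bits_for_layer(name, default_bits=2, default_group_size=64):
    """Mirror the layer_config rules of the script above for one linear layer name."""
    if name == "lm_head":
        return {"bits": 16}
    # Routed experts keep the AutoRound defaults (bits=2, group_size=64).
    config = {"bits": default_bits, "group_size": default_group_size}
    # Non-expert layers and shared experts fall back to 4 bits.
    if "experts" not in name or "shared_experts" in name:
        config = {"bits": 4, "group_size": 128}
    parts = name.split(".")
    is_decoder_layer = len(parts) >= 3 and parts[2].isdigit()
    # Every linear layer in the first 3 decoder layers uses 4 bits.
    if is_decoder_layer and int(parts[2]) < 3:
        config = {"bits": 4, "group_size": 128}
    # down_proj in the last decoder layer (index 60) stays at 16 bits.
    if is_decoder_layer and int(parts[2]) == 60 and "down_proj" in name:
        config = {"bits": 16}
    return config


examples = [
    "model.layers.10.mlp.experts.5.gate_proj",
    "model.layers.10.mlp.shared_experts.up_proj",
    "model.layers.10.self_attn.o_proj",
    "model.layers.1.mlp.down_proj",
    "model.layers.60.mlp.experts.7.down_proj",
    "lm_head",
]
for name in examples:
    print(name, bits_for_layer(name))
```

Later rules override earlier ones, which matches the order of the `if` statements in the original loop.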



## Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

- Intel Neural Compressor [link](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)