---
datasets:
- NeelNanda/pile-10k
base_model:
- deepseek-ai/DeepSeek-R1
---

## Model Details

This model is an int4 model with group_size 64 and asymmetric quantization of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), generated with the [intel/auto-round](https://github.com/intel/auto-round) algorithm.

Please follow the license of the original model.
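
As a rough intuition for what group-wise asymmetric quantization means, here is a minimal, illustrative sketch in plain Python (hypothetical helper name; AutoRound itself tunes the rounding via signed gradient descent rather than using the naive round-to-nearest shown here):

```python
def quantize_group_asym_int4(values, group_size=64):
    """Naive asymmetric int4 quantization: one (scale, zero_point) per group of values."""
    qmax = 15  # unsigned 4-bit range [0, 15]; the zero point is what makes it asymmetric
    dequantized = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero_point = round(-lo / scale)
        # Quantize to [0, 15], then dequantize to inspect the rounding error.
        q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in group]
        dequantized.append([(x - zero_point) * scale for x in q])
    return dequantized

print(quantize_group_asym_int4([0.0, 1.0, 2.0, 3.0], group_size=4)[0])
```

Each group of 64 weights stores its own scale and zero point, which is what `group_size 64` refers to above.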

## How To Use

### INT4 vLLM Inference on CUDA (requires at least 8x 80GB GPUs)

To serve the model with vLLM across 8x 80GB GPUs, use the following command:

```sh
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --max-num-batched-tokens 65536 --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.97 --dtype bfloat16 --served-model-name deepseek-reasoner --model OPEA/DeepSeek-R1-int4-asym-AutoRound-awq
```

You can download the prebuilt wheel for PyTorch 2.6 and Python 3.12 [here](https://huggingface.co/x2ray/wheels/resolve/main/vllm-0.7.3.dev187%2Bg0ff1a4df.d20220101.cu126-cp312-cp312-linux_x86_64.whl).

~~~python
import requests

url = "http://localhost:12345/v1/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "deepseek-reasoner",
    "prompt": "请用我给你的4个数字,通过加、减、乘、除、括号,组成一个运算,使得结果为24。注意:数字需要全部使用我提供的数字,4468",
    "max_tokens": 512
}

response = requests.post(url, json=data, headers=headers)
print(response.json()["choices"][0]["text"])

"""
|
| 42 |
+
prompt: 一个汉字具有左右结构,左边是木,右边是乞,这个字是什么字,只需要回答这个字即可。
|
| 43 |
+
|
| 44 |
+
好
|
| 45 |
+
|
| 46 |
+
判断一个字的左右结构中左边是“木”、右边是“乞”的汉字是否存在,需要通过查阅现代汉语词典和汉字结构分析来确认。
|
| 47 |
+
|
| 48 |
+
首先,根据字形结构拆分左边的“木”和右边的“乞”,假设存在这样的组合的话,我们需要确认这个字是否被收录进《现代汉语词典》或《新华字典》中。通过初步检索,例如使用在线字典查询工具,“
|
| 49 |
+
”字旁的字如“杨”、“林”等,但带有“乞”作为右边部分的字较不常见。
|
| 50 |
+
|
| 51 |
+
可能的情形包括:“仡”(yì,左右结构但右边是“乞”,左边是“人”);或者可能存在生僻字或古代变体,但常规现代汉字中可能不存在。也可能存在输入法误录或地方方言中的特殊写法,但标准规范
|
| 52 |
+
字中未必收录。
|
| 53 |
+
|
| 54 |
+
因此,在标准现代汉字中,似乎没有符合左边是“木”,右边是“乞”的汉字。因此可以初步认为这个字并不存在,或者是某种误写、生僻字、异体字未被广泛接受的情况。需要进一步的专业工具或古籍
|
| 55 |
+
献查证是否曾有该字存在。如果按照要求只需要回答这个字,而在不存在的情况下可能需要指出不存在。但按照题目要求可能需要提供一个可能的答案,因此可能需要考虑是不是“杚”(gài,用手取
|
| 56 |
+
的一种动作,但通常右边部分是否“乞”需要核对具体字形差异)。
|
| 57 |
+
--------------------------------------------------
|
| 58 |
+
|
| 59 |
+
prompt: 请用我给你的4个数字,通过加、减、乘、除、括号,组成一个运算,使得结果为24。注意:数字需要全部使用我提供的数字,4468
|
| 60 |
+
|
| 61 |
+
,请用代码解决这个问题。不要依赖其他人的解法,独自思考给出答案。
|
| 62 |
+
|
| 63 |
+
嗯,我现在需要解决这个问题:用数字4、4、6、8,通过加、减、乘、除以及括号,组合成一个表达式,得到结果24。这四个数都必须使用到。我记得怎么解这种24点的问题呢,通常需要尝试不同的
|
| 64 |
+
算组合,然后检查结果是否正确。
|
| 65 |
+
|
| 66 |
+
首先,我会先考虑这四个数是否能直接通过简单的加减乘除组合得到24。比如,先看看是否能将较大的两个数相乘,比如8×6是48,然后看看剩下的两个数如何调整。比如说,48除以(4-4/4),或者
|
| 67 |
+
似的。不过我这个组合并没有用全四个数字,因为用了两个4,可能是不是正确?
|
| 68 |
+
|
| 69 |
+
或者想是否有办法将大的数拆解。比如用乘法将较小的数凑出较大的数。比如,8×(6 - (4/4))。这里用了8×(6 -1)也就是8×5等于40,这不够。
|
| 70 |
+
|
| 71 |
+
另一个可能的尝试是:有没有能把其中一个数除以后再相乘得到大的数。比如,6×8×(4/4)。这样的话,6×8=48,4/4=1,48×1=48,不对。或者,6×(8×(4/4))同样不行。
|
| 72 |
+
|
| 73 |
+
再试试用减法。比如,8×4 - 6×4,这样的话,得到的是32-24=8,也不对。
|
| 74 |
+
|
| 75 |
+
那可能需要混合运算,例如用加法和乘法共同作用。例如,(4×4)+ 8 +6= 16+8+6=30,不够24。那这样的话可能并不合适。
|
| 76 |
+
|
| 77 |
+
或者试着用括号,组合运算顺序。比如,先处理4和4,比如4+4=8,那么我们有8、6、8。然后这三个数如何组合?8×(6×(8/(8))?), 这样反而有点像复杂的步骤,似乎难��得出24。
|
| 78 |
+
|
| 79 |
+
换一种思路,可能可以使用分数或者除法。比如,6 ÷ ( (4/4) ) × 8?4除以4等于1,6除以1等于6,再乘以8得到48,不够。
|
| 80 |
+
|
| 81 |
+
或者(8 - (6/4)) ×4?计算的话,先做6/4=1.5,那么8-1.5=6.5,6.5×4=26,还不够。
|
| 82 |
+
|
| 83 |
+
或者考虑减法结合乘法,比如,(8 -4) ×6 + 4×0?这显然不行
|
| 84 |
+
--------------------------------------------------
|
| 85 |
+
|
| 86 |
+
prompt: How should I explain the Internet?
|
| 87 |
+
|
| 88 |
+
The most common association people have with the Internet is a quick email reply or a Google search. The Internet was once new and now it is completely normal and in some cases even taken for granted. Therefore, before covering how one can demystify this complex computer network, first let me answer a basic starting question: what exactly is the Internet?
|
| 89 |
+
|
| 90 |
+
The Internet is a global network of interconnected computers that communicate through standard protocols. These protocols, such as TCP/IP, are sets of rules that allow devices to exchange data. Think of the Internet as a vast system of roads and highways where information travels instead of cars. Each device, like a computer or smartphone, is like a vehicle that can navigate this network to access websites, send emails, or stream videos. The information is broken into packets, which are like individual carriages on a train, each carrying a piece of data. Routers and servers act as traffic controllers, directing these packets to their destinations efficiently. While the infrastructure is highly technical, the Internet’s purpose is to facilitate communication and information sharing on an unprecedented scale. Its architecture is designed for redundancy and resilience, ensuring that even if parts of the network fail, the rest can continue operating.
|
| 91 |
+
|
| 92 |
+
Now that we have a basic understanding of the Internet, let's explore how to explain its components, structure, and functionality in a comprehensible way.
|
| 93 |
+
|
| 94 |
+
## Key Components of the Internet
|
| 95 |
+
|
| 96 |
+
### 1. Devices and Endpoints
|
| 97 |
+
|
| 98 |
+
At the most basic level, the Internet consists of devices like computers, smartphones, and servers. These are the entry and exit points where data originates or is consumed.
|
| 99 |
+
|
| 100 |
+
### 2. Internet Service Providers (ISPs)
|
| 101 |
+
|
| 102 |
+
ISPs are the companies that provide access to the Internet. They maintain the infrastructure, such as fiber-optic cables, that connects individual devices to the broader network.
|
| 103 |
+
|
| 104 |
+
### 3. Data Centers and Servers
|
| 105 |
+
|
| 106 |
+
Servers store the data that makes up websites, apps, and services. Data centers are facilities housing numerous servers, often managed by companies like Google, Amazon, or Microsoft.
|
| 107 |
+
|
| 108 |
+
### 4. Protocols
|
| 109 |
+
|
| 110 |
+
Protocols like TCP/IP (Transmission Control Protocol/Internet Protocol) are the rules governing how data is sent and received. They ensure that different devices can communicate effectively.
|
| 111 |
+
|
| 112 |
+
## Structure of the Internet
|
| 113 |
+
|
| 114 |
+
### 1. Physical Infrastructure
|
| 115 |
+
|
| 116 |
+
The Internet isn't just an abstract concept; it has a physical presence. This includes undersea cables, satellites, routers, and data centers spread across the globe.
|
| 117 |
+
|
| 118 |
+
### 2. IP Addresses
|
| 119 |
+
|
| 120 |
+
Every device connected to the Internet has a
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
"""
|
| 124 |
+
~~~
|

### INT4 Inference on CPU

Requirements:

~~~bash
pip install auto-round
pip uninstall intel-extension-for-pytorch
pip install intel-extension-for-transformers
~~~

A CPU inference example will be added later.

### Evaluate the model

Install the evaluation harness with `pip3 install lm-eval==0.4.8`, then run:

```bash
lm-eval --model hf --model_args pretrained=OPEA/DeepSeek-R1-int4-asym-AutoRound-awq --tasks lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu --batch_size 16
```

| Metric         | FP8    | INT4 (BF16) |
| :------------- | :----- | :---------- |
| avg            | 0.6954 | 0.6963      |
| mmlu           | 0.8514 | 0.8485      |
| lambada_openai | 0.7902 | 0.7809      |
| hellaswag      | 0.6935 | 0.6883      |
| winogrande     | 0.7932 | 0.8011      |
| piqa           | 0.8308 | 0.8292      |
| truthfulqa_mc1 | 0.4064 | 0.4051      |
| openbookqa     | 0.3780 | 0.3940      |
| boolq          | 0.8856 | 0.8813      |
| arc_easy       | 0.8598 | 0.8594      |
| arc_challenge  | 0.6212 | 0.6271      |

### Generate the model

**Step 1: add metadata to the BF16 model** ([opensourcerelease/DeepSeek-R1-bf16](https://huggingface.co/opensourcerelease/DeepSeek-R1-bf16))

~~~python
import safetensors
from safetensors.torch import save_file

# Rewrite every shard with the `format: pt` metadata that some loaders require.
for i in range(1, 164):
    idx_str = str(i).zfill(5)  # zero-pad to 5 digits, e.g. "00001"
    safetensors_path = f"model-{idx_str}-of-000163.safetensors"
    print(safetensors_path)
    tensors = dict()
    with safetensors.safe_open(safetensors_path, framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    save_file(tensors, safetensors_path, metadata={'format': 'pt'})
~~~
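
As a quick sanity check (illustrative, plain Python), the zero-padded names produced by the loop above should match the 163-shard layout of the BF16 checkpoint:

```python
# Reproduce the shard-name pattern from the metadata loop above.
shard_names = [f"model-{str(i).zfill(5)}-of-000163.safetensors" for i in range(1, 164)]
print(shard_names[0], shard_names[-1])
```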

**Step 2: remove `torch.no_grad`** in modeling_deepseek.py, as AutoRound needs gradients for tuning.

~~~python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers


# https://github.com/huggingface/transformers/pull/35493
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag in all submodules of a given model when all its weights are in the loaded state
    dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking if the root module is loaded there's no need to prepend module_name.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules


transformers.modeling_utils.set_initialized_submodules = set_initialized_submodules

model_name = "opensourcerelease/DeepSeek-R1-bf16"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto")

block = model.model.layers
device_map = {}

# Spread the routed experts across cuda:1-4 by expert index; everything else stays on cuda:0.
for n, m in block.named_modules():
    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
        if "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) < 63:
            device = "cuda:1"
        elif "experts" in n and ("shared_experts" not in n) and 63 <= int(n.split('.')[-2]) < 128:
            device = "cuda:2"
        elif "experts" in n and ("shared_experts" not in n) and 128 <= int(n.split('.')[-2]) < 192:
            device = "cuda:3"
        elif "experts" in n and ("shared_experts" not in n) and int(n.split('.')[-2]) >= 192:
            device = "cuda:4"
        else:
            device = "cuda:0"
        n = n[2:]
        device_map.update({n: device})

from auto_round import AutoRound

autoround = AutoRound(model=model, tokenizer=tokenizer, device_map=device_map, nsamples=512,
                      batch_size=4, low_gpu_mem_usage=True, seqlen=2048, group_size=64, sym=False)
autoround.quantize()
autoround.save_quantized(format="auto_awq", output_dir="tmp_autoround")
~~~

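
To make the expert-sharding rule in the loop above easier to follow, here is the same device-selection logic isolated as a small function (hypothetical module names for illustration; thresholds copied from the script):

```python
def pick_device(name):
    """Route non-shared expert submodules to cuda:1-4 by expert index; everything else to cuda:0."""
    if "experts" in name and "shared_experts" not in name:
        expert_idx = int(name.split('.')[-2])  # e.g. "3.mlp.experts.70.gate_proj" -> 70
        if expert_idx < 63:
            return "cuda:1"
        if expert_idx < 128:
            return "cuda:2"
        if expert_idx < 192:
            return "cuda:3"
        return "cuda:4"
    return "cuda:0"

print(pick_device("3.mlp.experts.70.gate_proj"))
print(pick_device("3.mlp.shared_experts.gate_proj"))
```

Shared experts and attention projections stay on cuda:0 with the rest of the non-expert weights, which keeps the calibration forward pass balanced across the five GPUs.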
## Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on for factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, this model could generate lewd, biased, or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

- Intel Neural Compressor [link](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)