---
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
library_name: transformers
license: llama3.1
tags:
- deepseek
- transformers
- llama
- llama-3
- meta
- GGUF
---

# DeepSeek-R1-Distill-Llama-8B-NexaQuant

<div align="center">
  <img src="banner.png" width="80%" alt="NexaQuant" />
</div>

## Introduction

DeepSeek-R1 has been making headlines for rivaling OpenAI's o1 reasoning model while remaining fully open-source. Many users want to run it locally to ensure data privacy, reduce latency, and maintain offline access. However, fitting such a large model onto personal devices typically requires quantization (e.g., Q4_K_M), which often sacrifices accuracy (up to ~22% accuracy loss) and undermines the benefits of a local reasoning model.

We've solved this trade-off by quantizing the DeepSeek-R1 distilled model to one-fourth its original file size without losing accuracy. Tests on an **HP OmniBook AI PC** with an **AMD Ryzen™ AI 9 HX 370 processor** showed a decoding speed of **17.20 tokens per second** and peak RAM usage of just **5017 MB** for the NexaQuant version, compared to only **5.30 tokens per second** and **15564 MB of RAM** for the unquantized version, while NexaQuant **maintains full-precision model accuracy**.
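In ratio terms, the figures above work out as follows (simple arithmetic on the quoted numbers, nothing re-measured here):

```python
# Figures quoted above for the HP OmniBook (AMD Ryzen AI 9 HX 370) test.
nexaquant_tps, unquantized_tps = 17.20, 5.30    # decode speed, tokens/s
nexaquant_ram, unquantized_ram = 5017, 15564    # peak RAM, MB

speedup = nexaquant_tps / unquantized_tps       # ~3.25x faster decoding
ram_reduction = unquantized_ram / nexaquant_ram # ~3.10x less peak RAM
print(f"{speedup:.2f}x faster decoding, {ram_reduction:.2f}x less peak RAM")
```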

## NexaQuant Use Case Demo

Here's a comparison of how a standard Q4_K_M model and NexaQuant-4Bit handle a common investment banking brain teaser. NexaQuant excels in accuracy while shrinking the model file size by a factor of four.


**Prompt:** A common investment banking brain teaser question

A stick is broken into 3 parts by choosing 2 points randomly along its length. What is the probability that the parts can form a triangle?

**Right Answer:** 1/4


<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/jOtgsAnr6nttS0mnu0snZ.png" width="80%" alt="Example" />
</div>
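The 1/4 answer is easy to sanity-check numerically: the three pieces form a triangle exactly when no piece is longer than half the stick. A quick Monte Carlo sketch of the brain teaser itself (this is just a check of the math, not part of the model benchmark):

```python
import random

def can_form_triangle(a: float, b: float) -> bool:
    """Cut a unit stick at points a and b; a triangle forms iff every
    piece is shorter than half the stick (triangle inequality)."""
    lo, hi = min(a, b), max(a, b)
    return max(lo, hi - lo, 1.0 - hi) < 0.5

def estimate_probability(trials: int = 200_000, seed: int = 42) -> float:
    rng = random.Random(seed)  # seeded for reproducibility
    hits = sum(can_form_triangle(rng.random(), rng.random()) for _ in range(trials))
    return hits / trials

print(estimate_probability())  # close to 0.25, matching the exact answer of 1/4
```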


## Benchmarks

The benchmarks show that NexaQuant’s 4-bit model preserves the reasoning capacity of the original 16-bit model, delivering uncompromised performance in a significantly smaller memory & storage footprint.

**Reasoning Capacity:**
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/pJzYVGTdWWvLn2MJtsD_d.png" width="80%" alt="Example" />
</div>

**General Capacity:**

| Benchmark                  | Full 16-bit | llama.cpp (4-bit) | NexaQuant (4-bit)|
|----------------------------|------------|-------------------|-------------------|
| **HellaSwag**              | 57.07      | 52.12             | 54.56             |
| **MMLU**                   | 55.59      | 52.82             | 54.94             |
| **ARC Easy**               | 74.49      | 69.32             | 71.72             |
| **MathQA**                 | 35.34      | 30.00             | 32.46             |
| **PIQA**                   | 78.56      | 76.09             | 77.68             |
| **IFEval**                 | 36.26      | 35.35             | 34.12             |
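One way to read the table: average the quantized-to-full score ratio across the six benchmarks (simple arithmetic on the numbers above):

```python
# Scores copied from the table above, in row order.
full     = [57.07, 55.59, 74.49, 35.34, 78.56, 36.26]  # full 16-bit
llama_q4 = [52.12, 52.82, 69.32, 30.00, 76.09, 35.35]  # llama.cpp (4-bit)
nexa_q4  = [54.56, 54.94, 71.72, 32.46, 77.68, 34.12]  # NexaQuant (4-bit)

def mean_retention(quant, ref):
    """Average fraction of the full-precision score retained."""
    return sum(q / r for q, r in zip(quant, ref)) / len(ref)

print(f"llama.cpp Q4: {mean_retention(llama_q4, full):.1%}")  # ~93.1%
print(f"NexaQuant Q4: {mean_retention(nexa_q4, full):.1%}")   # ~95.9%
```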

## Run locally

NexaQuant is compatible with **Nexa-SDK**, **Ollama**, **LM Studio**, **llama.cpp**, and any llama.cpp-based project. Below, we outline several ways to run the model locally.

#### Option 1: Using Nexa SDK

**Step 1: Install Nexa SDK**

Follow the installation instructions in Nexa SDK's [GitHub repository](https://github.com/NexaAI/nexa-sdk).

**Step 2: Run the model with Nexa**

Execute the following command in your terminal:
```bash
nexa run DeepSeek-R1-Distill-Llama-8B-NexaQuant:q4_0
```


#### Option 2: Using llama.cpp

**Step 1: Build llama.cpp on Your Device**

Follow the "Building the project" instructions in the llama.cpp [repository](https://github.com/ggerganov/llama.cpp) to build the project.

**Step 2: Run the Model with llama.cpp**

Once built, run `llama-cli` under `<build_dir>/bin/`:
```bash
./llama-cli \
    --model your/local/path/to/DeepSeek-R1-Distill-Llama-8B-NexaQuant \
    --prompt 'Provide step-by-step reasoning enclosed in <think> </think> tags, followed by the final answer enclosed in \boxed{} tags.'
```
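The prompt above asks the model to wrap its reasoning in `<think> </think>` tags and the final answer in `\boxed{}`. A minimal sketch for splitting such output downstream (the tag format is taken from the prompt; the helper name is ours, not part of any SDK):

```python
import re

def parse_r1_output(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, final_answer), assuming the
    think-tag / boxed-answer format requested in the prompt.
    Returns empty strings for parts that are missing."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    boxed = re.search(r"\\boxed\{([^}]*)\}", text)
    reasoning = think.group(1).strip() if think else ""
    answer = boxed.group(1).strip() if boxed else ""
    return reasoning, answer

sample = "<think>Order the cuts; a triangle needs every piece < 1/2.</think> The answer is \\boxed{1/4}."
print(parse_r1_output(sample))  # ('Order the cuts; a triangle needs every piece < 1/2.', '1/4')
```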


#### Option 3: Using LM Studio

**Step 1: Download and Install LM Studio**

Get the latest version from the [official website](https://lmstudio.ai/).

**Step 2: Load and Run the Model**

1. In LM Studio's top panel, search for and select `NexaAIDev/DeepSeek-R1-Distill-Llama-8B-NexaQuant`.  
2. Click `Download` (if not already downloaded) and wait for the model to load.  
3. Once loaded, go to the chat window and start a conversation.

---


## What's next

1. This model is built for complex problem-solving, which is why it can go through a long thinking process even for simple questions. We recognize this and are working on improving it in the next update.

2. Run inference with the NexaQuant DeepSeek-R1 distilled model on NPUs.
   
### Follow us

If you liked our work, feel free to ⭐Star [Nexa's GitHub Repo](https://github.com/NexaAI/nexa-sdk).

Interested in running DeepSeek R1 on your own devices with optimized CPU, GPU, and NPU acceleration or compressing your finetuned DeepSeek-Distill-R1? [Let’s chat!](https://nexa.ai/book-a-call)

[Blogs](https://nexa.ai/blogs/deepseek-r1-nexaquant) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai)

Join our Discord server for help and discussion.