---
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
library_name: transformers
license: llama3.1
tags:
- deepseek
- transformers
- llama
- llama-3
- meta
- GGUF
---

# DeepSeek-R1-Distill-Llama-8B-NexaQuant

<div align="center">
  <img src="banner.png" width="80%" alt="NexaQuant" />
</div>

## Background + Overview 

DeepSeek-R1 has been making headlines for rivaling OpenAI’s o1 reasoning model while remaining fully open-source. Many users want to run it locally to ensure data privacy, reduce latency, and maintain offline access. However, fitting such a large model onto personal devices typically requires quantization (e.g. Q4_K_M), which often sacrifices accuracy (a loss of up to ~22%) and undermines the benefits of a local reasoning model.

We’ve solved this trade-off by quantizing the DeepSeek R1 Distilled model to one-fourth of its original size without losing any accuracy. This lets you run powerful on-device reasoning wherever you are, with no compromises. In tests on an **HP Omnibook AIPC** with an **AMD Ryzen™ AI 9 HX 370 processor**, the NexaQuant version reached a decoding speed of **17.20 tokens per second** with a peak RAM usage of just **5017 MB**, compared to **5.30 tokens per second** and **15564 MB RAM** for the unquantized version, while **maintaining full-precision model accuracy.**
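To put the reported measurements in relative terms, the short Python sketch below derives the decoding speedup and the peak-RAM ratio from the four figures quoted above. The variable names are illustrative, and the inputs are simply the numbers from this model card, not new measurements.

```python
# Figures reported above for the HP Omnibook AIPC test (not re-measured here).
nexaquant_tokens_per_s = 17.20
unquantized_tokens_per_s = 5.30
nexaquant_peak_ram_mb = 5017
unquantized_peak_ram_mb = 15564

# Relative decoding speed and relative peak memory of the NexaQuant version.
speedup = nexaquant_tokens_per_s / unquantized_tokens_per_s
ram_ratio = nexaquant_peak_ram_mb / unquantized_peak_ram_mb

print(f"Decoding speedup: {speedup:.2f}x")
print(f"Peak RAM relative to unquantized: {ram_ratio:.1%}")
```

On these figures, decoding is a little over three times faster while peak RAM is roughly a third of the unquantized version's.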

## How to run locally

NexaQuant is compatible with **Nexa-SDK**, **Ollama**, **LM Studio**, **llama.cpp**, and any llama.cpp-based project. Below, we outline several ways to run the model locally.

#### Option 1: Using Nexa SDK

**Step 1: Install Nexa SDK**

Follow the installation instructions in Nexa SDK's [GitHub repository](https://github.com/NexaAI/nexa-sdk).

**Step 2: Run the model with Nexa**

Execute the following command in your terminal:
```bash
nexa run DeepSeek-R1-Distill-Llama-8B-NexaQuant:q4_0
```

#### Option 2: Using llama.cpp

**Step 1: Build llama.cpp on Your Device**

Follow the "Building the project" instructions in the llama.cpp [repository](https://github.com/ggerganov/llama.cpp) to build the project.

**Step 2: Run the Model with llama.cpp**

Once built, run `llama-cli` under `<build_dir>/bin/`:
```bash
./llama-cli \
    --model your/local/path/to/DeepSeek-R1-Distill-Llama-8B-NexaQuant \
    --prompt 'Provide step-by-step reasoning enclosed in <think> </think> tags, followed by the final answer enclosed in \boxed{} tags.'
```
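The prompt above asks the model to wrap its reasoning in `<think> </think>` tags and its final answer in `\boxed{}`. If you capture the output programmatically, a small parser can separate the two parts. The following Python sketch is one way to do that; `split_reasoning_and_answer` is a hypothetical helper, not part of llama.cpp or Nexa SDK, and it assumes the model followed the requested format and that the boxed answer contains no nested braces.

```python
import re
from typing import Optional, Tuple


def split_reasoning_and_answer(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) extracted from a model response.

    Either element is None when the corresponding marker is missing.
    """
    # Reasoning: everything between the first <think> and the next </think>.
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    # Answer: contents of the last \boxed{...} (nested braces are not handled).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)

    reasoning = think.group(1).strip() if think else None
    answer = boxed[-1].strip() if boxed else None
    return reasoning, answer


# Hand-written sample response, used only to illustrate the expected format.
sample = "<think>Each break adds one piece.</think> The answer is \\boxed{47}."
reasoning, answer = split_reasoning_and_answer(sample)
print(reasoning)
print(answer)
```

Taking the last `\boxed{}` match is a deliberate choice in this sketch, since a reasoning model may box intermediate results before settling on its final answer.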

#### Option 3: Using LM Studio

**Step 1: Download and Install LM Studio**

Get the latest version from the [official website](https://lmstudio.ai/).

**Step 2: Load and Run the Model**

1. In LM Studio's top panel, search for and select `NexaAIDev/DeepSeek-R1-Distill-Llama-8B-NexaQuant`.  
2. Click `Download` (if not already downloaded) and wait for the model to load.  
3. Once loaded, go to the chat window and start a conversation.

---

## Example

In the comparison below, the left side shows the response from LM Studio's Q4_K_M version and the right side shows the response from our NexaQuant version.

Prompt: A common investment banking brain teaser

There is a 6x8 rectangular chocolate bar made up of small 1x1 bits. We want to break it into the 48 bits. We can break one piece of chocolate horizontally or vertically, but cannot break two pieces together! What is the minimum number of breaks required?

Correct answer: 47 (each break adds exactly one piece, so going from 1 piece to 48 takes 47 breaks)

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66abfd6f65beb23afa427d8a/ZS9e66t7OhBIno4eQ3OaX.png" width="80%" alt="Example" />
</div>
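The answer to the brain teaser can be checked with a small simulation. The Python sketch below is an illustration added for this example: the function name `count_breaks` and the halving strategy are arbitrary choices, not something taken from either model's response. It keeps breaking pieces until only 1x1 bits remain and counts the breaks.

```python
def count_breaks(rows: int, cols: int) -> int:
    """Count the breaks needed to reduce a rows x cols bar to 1x1 bits."""
    pieces = [(rows, cols)]  # pieces that may still need breaking
    breaks = 0
    while pieces:
        r, c = pieces.pop()
        if r == 1 and c == 1:
            continue  # already a single bit, nothing to break
        if r > 1:
            # Horizontal break: split the rows into two smaller pieces.
            top = r // 2
            pieces.extend([(top, c), (r - top, c)])
        else:
            # Vertical break: split the columns into two smaller pieces.
            left = c // 2
            pieces.extend([(1, left), (1, c - left)])
        breaks += 1
    return breaks


print(count_breaks(6, 8))
```

Any other breaking order gives the same count, because an m x n bar always needs m*n - 1 breaks.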

## Benchmarks

NexaQuant on reasoning benchmarks, compared to BF16 and LM Studio's Q4_K_M:

**8B:**

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66abfd6f65beb23afa427d8a/4SSDIqxikgaDbx8V-Q6bf.png" width="80%" alt="Example" />
</div>

NexaQuant also preserves general capability better than standard 4-bit quantization on most of the benchmarks below (7 of 12):

**8B:**

| Benchmark                  | Full 16-bit | llama.cpp (4-bit) | NexaQuant (4-bit) |
|----------------------------|-------------|-------------------|-------------------|
| **HellaSwag**              | 57.07      | 52.12             | 54.56             |
| **MMLU**                   | 55.59      | 52.82             | 54.94             |
| **Humanities**             | 50.48      | 44.74             | 50.24             |
| **Social Sciences**        | 65.32      | 65.62             | 64.74             |
| **STEM**                   | 47.73      | 52.50             | 46.75             |
| **ARC Easy**               | 74.49      | 69.32             | 71.72             |
| **MathQA**                 | 35.34      | 30.00             | 32.46             |
| **PIQA**                   | 78.56      | 76.09             | 77.68             |
| **IFEval - Inst - Loose**  | 44.24      | 44.95             | 42.45             |
| **IFEval - Inst - Strict** | 42.57      | 44.95             | 40.05             |
| **IFEval - Prom - Loose**  | 30.31      | 25.74             | 28.47             |
| **IFEval - Prom - Strict** | 27.91      | 25.74             | 25.51             |
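
To summarize the table, the following Python sketch computes, for each 4-bit variant, the average point difference from the 16-bit baseline and counts the benchmarks on which NexaQuant scores higher than llama.cpp's 4-bit quantization. The scores are copied from the table above; the unweighted average is an illustrative choice for this sketch, not a metric reported with the benchmarks.

```python
# (benchmark, full 16-bit, llama.cpp 4-bit, NexaQuant 4-bit), copied from the table.
scores = [
    ("HellaSwag", 57.07, 52.12, 54.56),
    ("MMLU", 55.59, 52.82, 54.94),
    ("Humanities", 50.48, 44.74, 50.24),
    ("Social Sciences", 65.32, 65.62, 64.74),
    ("STEM", 47.73, 52.50, 46.75),
    ("ARC Easy", 74.49, 69.32, 71.72),
    ("MathQA", 35.34, 30.00, 32.46),
    ("PIQA", 78.56, 76.09, 77.68),
    ("IFEval - Inst - Loose", 44.24, 44.95, 42.45),
    ("IFEval - Inst - Strict", 42.57, 44.95, 40.05),
    ("IFEval - Prom - Loose", 30.31, 25.74, 28.47),
    ("IFEval - Prom - Strict", 27.91, 25.74, 25.51),
]

n = len(scores)
# Mean score difference from the 16-bit baseline (negative means a loss).
llamacpp_mean_delta = sum(q - full for _, full, q, _ in scores) / n
nexaquant_mean_delta = sum(nq - full for _, full, _, nq in scores) / n
# Benchmarks where NexaQuant scores higher than llama.cpp's 4-bit quantization.
nexaquant_wins = [name for name, _, q, nq in scores if nq > q]

print(f"llama.cpp (4-bit) mean delta: {llamacpp_mean_delta:+.2f}")
print(f"NexaQuant (4-bit) mean delta: {nexaquant_mean_delta:+.2f}")
print(f"NexaQuant ahead on {len(nexaquant_wins)} of {n} benchmarks")
```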

## What's next

1. Run inference for the Nexa-quantized DeepSeek-R1 distilled model on NPUs.

2. This model is designed for complex problem-solving, which is why it has a longer thinking process. We understand this can be an issue in some cases, and we're actively working on improvements.

### Follow us

If you like our work, feel free to ⭐ star [Nexa's GitHub Repo](https://github.com/NexaAI/nexa-sdk).

Interested in running DeepSeek R1 on your own devices with optimized CPU, GPU, and NPU acceleration, or in compressing your fine-tuned DeepSeek-R1-Distill model? [Let’s chat!](https://nexa.ai/book-a-call)

[Blogs](https://nexa.ai/blogs/quantized-deepseek-r1-on-device) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai)