Update README.md
README.md

1. Do not forget the `<|User|>` and `<|Assistant|>` tokens! Alternatively, use a chat template formatter (see the conversation-mode sketch after the example output below).
2. Obtain the latest `llama.cpp` from https://github.com/ggerganov/llama.cpp
3. Example with a Q4_0 K-quantized cache. **Note: `-no-cnv` disables auto conversation mode.**

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```

Example output:

```txt
```
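
If you would rather not hand-write the `<|User|>` and `<|Assistant|>` markers, llama.cpp's conversation mode can format the prompt for you: leave out `-no-cnv` (and `--prompt`), and `llama-cli` applies the chat template stored in the GGUF to whatever you type interactively. A minimal sketch, reusing only the flags shown above:

```bash
# Conversation mode is active whenever -no-cnv is NOT passed; llama-cli then
# formats your interactive input with the GGUF's embedded chat template, so the
# <|User|>/<|Assistant|> tokens are inserted for you.
./llama.cpp/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --ctx-size 8192
```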

4. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```
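
As an illustration only (not the exact command from the original guide): the offloading itself is done with llama.cpp's `--n-gpu-layers` flag (alias `-ngl`), and the layer count below is a placeholder to tune against your available VRAM.

```bash
# Hypothetical variant of the command above with GPU offload.
# --n-gpu-layers 7 is only a placeholder; raise it until VRAM runs out.
./llama.cpp/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --n-gpu-layers 7 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```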

5. If you want to merge the weights back into a single file, use this command:
```
# first shard in, merged file out
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf
```
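
Once merged, point `--model` at the single file instead of the first shard, keeping the same flags as above:

```bash
./llama.cpp/llama-cli \
    --model merged_file.gguf \
    --cache-type-k q4_0 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```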

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
| -------- | ---- | --------- | -------- | ---- | ------- |
| 1.58bit | IQ1_S | **131GB** | Fair | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S) | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
| 1.73bit | IQ1_M | **158GB** | Good | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M) | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
| 2.22bit | IQ2_XXS | **183GB** | Better | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ2_XXS) | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
| 2.51bit | Q2_K_XL | **212GB** | Best | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-Q2_K_XL) | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |
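
To fetch one of these variants, the Hugging Face CLI works well. A minimal sketch for the 1.58-bit files; the `--include` pattern and target directory are illustrative, so adjust them to the variant you picked:

```bash
# Download only the UD-IQ1_S shards from unsloth/DeepSeek-R1-GGUF.
pip install "huggingface_hub[cli]"
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "DeepSeek-R1-UD-IQ1_S/*" \
    --local-dir .
```

With `--local-dir .`, the shards land in `DeepSeek-R1-UD-IQ1_S/`, which matches the `--model` path used in the commands above.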

# Finetune LLMs 2-5x faster with 70% less memory via Unsloth!
We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb