danielhanchen committed · verified
Commit: 12c2900 · Parent: aaffc4e

Update README.md

Files changed (1): README.md (+11 −4)

README.md CHANGED
Or you can view more detailed instructions here: [unsloth.ai/blog/deepseek-r1](https://unsloth.ai/blog/deepseek-r1)

1. Do not forget about the `<|User|>` and `<|Assistant|>` tokens! Or use a chat template formatter.
2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp
3. Example with a Q4_0 K-quantized cache. **Note: `-no-cnv` disables auto conversation mode.**
```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    ...
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```
Example output:

```txt
...
```
4. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to it for faster processing. If you have multiple GPUs, you can probably offload more layers.
```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    ...
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```
5. If you want to merge the weights together, use this script:
```bash
./llama.cpp/llama-gguf-split --merge \
    ...
    merged_file.gguf
```
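As a rough guide for the GPU offload in step 4, you can estimate how many layers fit in VRAM from a quant's disk size. The sketch below is a back-of-envelope calculation, not a measured value: it assumes roughly uniform layer sizes, 61 transformer layers (a commonly cited figure for DeepSeek-R1, not stated in this README), and a few GB of headroom for the KV cache. llama.cpp's `--n-gpu-layers` (`-ngl`) flag sets the offload count.

```shell
# Back-of-envelope layer-offload estimate (assumptions: ~uniform layer sizes,
# 61 layers, 4 GB reserved for the KV cache and compute buffers).
awk 'BEGIN {
  model_gb   = 131   # disk size of the IQ1_S quant
  layers     = 61    # assumed DeepSeek-R1 layer count
  vram_gb    = 24    # e.g. RTX 4090
  reserve_gb = 4
  per_layer = model_gb / layers
  printf "~%.1f GB/layer -> try -ngl %d\n", per_layer, int((vram_gb - reserve_gb) / per_layer)
}'
```

Treat the result as a starting point and lower `-ngl` if you hit out-of-memory errors.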

| MoE Bits | Type    | Disk Size | Accuracy | Link | Details |
| -------- | ------- | --------- | -------- | ---- | ------- |
| 1.58bit  | IQ1_S   | **131GB** | Fair     | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S) | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
| 1.73bit  | IQ1_M   | **158GB** | Good     | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M) | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
| 2.22bit  | IQ2_XXS | **183GB** | Better   | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ2_XXS) | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
| 2.51bit  | Q2_K_XL | **212GB** | Best     | [Link](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-Q2_K_XL) | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |
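Given the sizes in the table, a small sketch can pick the largest quant that fits a disk or RAM budget. The sizes below are copied from the table; the 200 GB budget is just an example value.

```shell
# Pick the largest quant from the table that fits a given budget (sketch).
budget_gb=200
best=""
while read -r size_gb name; do
  if [ "$size_gb" -le "$budget_gb" ]; then
    best="$name ($size_gb GB)"   # entries are ordered smallest to largest
  fi
done <<'EOF'
131 IQ1_S
158 IQ1_M
183 IQ2_XXS
212 Q2_K_XL
EOF
echo "Largest quant within ${budget_gb} GB: $best"
```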

# Finetune LLMs 2-5x faster with 70% less memory via Unsloth!
We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb