---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---

# Dream-Coder GGUF Q8_0 Quantization Guide

This guide describes GGUF Q8_0 quantization of the Dream-Coder v0-Instruct-7B model.

## Quick Start

### 1. Environment Setup

```bash
# 1. Clone and build llama.cpp (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j$(nproc)

# 2. Install Python dependencies (quote the version spec so the shell
#    does not treat ">" as a redirect)
pip install "transformers>=4.46.2" torch safetensors numpy
```
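
A quick sanity check that the build produced the diffusion CLI used throughout this guide (this assumes a llama.cpp version recent enough to include the diffusion example):

```bash
# Verify that the diffusion CLI target was built
ls -l build/bin/llama-diffusion-cli

# Print the first lines of its usage text
build/bin/llama-diffusion-cli --help | head -n 5
```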

### 2. Run the Quantization

#### Method 1: Use the provided script

```bash
# Set the llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp

# Run the quantization script
./quantize_example.sh
```

#### Method 2: Manual execution

```bash
python quantize_dream_q8_0.py \
    --model_path /path/to/Dream-Coder-v0-Instruct-7B \
    --llama_cpp_path /path/to/llama.cpp \
    --output_dir ./gguf_output \
    --keep_f16
```

### 3. Parameter Description

- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: `./gguf_output`)
- `--keep_f16`: Keep the F16 intermediate file
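
The script wraps what is conceptually the standard two-step llama.cpp pipeline: convert the Hugging Face checkpoint to an F16 GGUF, then quantize it to Q8_0. A minimal sketch of those steps, assuming a recent llama.cpp checkout (file names are illustrative; the script above additionally applies the Dream-specific configuration handling described in the next section):

```bash
# Step 1: convert the HF checkpoint to an F16 GGUF
# (convert_hf_to_gguf.py ships with llama.cpp)
python /path/to/llama.cpp/convert_hf_to_gguf.py \
    /path/to/Dream-Coder-v0-Instruct-7B \
    --outtype f16 \
    --outfile ./gguf_output/dream-coder-7b-f16.gguf

# Step 2: quantize F16 -> Q8_0
/path/to/llama.cpp/build/bin/llama-quantize \
    ./gguf_output/dream-coder-7b-f16.gguf \
    ./gguf_output/dream-coder-7b-q8_0.gguf \
    Q8_0
```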

## Architecture Adaptation

### Dream-Coder Special Configuration Handling

This quantization script specifically handles the following special configurations of Dream-Coder (see the metadata spot-check after this list):

1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (for compatibility)
2. **Special Token IDs**:
   - `mask_token_id`: 151666 (the critical diffusion token)
   - `bos_token_id`: 151665
   - `eos_token_id`: 151643
   - `pad_token_id`: 151643

3. **Model Parameters**:
   - Vocabulary size: 152,064
   - Hidden dimension: 3,584
   - Attention heads: 28 (4 key-value heads)
   - Layers: 28
   - Context length: 32,768

4. **Diffusion Features**:
   - Preserve `mask_token_id` metadata
   - RoPE theta: 1,000,000.0
   - Activation function: SiLU
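
To spot-check that this metadata survived conversion, you can dump the GGUF header. A minimal check using the `gguf` Python package (the exact metadata key names may vary with converter version):

```bash
pip install gguf

# List metadata only (skip tensor listings) and filter for token-related keys
gguf-dump --no-tensors gguf_output/dream-coder-7b-q8_0.gguf | grep -i token
```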

## Output Description

### File Structure

```
gguf_output/
├── dream-coder-7b-f16.gguf    # F16 intermediate file (optionally kept via --keep_f16)
└── dream-coder-7b-q8_0.gguf   # Final Q8_0 quantized file
```

### Performance Expectations

| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory usage | ~14 GB | ~6.7 GB |
| Inference speed | 1.0x | 1.2-1.5x |
| Precision loss | 0% | <0.1% |

## Usage

### llama.cpp Command Line

Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:

```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    -c 2048 \
    --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "Write a binary search function" \
    -n 256 \
    -c 2048 \
    --temp 0.1 \
    --top-p 0.95 \
    --repeat-penalty 1.1 \
    --diffusion-steps 128 \
    --diffusion-algorithm 4 \
    --diffusion-alg-temp 0.0 \
    -t 8

# Visualize the generation process
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def fibonacci(n):" \
    -n 256 \
    --diffusion-steps 64 \
    --diffusion-visual
```

#### Diffusion Parameter Description

- `--diffusion-steps N`: Number of diffusion denoising steps (default: 128)
- `--diffusion-algorithm N`: Token-unmasking algorithm:
  - 0 = ORIGIN (original algorithm)
  - 1 = ENTROPY_BASED (entropy-based)
  - 2 = MARGIN_BASED (margin-based)
  - 3 = RANDOM (random)
  - 4 = LOW_CONFIDENCE (low confidence, default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode showing generation progress
- `--diffusion-eps F`: Epsilon for the time-step schedule
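
Since the step count trades speed for quality, it can be worth sweeping it for your workload. A small illustrative sweep (the prompt and step values are arbitrary):

```bash
# Compare output quality and wall-clock time across step counts
for steps in 32 64 128 256; do
    echo "=== diffusion-steps=$steps ==="
    time ./llama.cpp/build/bin/llama-diffusion-cli \
        -m gguf_output/dream-coder-7b-q8_0.gguf \
        -p "def is_prime(n):" \
        -n 128 \
        --temp 0.1 \
        --diffusion-steps "$steps"
done
```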

### Python (llama-cpp-python)

Note: `llama-cpp-python` exposes standard autoregressive sampling, not the diffusion sampler used by `llama-diffusion-cli` (see Important Notes below), so results through this path may differ.

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0  # CPU inference; set >0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1
)

print(output['choices'][0]['text'])
```

### With GPU Acceleration

If building with CUDA support (recent llama.cpp versions use the CMake option `GGML_CUDA`; the old `LLAMA_CUBLAS` make flag is deprecated):

```bash
# Build the CUDA version
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j$(nproc)

# Use GPU acceleration (offload part of the layers)
./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    --diffusion-steps 128 \
    -ngl 20  # number of layers to offload to the GPU
```

## Troubleshooting

### Common Issues

1. **Conversion Failure**:
   - Ensure llama.cpp is built correctly
   - Check Python dependency versions
   - Verify model file integrity

2. **Quantization Failure** (quick resource checks below):
   - Check disk space (~20 GB of temporary space is needed)
   - Ensure sufficient memory (32 GB+ recommended)

3. **Inference Errors**:
   - Verify GGUF file integrity
   - Check context length settings
   - Try reducing `n_gpu_layers`
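
For the resource-related failures, a quick pre-flight check with standard Linux utilities:

```bash
# Free disk space in the working directory (~20 GB temporary space needed)
df -h .

# Available RAM (32 GB+ recommended for conversion)
free -h
```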

### Model Validation

```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test (pass the prompt with -p)
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def hello():" \
    -n 20 \
    --diffusion-steps 64
```

## Performance Optimization

### CPU Optimization
- Set the thread count with `-t`
- Enable AVX2/AVX512 build options
- Adjust the batch size (`-b`)

### GPU Optimization
- Build with a GPU backend (e.g. CUDA)
- Adjust the number of offloaded layers (`-ngl`)
- Monitor GPU memory usage

### Memory Optimization
- Memory mapping is on by default; pass `--no-mmap` to disable it
- Use `--mlock` to pin the model in RAM
- Set an appropriate context length (`-c`)

A combined example follows these lists.
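
A sketch combining these knobs, assuming `llama-diffusion-cli` accepts llama.cpp's common runtime flags (values are illustrative starting points, not tuned recommendations):

```bash
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 256 \
    -c 4096 \
    -t 8 \
    -b 512 \
    -ngl 20 \
    --mlock \
    --diffusion-steps 128
```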

## Important Notes

1. **Diffusion Features**: Dream-Coder generates text via diffusion, unlike traditional autoregressive models
2. **Dedicated Tool**: You must use `llama-diffusion-cli`, not the regular `llama-cli` (formerly `main`) tool
3. **Special Tokens**: Keep `mask_token_id` (151666) handled correctly
4. **Context Length**: Up to 32K tokens is supported, but 2K-4K is recommended for the best performance
5. **Generation Parameters**: Use a low temperature (0.1-0.3) and a moderate top_p (0.9-0.95)
6. **Diffusion Steps**: 64-128 steps is recommended; more steps may improve quality but increase inference time

## Technical Support

If you encounter issues, please check:
1. llama.cpp version and build status
2. Python dependency version compatibility
3. Model file integrity
4. System resources (memory/disk)

For more information, refer to:
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)