---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---

# Dream-Coder GGUF Q8_0 Quantization Guide

This guide describes how to produce a GGUF Q8_0 quantization of the Dream-Coder-v0-Instruct-7B model.

## Quick Start

### 1. Environment Setup

```bash
# 1. Clone and build llama.cpp (the CMake build places binaries in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j"$(nproc)"

# 2. Install Python dependencies (quote the version spec so the shell
#    does not treat ">" as a redirection)
pip install "transformers>=4.46.2" torch safetensors numpy
```

### 2. Execute Quantization

#### Method 1: Use the provided script

```bash
# Set the llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp

# Run the quantization script
./quantize_example.sh
```

#### Method 2: Manual execution

```bash
python quantize_dream_q8_0.py \
  --model_path /path/to/Dream-Coder-v0-Instruct-7B \
  --llama_cpp_path /path/to/llama.cpp \
  --output_dir ./gguf_output \
  --keep_f16
```

### 3. Parameter Description

- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: `./gguf_output`)
- `--keep_f16`: Keep the F16 intermediate file

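Because `--model_path` and `--output_dir` have defaults, a minimal invocation run from inside the model directory needs only the required argument. A sketch of the same call as Method 2, relying on the defaults listed above:

```bash
# Run from the model directory: --model_path and --output_dir fall back to
# their defaults (the current directory and ./gguf_output respectively).
python quantize_dream_q8_0.py --llama_cpp_path /path/to/llama.cpp
```
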
## Architecture Adaptation

### Dream-Coder Special Configuration Handling

The quantization script specifically handles the following special aspects of Dream-Coder's configuration:

1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (for compatibility)
2. **Special Token IDs**:
   - `mask_token_id`: 151666 (the critical diffusion token)
   - `bos_token_id`: 151665
   - `eos_token_id`: 151643
   - `pad_token_id`: 151643
3. **Model Parameters**:
   - Vocabulary size: 152,064
   - Hidden dimension: 3,584
   - Attention heads: 28 (4 key-value heads)
   - Layers: 28
   - Context length: 32,768
4. **Diffusion Features**:
   - Preserves the `mask_token_id` metadata
   - RoPE theta: 1,000,000.0
   - Activation function: SiLU

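To confirm these values match your checkout before converting, you can inspect the model's `config.json` directly. A quick sketch with `jq` (assuming `jq` is installed and that these fields appear at the top level of `config.json`, which may differ for your snapshot):

```bash
# Print the token IDs and shape parameters the script expects to find.
jq '{mask_token_id, bos_token_id, eos_token_id, pad_token_id,
     vocab_size, hidden_size, num_attention_heads,
     num_key_value_heads, num_hidden_layers, max_position_embeddings,
     rope_theta, hidden_act}' /path/to/Dream-Coder-v0-Instruct-7B/config.json
```
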
## Output Description

### File Structure

```
gguf_output/
├── dream-coder-7b-f16.gguf   # F16 intermediate file (kept when --keep_f16 is set)
└── dream-coder-7b-q8_0.gguf  # Final Q8_0 quantized file
```

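One way to sanity-check the result is to dump the GGUF header and confirm the metadata (including the preserved token IDs) looks right. This sketch assumes the `gguf` Python package, which ships a `gguf-dump` helper (`pip install gguf`):

```bash
# Dump GGUF metadata; look for the tokenizer/token-ID fields near the top.
gguf-dump gguf_output/dream-coder-7b-q8_0.gguf | head -n 40
```
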
### Performance Expectations

| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory Usage | ~14 GB | ~6.7 GB |
| Inference Speed | 1.0x | 1.2-1.5x |
| Precision Loss | 0% | <0.1% |

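For intuition on the memory row: GGUF Q8_0 stores weights in blocks of 32, each block holding one f16 scale plus 32 int8 values, i.e. 34 bytes per 32 weights (about 8.5 bits/weight). A back-of-the-envelope check, taking 7B as a nominal parameter count (an assumption; the real file also carries metadata and any higher-precision tensors):

```bash
# ~8.5 bits/weight: each 32-weight block = 2-byte f16 scale + 32 int8 values.
awk 'BEGIN { printf "Q8_0: ~%.1f GiB\n", 7.0e9 * 34 / 32 / 2^30 }'
# BF16 baseline for comparison: 2 bytes per weight.
awk 'BEGIN { printf "BF16: ~%.1f GiB\n", 7.0e9 * 2 / 2^30 }'
```
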
## Usage

### llama.cpp Command Line

Since Dream-Coder is a diffusion-based model, you need the dedicated `llama-diffusion-cli` tool:

```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
  -m gguf_output/dream-coder-7b-q8_0.gguf \
  -p "def quicksort(arr):" \
  -n 512 \
  -c 2048 \
  --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
  -m gguf_output/dream-coder-7b-q8_0.gguf \
  -p "Write a binary search function" \
  -n 256 \
  -c 2048 \
  --temp 0.1 \
  --top-p 0.95 \
  --repeat-penalty 1.1 \
  --diffusion-steps 128 \
  --diffusion-algorithm 4 \
  --diffusion-alg-temp 0.0 \
  -t 8

# Visualize the generation process
./llama.cpp/build/bin/llama-diffusion-cli \
  -m gguf_output/dream-coder-7b-q8_0.gguf \
  -p "def fibonacci(n):" \
  -n 256 \
  --diffusion-steps 64 \
  --diffusion-visual
```

#### Diffusion Parameter Description

- `--diffusion-steps N`: Number of diffusion denoising steps (default: 128)
- `--diffusion-algorithm N`: Algorithm selection:
  - 0 = ORIGIN (original algorithm)
  - 1 = ENTROPY_BASED (entropy-based)
  - 2 = MARGIN_BASED (margin-based)
  - 3 = RANDOM (random)
  - 4 = LOW_CONFIDENCE (low confidence, default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode to show generation progress
- `--diffusion-eps F`: Time-step epsilon value

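To compare the unmasking strategies on your own prompts, a simple sweep over the five algorithm values looks like this (a sketch; the prompt and step count are arbitrary):

```bash
# Run the same prompt under each diffusion algorithm (0-4) for comparison.
for alg in 0 1 2 3 4; do
  echo "=== --diffusion-algorithm $alg ==="
  ./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def is_prime(n):" \
    -n 128 \
    --diffusion-steps 64 \
    --diffusion-algorithm "$alg"
done
```
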
### Python (llama-cpp-python)

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0  # CPU inference; set >0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1
)

print(output['choices'][0]['text'])
```

### With GPU Acceleration

If llama.cpp is built with CUDA support:

```bash
# Build the CUDA version
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j"$(nproc)"

# Use GPU acceleration (offload part of the layers)
./build/bin/llama-diffusion-cli \
  -m gguf_output/dream-coder-7b-q8_0.gguf \
  -p "def quicksort(arr):" \
  -n 512 \
  --diffusion-steps 128 \
  -ngl 20  # number of layers to offload to the GPU
```

## Troubleshooting

### Common Issues

1. **Conversion Failure**:
   - Ensure llama.cpp is built correctly
   - Check Python dependency versions
   - Verify model file integrity
2. **Quantization Failure**:
   - Check disk space (~20 GB of temporary space is needed; see the quick checks below)
   - Ensure sufficient memory (32 GB+ recommended)
3. **Inference Errors**:
   - Verify GGUF file integrity
   - Check the context length settings
   - Try reducing `n_gpu_layers`

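Before starting a long quantization run, it is worth checking resources up front (standard Linux utilities; adjust for your platform):

```bash
# Free disk space in the working directory (~20 GB temporary space needed).
df -h .
# Available RAM (32 GB+ recommended for the conversion step).
free -h
```
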
### Model Validation

```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test (the prompt is passed via -p, as in the examples above)
./llama.cpp/build/bin/llama-diffusion-cli \
  -m gguf_output/dream-coder-7b-q8_0.gguf \
  -p "def hello():" \
  -n 20 \
  --diffusion-steps 64
```

## Performance Optimization

### CPU Optimization
- Set the thread count with `-t`
- Build with AVX2/AVX512 enabled
- Adjust the batch size (`-b`)

### GPU Optimization
- Build with CUDA/OpenCL support
- Tune the number of offloaded layers (`-ngl`)
- Monitor GPU memory usage

### Memory Optimization
- Memory mapping is enabled by default; pass `--no-mmap` to disable it
- Use `--mlock` to pin the model in RAM and prevent swapping
- Set an appropriate context length

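Putting several of these knobs together (a sketch only; the thread count, batch size, and layer count depend on your hardware):

```bash
# 8 CPU threads, batch size 512, model pinned in RAM, 20 layers on the GPU.
./llama.cpp/build/bin/llama-diffusion-cli \
  -m gguf_output/dream-coder-7b-q8_0.gguf \
  -p "def quicksort(arr):" \
  -n 256 \
  -c 2048 \
  --diffusion-steps 128 \
  -t 8 \
  -b 512 \
  --mlock \
  -ngl 20
```
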
## Important Notes

1. **Diffusion Generation**: Dream-Coder generates via diffusion, unlike traditional autoregressive models
2. **Dedicated Tool**: You must use `llama-diffusion-cli`, not the standard `llama-cli` (formerly `main`) tool
3. **Special Tokens**: Keep `mask_token_id` (151666) handled correctly
4. **Context Length**: Up to 32K tokens is supported, but 2K-4K is recommended for best performance
5. **Generation Parameters**: A low temperature (0.1-0.3) and a moderate top_p (0.9-0.95) are recommended
6. **Diffusion Steps**: 64-128 steps are recommended; more steps may improve quality at the cost of inference time

## Technical Support

If you encounter issues, please check:

1. llama.cpp version and build status
2. Python dependency version compatibility
3. Model file integrity
4. System resources (memory/disk)

For more information, refer to:

- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)