---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---
# Dream-Coder GGUF Q8_0 Quantization Guide
This guide covers GGUF Q8_0 quantization of the Dream-Coder-v0-Instruct-7B model.
## Quick Start
### 1. Environment Setup
```bash
# 1. Clone and build llama.cpp (binaries land in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
# 2. Install Python dependencies (quote the version spec so the shell
#    does not treat ">=" as a redirect)
pip install "transformers>=4.46.2" torch safetensors numpy
```
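Before quantizing, it can help to confirm that the build produced the tools this guide relies on. A minimal sanity-check sketch, assuming the default cmake layout under `build/bin/` (adjust the clone path to your setup):
```python
from pathlib import Path

# Hypothetical path; point this at your llama.cpp clone.
LLAMA_CPP = Path("/path/to/llama.cpp")

required = [
    LLAMA_CPP / "build" / "bin" / "llama-quantize",       # quantization tool
    LLAMA_CPP / "build" / "bin" / "llama-diffusion-cli",  # diffusion inference tool
    LLAMA_CPP / "convert_hf_to_gguf.py",                  # HF -> GGUF converter
]

for path in required:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")
```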
### 2. Execute Quantization
#### Method 1: Use the provided script
```bash
# Set llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp
# Run quantization script
./quantize_example.sh
```
#### Method 2: Manual execution
```bash
python quantize_dream_q8_0.py \
    --model_path /path/to/Dream-Coder-v0-Instruct-7B \
    --llama_cpp_path /path/to/llama.cpp \
    --output_dir ./gguf_output \
    --keep_f16
```
### 3. Parameter Description
- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: ./gguf_output)
- `--keep_f16`: Keep F16 intermediate files
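If you prefer not to use the wrapper script, the conversion presumably reduces to llama.cpp's standard two-step flow: convert the HF checkpoint to an F16 GGUF, then quantize to Q8_0. A hedged sketch (the paths and the exact behavior of `quantize_dream_q8_0.py` are assumptions):
```python
import subprocess
from pathlib import Path

LLAMA_CPP = Path("/path/to/llama.cpp")  # adjust to your clone
MODEL_DIR = Path("/path/to/Dream-Coder-v0-Instruct-7B")
OUT = Path("gguf_output")
OUT.mkdir(exist_ok=True)

f16_path = OUT / "dream-coder-7b-f16.gguf"
q8_path = OUT / "dream-coder-7b-q8_0.gguf"

# Step 1: HF safetensors -> F16 GGUF
subprocess.run(
    ["python", str(LLAMA_CPP / "convert_hf_to_gguf.py"), str(MODEL_DIR),
     "--outfile", str(f16_path), "--outtype", "f16"],
    check=True,
)

# Step 2: F16 GGUF -> Q8_0 GGUF
subprocess.run(
    [str(LLAMA_CPP / "build/bin/llama-quantize"),
     str(f16_path), str(q8_path), "Q8_0"],
    check=True,
)
```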
## Architecture Adaptation
### Dream-Coder Special Configuration Handling
This quantization script handles the following Dream-Coder-specific configuration (a hypothetical sketch of the config patching appears after this list):
1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (so the GGUF converter recognizes the checkpoint)
2. **Special Token IDs**:
- `mask_token_id`: 151666 (critical diffusion token)
- `bos_token_id`: 151665
- `eos_token_id`: 151643
- `pad_token_id`: 151643
3. **Model Parameters**:
- Vocabulary size: 152,064
- Hidden dimension: 3,584
- Attention heads: 28 (4 key-value heads)
- Layers: 28
- Context length: 32,768
4. **Diffusion Features**:
- Preserve `mask_token_id` metadata
- RoPE theta: 1,000,000.0
- Activation function: SiLU
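The following is a hypothetical illustration of the kind of config patching involved, not the actual contents of `quantize_dream_q8_0.py`: the `architectures` field is rewritten so the converter treats the checkpoint as a Llama-style model, while the diffusion-specific token IDs from this guide are preserved.
```python
import json
from pathlib import Path

# Hypothetical path; point at the downloaded checkpoint.
config_path = Path("/path/to/Dream-Coder-v0-Instruct-7B/config.json")
config = json.loads(config_path.read_text())

# Map the Dream architecture onto one the GGUF converter understands.
config["architectures"] = ["LlamaForCausalLM"]

# Diffusion-specific token IDs that must survive conversion
# (values taken from this guide).
assert config.get("mask_token_id", 151666) == 151666
config.setdefault("bos_token_id", 151665)
config.setdefault("eos_token_id", 151643)
config.setdefault("pad_token_id", 151643)

config_path.write_text(json.dumps(config, indent=2))
```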
## Output Description
### File Structure
```
gguf_output/
├── dream-coder-7b-f16.gguf    # F16 intermediate file (kept only with --keep_f16)
└── dream-coder-7b-q8_0.gguf # Final Q8_0 quantized file
```
### Performance Expectations
| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory Usage | ~14GB | ~6.7GB |
| Inference Speed | 1.0x | 1.2-1.5x |
| Precision Loss | 0% | <0.1% |
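To turn these numbers into a memory budget, remember that the KV cache grows with context length on top of the ~6.7GB weight file. A back-of-the-envelope estimate using the model parameters above (28 layers, 4 KV heads, head dim 3584/28 = 128, F16 cache):
```python
# Rough KV-cache size estimate at a given context length.
n_layers, n_kv_heads, head_dim = 28, 4, 3584 // 28  # head_dim = 128
bytes_per_elem = 2  # F16 cache

def kv_cache_bytes(n_ctx: int) -> int:
    # K and V tensors, per layer, per token
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (2048, 4096, 32768):
    print(f"n_ctx={n_ctx:6d}: {kv_cache_bytes(n_ctx) / 2**20:7.1f} MiB")
```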
## Usage
### llama.cpp Command Line
Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:
```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    -c 2048 \
    --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "Write a binary search function" \
    -n 256 \
    -c 2048 \
    --temp 0.1 \
    --top-p 0.95 \
    --repeat-penalty 1.1 \
    --diffusion-steps 128 \
    --diffusion-algorithm 4 \
    --diffusion-alg-temp 0.0 \
    -t 8

# Visualize generation process
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def fibonacci(n):" \
    -n 256 \
    --diffusion-steps 64 \
    --diffusion-visual
```
#### Diffusion Parameter Description
- `--diffusion-steps N`: Diffusion denoising steps (default: 128; a timing sweep over step counts is sketched after this list)
- `--diffusion-algorithm N`: Algorithm selection:
- 0 = ORIGIN (original algorithm)
- 1 = ENTROPY_BASED (entropy-based)
- 2 = MARGIN_BASED (margin-based)
- 3 = RANDOM (random)
- 4 = LOW_CONFIDENCE (low confidence, default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode, show generation progress
- `--diffusion-eps F`: Time step epsilon value
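Since the step count trades quality against latency, a quick sweep can help you pick a value for your hardware. A hedged sketch using the CLI flags above (binary and model paths assumed):
```python
import subprocess
import time

CLI = "./llama.cpp/build/bin/llama-diffusion-cli"
MODEL = "gguf_output/dream-coder-7b-q8_0.gguf"

# Time the same prompt at increasing denoising step counts.
for steps in (32, 64, 128, 256):
    t0 = time.perf_counter()
    subprocess.run(
        [CLI, "-m", MODEL, "-p", "def quicksort(arr):",
         "-n", "256", "--diffusion-steps", str(steps)],
        capture_output=True, text=True, check=True,
    )
    print(f"{steps:4d} steps: {time.perf_counter() - t0:6.1f}s")
```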
### Python (llama-cpp-python)
Note that llama-cpp-python's high-level API performs standard autoregressive decoding and may not drive llama.cpp's diffusion sampler, so outputs can differ from `llama-diffusion-cli` (a subprocess-based alternative is sketched after this example).
```bash
pip install llama-cpp-python
```
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0,  # CPU inference; set >0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1,
)
print(output["choices"][0]["text"])
```
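If the binding does not expose the diffusion sampler, a pragmatic workaround is to shell out to `llama-diffusion-cli` from Python. A minimal sketch (CLI and model paths assumed):
```python
import subprocess

def diffusion_generate(prompt: str, n_tokens: int = 256, steps: int = 128) -> str:
    """Generate text by invoking llama-diffusion-cli as a subprocess."""
    result = subprocess.run(
        ["./llama.cpp/build/bin/llama-diffusion-cli",
         "-m", "gguf_output/dream-coder-7b-q8_0.gguf",
         "-p", prompt,
         "-n", str(n_tokens),
         "--diffusion-steps", str(steps),
         "--temp", "0.1"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(diffusion_generate("def fibonacci(n):"))
```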
### With GPU Acceleration
If compiled with CUDA support:
```bash
# Build with CUDA support (GGML_CUDA replaces the deprecated LLAMA_CUBLAS flag)
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Use GPU acceleration (partial layers)
./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    --diffusion-steps 128 \
    -ngl 20  # number of layers to offload to the GPU
```
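A rough way to choose `-ngl` is to divide the model file size by the layer count and see how many layers fit in your free VRAM. A hedged sketch using this guide's figures:
```python
# Rough heuristic for picking -ngl: assume layers are roughly equal in size.
model_size_gb = 6.7   # Q8_0 file size from the table above
n_layers = 28
vram_budget_gb = 5.0  # leave headroom for the KV cache and CUDA buffers

per_layer_gb = model_size_gb / n_layers
ngl = min(int(vram_budget_gb / per_layer_gb), n_layers)
print(f"~{per_layer_gb:.2f} GB/layer -> try -ngl {ngl}")
```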
## Troubleshooting
### Common Issues
1. **Conversion Failure**:
- Ensure llama.cpp is compiled correctly
- Check Python dependency versions
- Verify model file integrity
2. **Quantization Failure**:
- Check disk space (~20GB temporary space needed)
- Ensure sufficient memory (32GB+ recommended)
3. **Inference Errors**:
- Verify GGUF file integrity
- Check context length settings
- Try reducing `n_gpu_layers`
### Model Validation
```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf
# Simple inference test
echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -n 20 --diffusion-steps 64
```
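Beyond checking the file size, you can confirm that tokenizer metadata survived conversion using the `gguf` Python package that ships with llama.cpp (`pip install gguf`). A minimal sketch that lists token-related metadata keys:
```python
from gguf import GGUFReader

reader = GGUFReader("gguf_output/dream-coder-7b-q8_0.gguf")

# Print token-related metadata keys to confirm special-token handling carried over.
for name in reader.fields:
    if "token" in name:
        print(name)
```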
## Performance Optimization
### CPU Optimization
- Use `-t` parameter to set thread count
- Enable AVX2/AVX512 compilation options
- Adjust batch size (`-b` parameter)
### GPU Optimization
- Use CUDA/OpenCL compilation
- Adjust GPU layer count (`-ngl`)
- Monitor GPU memory usage
### Memory Optimization
- Memory mapping is enabled by default; use `--no-mmap` to disable it
- Use `--mlock` to pin the model in RAM and prevent swapping
- Set an appropriate context length (`-c`)
## Important Notes
1. **Diffusion Features**: Dream-Coder uses diffusion generation, different from traditional autoregressive models
2. **Dedicated Tool**: Must use `llama-diffusion-cli` rather than the regular `llama-cli` binary
3. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
4. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
5. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
6. **Diffusion Steps**: Recommend 64-128 steps, more steps may improve quality but increase inference time
## Technical Support
If you encounter issues, please check:
1. llama.cpp version and compilation status
2. Python dependency version compatibility
3. Model file integrity
4. System resources (memory/disk)
For more information, refer to:
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)