---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---

# Dream-Coder GGUF Q8_0 Quantization Guide

This guide covers GGUF Q8_0 quantization of the Dream-Coder v0-Instruct-7B model.

## Quick Start

### 1. Environment Setup

```bash
# 1. Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j

# 2. Install Python dependencies
pip install "transformers>=4.46.2" torch safetensors numpy
```

### 2. Execute Quantization

#### Method 1: Use the provided script

```bash
# Set the llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp

# Run the quantization script
./quantize_example.sh
```

#### Method 2: Manual execution

```bash
python quantize_dream_q8_0.py \
    --model_path /path/to/Dream-Coder-v0-Instruct-7B \
    --llama_cpp_path /path/to/llama.cpp \
    --output_dir ./gguf_output \
    --keep_f16
```

### 3. Parameter Description

- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: `./gguf_output`)
- `--keep_f16`: Keep the F16 intermediate file

## Architecture Adaptation

### Dream-Coder Special Configuration Handling

The quantization script handles the following Dream-Coder-specific configuration:

1. **Architecture mapping**: DreamModel → LlamaForCausalLM (for llama.cpp compatibility)
2. **Special token IDs**:
   - `mask_token_id`: 151666 (the critical diffusion token)
   - `bos_token_id`: 151665
   - `eos_token_id`: 151643
   - `pad_token_id`: 151643
3. **Model parameters**:
   - Vocabulary size: 152,064
   - Hidden dimension: 3,584
   - Attention heads: 28 (4 key-value heads)
   - Layers: 28
   - Context length: 32,768
4. **Diffusion features**:
   - `mask_token_id` metadata is preserved
   - RoPE theta: 1,000,000.0
   - Activation function: SiLU

## Output Description

### File Structure

```
gguf_output/
├── dream-coder-7b-f16.gguf   # F16 intermediate file (kept with --keep_f16)
└── dream-coder-7b-q8_0.gguf  # Final Q8_0 quantized file
```

### Performance Expectations

| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory usage | ~14 GB | ~6.7 GB |
| Inference speed | 1.0x | 1.2-1.5x |
| Precision loss | 0% | <0.1% |

## Usage

### llama.cpp Command Line

Because Dream-Coder is a diffusion-based model, you must use the dedicated `llama-diffusion-cli` tool:

```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    -c 2048 \
    --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "Write a binary search function" \
    -n 256 \
    -c 2048 \
    --temp 0.1 \
    --top-p 0.95 \
    --repeat-penalty 1.1 \
    --diffusion-steps 128 \
    --diffusion-algorithm 4 \
    --diffusion-alg-temp 0.0 \
    -t 8

# Visualize the generation process
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def fibonacci(n):" \
    -n 256 \
    --diffusion-steps 64 \
    --diffusion-visual
```

#### Diffusion Parameter Description

- `--diffusion-steps N`: Number of denoising steps (default: 128); see the sweep sketch below
- `--diffusion-algorithm N`: Algorithm selection:
  - 0 = ORIGIN (original algorithm)
  - 1 = ENTROPY_BASED (entropy-based)
  - 2 = MARGIN_BASED (margin-based)
  - 3 = RANDOM (random)
  - 4 = LOW_CONFIDENCE (low confidence, default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode, showing generation progress
- `--diffusion-eps F`: Timestep epsilon value
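The step count is the main quality/latency knob. The sketch below is one way to sweep `--diffusion-steps` from Python and time each run; it is a minimal illustration that reuses the binary and model paths from the examples above (adjust them to your setup) and is not part of the quantization scripts.

```python
import subprocess
import time

CLI = "./llama.cpp/build/bin/llama-diffusion-cli"  # path assumed from the examples above
MODEL = "gguf_output/dream-coder-7b-q8_0.gguf"
PROMPT = "def quicksort(arr):"

# Generate from the same prompt at increasing step counts and time each run.
for steps in (32, 64, 128):
    start = time.time()
    run = subprocess.run(
        [CLI, "-m", MODEL, "-p", PROMPT, "-n", "256",
         "--diffusion-steps", str(steps)],
        capture_output=True,
        text=True,
    )
    print(f"--- {steps} steps, {time.time() - start:.1f}s ---")
    print(run.stdout.strip())
```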
### Python (llama-cpp-python)

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0  # CPU inference; set > 0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1
)
print(output['choices'][0]['text'])
```

Note that llama-cpp-python runs the standard autoregressive decoding loop; at the time of writing it does not expose the diffusion sampler, so results may differ from `llama-diffusion-cli`.

### With GPU Acceleration

If llama.cpp is built with CUDA support:

```bash
# Build with CUDA enabled
cd llama.cpp
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

# Use GPU acceleration (offload part of the layers)
./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    --diffusion-steps 128 \
    -ngl 20  # number of layers to offload to the GPU
```

## Troubleshooting

### Common Issues

1. **Conversion failure**:
   - Ensure llama.cpp is built correctly
   - Check Python dependency versions
   - Verify model file integrity
2. **Quantization failure**:
   - Check disk space (~20 GB of temporary space is needed)
   - Ensure sufficient memory (32 GB+ recommended)
3. **Inference errors**:
   - Verify GGUF file integrity
   - Check the context length settings
   - Try reducing `n_gpu_layers`

### Model Validation

```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test
echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -n 20 \
    --diffusion-steps 64
```

## Performance Optimization

### CPU Optimization

- Use the `-t` parameter to set the thread count
- Enable AVX2/AVX512 build options
- Adjust the batch size (`-b`)

### GPU Optimization

- Build with CUDA/OpenCL support
- Tune the number of offloaded layers (`-ngl`)
- Monitor GPU memory usage

### Memory Optimization

- Memory mapping is enabled by default; pass `--no-mmap` to disable it
- Use `--mlock` to pin the model in RAM and avoid swapping
- Set an appropriate context length (`-c`)

## Important Notes

1. **Diffusion generation**: Dream-Coder generates via diffusion, unlike traditional autoregressive models.
2. **Dedicated tool**: You must use `llama-diffusion-cli` instead of the standard `llama-cli` tool.
3. **Special tokens**: Keep the handling of `mask_token_id` (151666) intact.
4. **Context length**: Up to 32K tokens is supported, but 2K-4K is recommended for optimal performance.
5. **Generation parameters**: A low temperature (0.1-0.3) and a top_p of 0.9-0.95 are recommended.
6. **Diffusion steps**: 64-128 steps are recommended; more steps may improve quality but increase inference time.

## Technical Support

If you encounter issues, check:

1. llama.cpp version and build status
2. Python dependency compatibility
3. Model file integrity
4. System resources (memory/disk)

For more information, refer to:

- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
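As a final check beyond the shell commands in the Model Validation section, the GGUF header can be inspected from Python with only the standard library. This is a minimal sketch based on the published GGUF header layout (magic `GGUF`, little-endian uint32 version, then uint64 tensor and metadata counts for GGUF v2+, which is what current llama.cpp writes); the path is the one assumed throughout this guide.

```python
import struct

GGUF_PATH = "gguf_output/dream-coder-7b-q8_0.gguf"  # adjust to your output path

with open(GGUF_PATH, "rb") as f:
    magic = f.read(4)                                   # b"GGUF" for valid files
    version, = struct.unpack("<I", f.read(4))           # format version
    n_tensors, n_kv = struct.unpack("<QQ", f.read(16))  # tensor / metadata counts

assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```

A truncated or corrupted download typically fails the magic check or reports implausible counts, making this a quick sanity test before loading the full ~6.7 GB file.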