---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---
# Dream-Coder GGUF Q8_0 Quantization Guide
This guide covers GGUF Q8_0 quantization of the Dream-Coder-v0-Instruct-7B model.
## Quick Start
### 1. Environment Setup
```bash
# 1. Clone and build llama.cpp (binaries land in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
# 2. Install Python dependencies (quote the version spec so the shell
#    does not treat ">=" as a redirect)
pip install "transformers>=4.46.2" torch safetensors numpy
```
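Before quantizing, it can help to confirm that the build produced the tools this guide relies on. A minimal sanity-check sketch, assuming the default cmake layout under `build/bin/` (adjust the clone path to your setup):
```python
from pathlib import Path

# Hypothetical path; point this at your llama.cpp clone.
LLAMA_CPP = Path("/path/to/llama.cpp")

required = [
    LLAMA_CPP / "build" / "bin" / "llama-quantize",       # quantization tool
    LLAMA_CPP / "build" / "bin" / "llama-diffusion-cli",  # diffusion inference tool
    LLAMA_CPP / "convert_hf_to_gguf.py",                  # HF -> GGUF converter
]

for path in required:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")
```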
### 2. Execute Quantization
#### Method 1: Use the provided script
```bash
# Set llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp
# Run quantization script
./quantize_example.sh
```
#### Method 2: Manual execution
```bash
python quantize_dream_q8_0.py \
    --model_path /path/to/Dream-Coder-v0-Instruct-7B \
    --llama_cpp_path /path/to/llama.cpp \
    --output_dir ./gguf_output \
    --keep_f16
```
### 3. Parameter Description
- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: ./gguf_output)
- `--keep_f16`: Keep F16 intermediate files
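If you prefer not to use the wrapper script, the conversion presumably reduces to llama.cpp's standard two-step flow: convert the HF checkpoint to an F16 GGUF, then quantize to Q8_0. A hedged sketch (the paths and the exact behavior of `quantize_dream_q8_0.py` are assumptions):
```python
import subprocess
from pathlib import Path

LLAMA_CPP = Path("/path/to/llama.cpp")  # adjust to your clone
MODEL_DIR = Path("/path/to/Dream-Coder-v0-Instruct-7B")
OUT = Path("gguf_output")
OUT.mkdir(exist_ok=True)

f16_path = OUT / "dream-coder-7b-f16.gguf"
q8_path = OUT / "dream-coder-7b-q8_0.gguf"

# Step 1: HF safetensors -> F16 GGUF
subprocess.run(
    ["python", str(LLAMA_CPP / "convert_hf_to_gguf.py"), str(MODEL_DIR),
     "--outfile", str(f16_path), "--outtype", "f16"],
    check=True,
)

# Step 2: F16 GGUF -> Q8_0 GGUF
subprocess.run(
    [str(LLAMA_CPP / "build/bin/llama-quantize"),
     str(f16_path), str(q8_path), "Q8_0"],
    check=True,
)
```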
## Architecture Adaptation
### Dream-Coder Special Configuration Handling
This quantization script handles the following Dream-Coder-specific configuration (a hypothetical sketch of the config patching appears after this list):
1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (so the GGUF converter recognizes the checkpoint)
2. **Special Token IDs**:
- `mask_token_id`: 151666 (critical diffusion token)
- `bos_token_id`: 151665
- `eos_token_id`: 151643
- `pad_token_id`: 151643
3. **Model Parameters**:
- Vocabulary size: 152,064
- Hidden dimension: 3,584
- Attention heads: 28 (4 key-value heads)
- Layers: 28
- Context length: 32,768
4. **Diffusion Features**:
- Preserve `mask_token_id` metadata
- RoPE theta: 1,000,000.0
- Activation function: SiLU
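The following is a hypothetical illustration of the kind of config patching involved, not the actual contents of `quantize_dream_q8_0.py`: the `architectures` field is rewritten so the converter treats the checkpoint as a Llama-style model, while the diffusion-specific token IDs from this guide are preserved.
```python
import json
from pathlib import Path

# Hypothetical path; point at the downloaded checkpoint.
config_path = Path("/path/to/Dream-Coder-v0-Instruct-7B/config.json")
config = json.loads(config_path.read_text())

# Map the Dream architecture onto one the GGUF converter understands.
config["architectures"] = ["LlamaForCausalLM"]

# Diffusion-specific token IDs that must survive conversion
# (values taken from this guide).
assert config.get("mask_token_id", 151666) == 151666
config.setdefault("bos_token_id", 151665)
config.setdefault("eos_token_id", 151643)
config.setdefault("pad_token_id", 151643)

config_path.write_text(json.dumps(config, indent=2))
```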
## Output Description
### File Structure
```
gguf_output/
├── dream-coder-7b-f16.gguf    # F16 intermediate file (kept only with --keep_f16)
└── dream-coder-7b-q8_0.gguf # Final Q8_0 quantized file
```
### Performance Expectations
| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory Usage | ~14GB | ~6.7GB |
| Inference Speed | 1.0x | 1.2-1.5x |
| Precision Loss | 0% | <0.1% |
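To turn these numbers into a memory budget, remember that the KV cache grows with context length on top of the ~6.7GB weight file. A back-of-the-envelope estimate using the model parameters above (28 layers, 4 KV heads, head dim 3584/28 = 128, F16 cache):
```python
# Rough KV-cache size estimate at a given context length.
n_layers, n_kv_heads, head_dim = 28, 4, 3584 // 28  # head_dim = 128
bytes_per_elem = 2  # F16 cache

def kv_cache_bytes(n_ctx: int) -> int:
    # K and V tensors, per layer, per token
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (2048, 4096, 32768):
    print(f"n_ctx={n_ctx:6d}: {kv_cache_bytes(n_ctx) / 2**20:7.1f} MiB")
```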
## Usage
### llama.cpp Command Line
Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:
```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    -c 2048 \
    --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "Write a binary search function" \
    -n 256 \
    -c 2048 \
    --temp 0.1 \
    --top-p 0.95 \
    --repeat-penalty 1.1 \
    --diffusion-steps 128 \
    --diffusion-algorithm 4 \
    --diffusion-alg-temp 0.0 \
    -t 8

# Visualize generation process
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def fibonacci(n):" \
    -n 256 \
    --diffusion-steps 64 \
    --diffusion-visual
```
#### Diffusion Parameter Description
- `--diffusion-steps N`: Diffusion denoising steps (default: 128; a timing sweep over step counts is sketched after this list)
- `--diffusion-algorithm N`: Algorithm selection:
- 0 = ORIGIN (original algorithm)
- 1 = ENTROPY_BASED (entropy-based)
- 2 = MARGIN_BASED (margin-based)
- 3 = RANDOM (random)
- 4 = LOW_CONFIDENCE (low confidence, default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode, show generation progress
- `--diffusion-eps F`: Time step epsilon value
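Since the step count trades quality against latency, a quick sweep can help you pick a value for your hardware. A hedged sketch using the CLI flags above (binary and model paths assumed):
```python
import subprocess
import time

CLI = "./llama.cpp/build/bin/llama-diffusion-cli"
MODEL = "gguf_output/dream-coder-7b-q8_0.gguf"

# Time the same prompt at increasing denoising step counts.
for steps in (32, 64, 128, 256):
    t0 = time.perf_counter()
    subprocess.run(
        [CLI, "-m", MODEL, "-p", "def quicksort(arr):",
         "-n", "256", "--diffusion-steps", str(steps)],
        capture_output=True, text=True, check=True,
    )
    print(f"{steps:4d} steps: {time.perf_counter() - t0:6.1f}s")
```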
### Python (llama-cpp-python)
Note that llama-cpp-python's high-level API performs standard autoregressive decoding and may not drive llama.cpp's diffusion sampler, so outputs can differ from `llama-diffusion-cli` (a subprocess-based alternative is sketched after this example).
```bash
pip install llama-cpp-python
```
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0,  # CPU inference; set >0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1,
)
print(output["choices"][0]["text"])
```
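If the binding does not expose the diffusion sampler, a pragmatic workaround is to shell out to `llama-diffusion-cli` from Python. A minimal sketch (CLI and model paths assumed):
```python
import subprocess

def diffusion_generate(prompt: str, n_tokens: int = 256, steps: int = 128) -> str:
    """Generate text by invoking llama-diffusion-cli as a subprocess."""
    result = subprocess.run(
        ["./llama.cpp/build/bin/llama-diffusion-cli",
         "-m", "gguf_output/dream-coder-7b-q8_0.gguf",
         "-p", prompt,
         "-n", str(n_tokens),
         "--diffusion-steps", str(steps),
         "--temp", "0.1"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(diffusion_generate("def fibonacci(n):"))
```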
### With GPU Acceleration
If compiled with CUDA support:
```bash
# Build with CUDA support (GGML_CUDA replaces the deprecated LLAMA_CUBLAS flag)
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Use GPU acceleration (partial layers)
./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    --diffusion-steps 128 \
    -ngl 20  # number of layers to offload to the GPU
```
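A rough way to choose `-ngl` is to divide the model file size by the layer count and see how many layers fit in your free VRAM. A hedged sketch using this guide's figures:
```python
# Rough heuristic for picking -ngl: assume layers are roughly equal in size.
model_size_gb = 6.7   # Q8_0 file size from the table above
n_layers = 28
vram_budget_gb = 5.0  # leave headroom for the KV cache and CUDA buffers

per_layer_gb = model_size_gb / n_layers
ngl = min(int(vram_budget_gb / per_layer_gb), n_layers)
print(f"~{per_layer_gb:.2f} GB/layer -> try -ngl {ngl}")
```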
## Troubleshooting
### Common Issues
1. **Conversion Failure**:
- Ensure llama.cpp is compiled correctly
- Check Python dependency versions
- Verify model file integrity
2. **Quantization Failure**:
- Check disk space (~20GB temporary space needed)
- Ensure sufficient memory (32GB+ recommended)
3. **Inference Errors**:
- Verify GGUF file integrity
- Check context length settings
- Try reducing `n_gpu_layers`
### Model Validation
```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf
# Simple inference test
echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -n 20 --diffusion-steps 64
```
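Beyond checking the file size, you can confirm that tokenizer metadata survived conversion using the `gguf` Python package that ships with llama.cpp (`pip install gguf`). A minimal sketch that lists token-related metadata keys:
```python
from gguf import GGUFReader

reader = GGUFReader("gguf_output/dream-coder-7b-q8_0.gguf")

# Print token-related metadata keys to confirm special-token handling carried over.
for name in reader.fields:
    if "token" in name:
        print(name)
```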
## Performance Optimization
### CPU Optimization
- Use `-t` parameter to set thread count
- Enable AVX2/AVX512 compilation options
- Adjust batch size (`-b` parameter)
### GPU Optimization
- Use CUDA/OpenCL compilation
- Adjust GPU layer count (`-ngl`)
- Monitor GPU memory usage
### Memory Optimization
- Memory mapping is enabled by default; use `--no-mmap` to disable it
- Use `--mlock` to pin the model in RAM and prevent swapping
- Set an appropriate context length (`-c`)
## Important Notes
1. **Diffusion Features**: Dream-Coder uses diffusion generation, different from traditional autoregressive models
2. **Dedicated Tool**: Must use `llama-diffusion-cli` rather than the regular `llama-cli` binary
3. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
4. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
5. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
6. **Diffusion Steps**: Recommend 64-128 steps, more steps may improve quality but increase inference time
## Technical Support
If you encounter issues, please check:
1. llama.cpp version and compilation status
2. Python dependency version compatibility
3. Model file integrity
4. System resources (memory/disk)
For more information, refer to:
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)