---
license: apache-2.0
tags:
- diffusion
- masked-diffusion
- dream
- qwen2
- gguf
- diffuse-cpp
base_model: Dream-org/Dream-v0-Instruct-7B
pipeline_tag: text-generation
---

# Dream-v0-Instruct-7B-GGUF

GGUF quantizations of [Dream-org/Dream-v0-Instruct-7B](https://huggingface.co/Dream-org/Dream-v0-Instruct-7B) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.

Dream is a masked diffusion language model based on the Qwen2.5-7B backbone with bidirectional attention and Grouped Query Attention (GQA, 28 query heads / 4 KV heads).

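Under GQA, each KV head serves a fixed group of query heads; with 28 query heads over 4 KV heads, each KV head is shared by a group of 7. A minimal illustrative sketch of that mapping (toy code, not diffuse-cpp internals):

```python
# Illustrative GQA head mapping (not diffuse-cpp code).
N_Q_HEADS = 28   # query heads, from the model config
N_KV_HEADS = 4   # KV heads
GROUP = N_Q_HEADS // N_KV_HEADS  # 7 query heads share each KV head

def kv_head_for(q_head: int) -> int:
    """Return the KV head that a given query head attends with."""
    return q_head // GROUP

# Query heads 0-6 map to KV head 0, 7-13 to KV head 1, and so on.
print([kv_head_for(q) for q in range(N_Q_HEADS)])
```

This 7:1 sharing is what shrinks the KV cache to 4/28 of a full multi-head layout.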
## Available Quantizations

| File | Type | Size | Description |
|------|------|------|-------------|
| `dream-7b-f16.gguf` | F16 | ~15 GB | Full precision, best quality |
| `dream-7b-q8_0.gguf` | Q8_0 | ~8.2 GB | 8-bit quantization, near-lossless |
| `dream-7b-q4km.gguf` | Q4_K_M | ~5.0 GB | 4-bit mixed quantization, best quality/size ratio |

**Recommended:** Q4_K_M for most users; Q8_0 if you have enough RAM and want minimal quality loss.

## Performance

Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, 12 threads, seed=42:

| Prompt | tok/s | Steps | vs llama.cpp |
|--------|-------|-------|--------------|
| Capital of France? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to Spanish | 13.2 | 10 | 1.6x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| Why sky blue? | 4.9 | 16 | 0.6x |
| List planets | 4.9 | 16 | 0.6x |
| Poem about ocean | 4.5 | 16 | 0.5x |
| **Average** | **11.6** | | **1.4x** |

- Easy prompts (factual, math): **14-22 tok/s** (1.6-2.5x faster than llama.cpp)
- Hard prompts (creative, long-form): **4.5-4.9 tok/s** (0.5-0.6x)
- llama.cpp baseline: 8.51 tok/s (Qwen2.5-7B-Instruct, Q4_K_M, same hardware)
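The step counts above reflect entropy_exit: denoising stops early once the model's predictions become near-deterministic, which is why short factual prompts finish in 2 steps while open-ended ones run all 16. A toy sketch of an entropy-based exit criterion (illustrative only; the actual diffuse-cpp criterion may differ):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_exit(step_probs, threshold=0.2):
    """Toy entropy_exit rule: stop denoising once every still-masked position's
    predicted distribution is near-deterministic (entropy below threshold)."""
    return all(entropy(p) < threshold for p in step_probs)

# Confident predictions (easy, factual prompt) -> exit after few steps.
print(should_exit([[0.99, 0.01], [0.98, 0.02]]))  # True
# Uncertain predictions (open-ended prompt) -> keep denoising.
print(should_exit([[0.5, 0.5], [0.6, 0.4]]))      # False
```

Fewer denoising steps per generated token is what lifts easy prompts above the autoregressive baseline.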

## Usage

```bash
# Download
huggingface-cli download diffuse-cpp/Dream-v0-Instruct-7B-GGUF dream-7b-q4km.gguf

# Run (requires diffuse-cpp v0.2.0+)
./diffuse-cli -m dream-7b-q4km.gguf -p "What is the capital of France?" -n 64 -s 16
```

## Model Details

- **Architecture:** Qwen2.5-7B backbone with bidirectional attention
- **Parameters:** 7.62B
- **Layers:** 28
- **Hidden size:** 3584
- **Attention:** GQA (28 query heads, 4 KV heads, head dim 128)
- **FFN:** SwiGLU, intermediate size 18944
- **Vocabulary:** 152,064 tokens
- **RoPE theta:** 1,000,000
- **Mask token ID:** 151666
- **Training:** Masked diffusion on text, with autoregressive logit shift
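The autoregressive logit shift means the logits produced at position i are trained to predict the token at position i+1, as in a causal LM, even though attention is bidirectional. A toy sketch of that alignment (an illustrative reading of the bullet above, not Dream's actual code):

```python
def shifted_predictions(logits):
    """Row i of the logits predicts the token at position i + 1 (AR-style shift),
    so N logit rows yield predictions for positions 1..N."""
    argmax = lambda row: max(range(len(row)), key=row.__getitem__)
    return {pos + 1: argmax(row) for pos, row in enumerate(logits)}

# Three logit rows over a toy 2-token vocabulary.
rows = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
print(shifted_predictions(rows))  # {1: 1, 2: 0, 3: 1}
```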

## Conversion Details

Converted from SafeTensors using `convert-dream.py` from diffuse-cpp:

- 339 tensors total (255 weights + 84 QKV biases)
- QKV biases kept at F32 in all quantizations
- Edge layers (first and last) quantized to Q6_K in the Q4_K_M scheme
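The per-tensor rules above can be sketched as a small selection function (hypothetical tensor names and return labels, not the actual `convert-dream.py` logic):

```python
def pick_quant_type(name: str, layer: int, n_layers: int = 28) -> str:
    """Hypothetical sketch of per-tensor type selection in the Q4_K_M scheme."""
    # QKV biases are kept at F32 in every quantization scheme.
    if name.endswith(".bias") and any(p in name for p in ("q_proj", "k_proj", "v_proj")):
        return "F32"
    # Edge layers (first and last) get the higher-precision Q6_K type.
    if layer in (0, n_layers - 1):
        return "Q6_K"
    # Everything else uses the 4-bit mixed type.
    return "Q4_K"

print(pick_quant_type("blk.0.attn.q_proj.bias", layer=0))       # F32
print(pick_quant_type("blk.14.ffn.gate_proj.weight", layer=14)) # Q4_K
```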

## Citation

```bibtex
@misc{dream2025,
  title={Dream 7B - Scalable Discrete Denoising Diffusion Models for Text Generation},
  author={Ye, Jiacheng and others},
  year={2025}
}
```

## License

Apache 2.0, following the original Dream model license.