---
license: apache-2.0
tags:
- diffusion
- masked-diffusion
- dream
- qwen2
- gguf
- diffuse-cpp
base_model: Dream-org/Dream-v0-Instruct-7B
pipeline_tag: text-generation
---

# Dream-v0-Instruct-7B-GGUF

GGUF quantizations of [Dream-org/Dream-v0-Instruct-7B](https://huggingface.co/Dream-org/Dream-v0-Instruct-7B) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.

Dream is a masked diffusion language model based on the Qwen2.5-7B backbone with bidirectional attention and Grouped Query Attention (GQA, 28 query heads / 4 KV heads).
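Concretely, a masked diffusion model starts from a fully masked sequence and reveals tokens over a small number of denoising steps, rather than generating left-to-right. The following toy sketch shows the decoding loop shape only; the scorer, token values, and one-token-per-step schedule are illustrative stand-ins, not diffuse-cpp's actual sampler:

```python
import random

MASK = -1  # toy stand-in for Dream's mask token (id 151666 in the real vocab)

def toy_scorer(tokens):
    """Stand-in for the model: propose a (token, confidence) per masked slot.
    The real model scores every masked position at once with bidirectional attention."""
    rng = random.Random(42)
    return {i: (100 + i, rng.random()) for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length, steps):
    tokens = [MASK] * length
    for _ in range(steps):
        proposals = toy_scorer(tokens)
        if not proposals:
            break  # everything unmasked already: stop early
        # reveal the single most confident masked position this step
        best = max(proposals, key=lambda i: proposals[i][1])
        tokens[best] = proposals[best][0]
    return tokens
```

Because every position is scored in parallel each step, short factual answers can converge in very few steps, which is where the speedups in the benchmarks below come from.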

## Available Quantizations

| File | Type | Size | Description |
|------|------|------|-------------|
| `dream-7b-f16.gguf` | F16 | ~15 GB | Full precision, best quality |
| `dream-7b-q8_0.gguf` | Q8_0 | ~8.2 GB | 8-bit quantization, near-lossless |
| `dream-7b-q4km.gguf` | Q4_K_M | ~5.0 GB | 4-bit mixed quantization, best quality/size ratio |

**Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.

## Performance

Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, 12 threads, seed=42:

| Prompt | tok/s | Steps | vs llama.cpp |
|--------|-------|-------|-------------|
| Capital of France? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to Spanish | 13.2 | 10 | 1.6x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| Why sky blue? | 4.9 | 16 | 0.6x |
| List planets | 4.9 | 16 | 0.6x |
| Poem about ocean | 4.5 | 16 | 0.5x |
| **Average** | **11.6** | | **1.4x** |

- Easy prompts (factual, math): **14-22 tok/s** (1.6-2.5x faster than llama.cpp)
- Hard prompts (creative, long-form): **4.5-4.9 tok/s** (0.5-0.6x)
- llama.cpp baseline: 8.51 tok/s (Qwen2.5-7B-Instruct, Q4_K_M, same hardware)
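The summary figures follow from the per-prompt table; a quick arithmetic cross-check (the tolerances only account for the rounding in the reported values):

```python
# Per-prompt tok/s from the benchmark table above
toks = [21.6, 14.3, 21.6, 13.2, 8.2, 4.9, 4.9, 4.5]
avg = sum(toks) / len(toks)   # 11.65, reported as 11.6
baseline = 8.51               # llama.cpp tok/s on the same hardware
speedup = avg / baseline      # ~1.37, reported as 1.4x

assert abs(avg - 11.6) < 0.1
assert abs(speedup - 1.4) < 0.05
```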

## Usage

```bash
# Download
huggingface-cli download diffuse-cpp/Dream-v0-Instruct-7B-GGUF dream-7b-q4km.gguf

# Run (requires diffuse-cpp v0.2.0+)
./diffuse-cli -m dream-7b-q4km.gguf -p "What is the capital of France?" -n 64 -s 16
```

## Model Details

- **Architecture:** Qwen2.5-7B backbone with bidirectional attention
- **Parameters:** 7.62B
- **Layers:** 28
- **Hidden size:** 3584
- **Attention:** GQA (28 query heads, 4 KV heads, head dim 128)
- **FFN:** SwiGLU, intermediate size 18944
- **Vocabulary:** 152,064 tokens
- **RoPE theta:** 1,000,000
- **Mask token ID:** 151666
- **Training:** Masked diffusion on text, with autoregressive logit shift
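The attention geometry listed above is internally consistent; a quick check using only the numbers from the list:

```python
# Sanity-check the GQA shapes from the model details.
hidden_size = 3584
n_q_heads, n_kv_heads, head_dim = 28, 4, 128

assert n_q_heads * head_dim == hidden_size  # query projection maps 3584 -> 3584
kv_dim = n_kv_heads * head_dim              # KV projections map 3584 -> 512
group_size = n_q_heads // n_kv_heads        # 7 query heads share each KV head
```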

## Conversion Details

Converted from SafeTensors using `convert-dream.py` from diffuse-cpp:
- 339 tensors total (255 weights + 84 QKV biases)
- QKV biases kept at F32 in all quantizations
- Edge layers (first/last) quantized to Q6_K in Q4_K_M scheme
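The tensor counts follow from the 28-layer layout, since Qwen2-style blocks carry a bias on each of the Q, K, and V projections:

```python
layers = 28
qkv_biases = layers * 3         # one bias each for q_proj, k_proj, v_proj
assert qkv_biases == 84
assert 255 + qkv_biases == 339  # weight tensors + biases = total tensor count
```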

## Citation

```bibtex
@misc{dream2025,
  title={Dream 7B - Scalable Discrete Denoising Diffusion Models for Text Generation},
  author={Ye, Jiacheng and others},
  year={2025}
}
```

## License

Apache 2.0, following the original Dream model license.