codelion committed on
Commit 7d2bb1c · verified · 1 Parent(s): 7f03436

Initial upload of Dhara-70M diffusion language model

README.md ADDED
@@ -0,0 +1,254 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - text-generation
+ - diffusion
+ - language-model
+ - causal-lm
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ - allenai/dolma
+ - mlfoundations/dclm-baseline-1.0
+ model-index:
+ - name: dhara-70m
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: HellaSwag
+       type: hellaswag
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 25.58
+   - task:
+       type: text-generation
+     dataset:
+       name: PIQA
+       type: piqa
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 51.58
+   - task:
+       type: text-generation
+     dataset:
+       name: WinoGrande
+       type: winogrande
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 49.64
+   - task:
+       type: text-generation
+     dataset:
+       name: ARC-Challenge
+       type: arc_challenge
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 24.83
+   - task:
+       type: text-generation
+     dataset:
+       name: MMLU
+       type: mmlu
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 23.85
+   - task:
+       type: text-generation
+     dataset:
+       name: TruthfulQA
+       type: truthfulqa_mc2
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 47.50
+ ---
+
+ # Dhara-70M
+
+ A 70M-parameter diffusion language model optimized for high-throughput text generation, with stronger factuality (TruthfulQA) than a comparable autoregressive baseline.
+
+ ## Table of Contents
+ - [Model Description](#model-description)
+ - [Training Data](#training-data)
+ - [Training Details](#training-details)
+ - [Benchmark Results](#benchmark-results)
+ - [Usage](#usage)
+ - [Key Insights](#key-insights)
+ - [Limitations](#limitations)
+ - [Citation](#citation)
+
+ ## Model Description
+
+ Dhara-70M is a diffusion language model that achieves:
+ - **3.8x higher throughput** than a comparable autoregressive model
+ - **Best-in-class factuality** on TruthfulQA (47.50%)
+ - **10x training efficiency** via WSD (Warmup-Stable-Decay) conversion of an autoregressive checkpoint
+
+ ### Architecture
+
+ | Specification | Value |
+ |---------------|-------|
+ | **Parameters** | 71.34M |
+ | **Layers** | 32 |
+ | **Hidden Size** | 384 |
+ | **FF Dimension** | 1024 |
+ | **Attention Heads** | 8 |
+ | **KV Heads** | 4 (GQA) |
+ | **Context Length** | 2048 tokens |
+ | **Position Encoding** | RoPE |
+ | **Normalization** | RMSNorm |
+ | **Special Layers** | Canon (depthwise causal convolutions) |
+ | **Generation Type** | Diffusion (parallel token generation) |
+
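+ The exact hyperparameters shipped with the checkpoint live in `config.json`. A minimal sketch for inspecting them, assuming `trust_remote_code=True` is needed because the custom `dhara_canon` model type is not a stock `transformers` architecture:
+
+ ```python
+ from transformers import AutoConfig
+
+ # Load the released config; trust_remote_code is an assumption for the custom architecture
+ config = AutoConfig.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
+ for field in ("num_hidden_layers", "hidden_size", "intermediate_size",
+               "num_attention_heads", "num_key_value_heads",
+               "max_position_embeddings", "num_diffusion_steps"):
+     print(field, "=", getattr(config, field))
+ ```
+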
+ ## Training Data
+
+ Dhara was trained in two stages:
+
+ **Stage 1: AR Pretraining (1B tokens)**
+ - 40% FinePDFs (400M tokens)
+ - 30% DCLM Baseline (300M tokens)
+ - 30% FineWeb-Edu (300M tokens)
+
+ **Stage 2: WSD Conversion (100M tokens)**
+ - Progressive block size warmup (1→4→32→64→1024)
+ - MDLM diffusion objective (a simplified sketch of this loss follows below)
+
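+ The training code itself is not part of this upload; the following is only a simplified sketch of an MDLM-style masked-diffusion objective, assuming the `mask_token_id` (50256) and `mask_epsilon` values from `config.json` and a plain 1/t loss weighting:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ MASK_TOKEN_ID = 50256   # mask_token_id in config.json
+ MASK_EPSILON = 1e-3     # mask_epsilon in config.json; keeps the mask ratio away from zero
+
+ def masked_diffusion_loss(model, input_ids):
+     """Corrupt a random fraction of tokens with the mask token and score the
+     model only on the corrupted positions (simplified MDLM-style objective)."""
+     # Per-sequence mask ratio t in (mask_epsilon, 1]
+     t = torch.rand(input_ids.size(0), 1, device=input_ids.device).clamp(min=MASK_EPSILON)
+     is_masked = torch.rand(input_ids.shape, device=input_ids.device) < t
+     corrupted = torch.where(is_masked, torch.full_like(input_ids, MASK_TOKEN_ID), input_ids)
+     logits = model(input_ids=corrupted).logits
+     ce = F.cross_entropy(logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none")
+     ce = ce.view_as(input_ids).float()
+     # 1/t weighting so lightly-corrupted sequences do not dominate the average
+     loss = (ce * is_masked / t).sum() / is_masked.sum().clamp(min=1)
+     return loss
+ ```
+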
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | **AR Training Tokens** | 1 billion |
+ | **WSD Conversion Tokens** | 100 million |
+ | **Batch Size** | 128 effective (8 × 16 gradient accumulation) |
+ | **Learning Rate** | 5e-4 (AR) / 5e-5 (WSD) |
+ | **Optimizer** | AdamW |
+ | **Schedule** | Cosine decay with 2% warmup |
+ | **Precision** | BF16 |
+ | **Hardware** | Single NVIDIA A40 GPU |
+ | **Total Training Time** | ~20 hours |
+
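+ A minimal sketch of the cosine-decay-with-warmup schedule described above, assuming warmup covers 2% of total optimizer steps and the rate decays toward zero (peak LR 5e-4 for the AR stage, 5e-5 for the WSD stage):
+
+ ```python
+ import math
+
+ def lr_at(step, total_steps, peak_lr=5e-4, warmup_frac=0.02, min_lr=0.0):
+     """Linear warmup for the first warmup_frac of steps, then cosine decay."""
+     warmup_steps = max(1, int(warmup_frac * total_steps))
+     if step < warmup_steps:
+         return peak_lr * step / warmup_steps
+     progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+     return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
+ ```
+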
+ ## Benchmark Results
+
+ | Benchmark | Dhara-70M | GPT-2-70M | vs GPT-2 |
+ |-----------|-----------|-----------|----------|
+ | HellaSwag (0-shot) | 25.58% | 26.46% | -0.88% |
+ | PIQA (0-shot) | 51.58% | 58.05% | -6.47% |
+ | WinoGrande (0-shot) | 49.64% | 52.64% | -3.00% |
+ | ARC-Challenge (0-shot) | **24.83%** | 22.27% | **+2.56%** |
+ | MMLU (5-shot) | 23.85% | 25.77% | -1.92% |
+ | TruthfulQA (0-shot) | **47.50%** | 45.83% | **+1.67%** |
+ | GSM8K (5-shot) | 0.00% | 1.21% | -1.21% |
+ | **Average** | **31.85%** | **33.18%** | -1.33% |
+
+ ### Inference Performance
+
+ | Metric | Dhara-70M | GPT-2-70M | vs GPT-2 |
+ |--------|-----------|-----------|----------|
+ | Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
+ | Throughput | 183.5 tok/s | ~48 tok/s | **3.8x faster** |
+ | Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
+
+ A rough way to check throughput on your own hardware is sketched after the batch generation example below.
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load model and tokenizer
+ # (trust_remote_code=True may be required, since the model uses the custom "dhara_canon" architecture)
+ tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
+ model = AutoModelForCausalLM.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
+
+ # Generate text
+ inputs = tokenizer("The future of AI is", return_tensors="pt")
+ outputs = model.generate(
+     **inputs,
+     max_length=50,
+     do_sample=True,
+     temperature=0.8,
+     top_p=0.9,
+     pad_token_id=tokenizer.eos_token_id
+ )
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ### Batch Generation (High Throughput)
+
+ ```python
+ # For batch generation, use larger batch sizes
+ prompts = [
+     "The future of AI is",
+     "In recent years, machine learning has",
+     "The most important discovery in physics was",
+     "Climate change affects our planet by"
+ ]
+
+ inputs = tokenizer(prompts, return_tensors="pt", padding=True)
+ outputs = model.generate(
+     **inputs,
+     max_length=100,
+     do_sample=True,
+     temperature=0.7,
+     num_diffusion_steps=10  # Fewer steps = faster generation
+ )
+
+ for i, output in enumerate(outputs):
+     print(f"Prompt {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
+ ```
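+
+ The throughput numbers in the Inference Performance table above depend heavily on hardware, batch size, and generation settings. A rough, hedged timing sketch for measuring tokens/second locally, reusing the `model` and `tokenizer` loaded in the examples above:
+
+ ```python
+ import time
+
+ bench_prompts = ["The future of AI is"] * 8      # small batch to exercise parallel generation
+ batch = tokenizer(bench_prompts, return_tensors="pt", padding=True)
+
+ start = time.perf_counter()
+ out = model.generate(**batch, max_new_tokens=100, do_sample=True, temperature=0.8)
+ elapsed = time.perf_counter() - start
+
+ # Rough count: new tokens per sequence times batch size (padding makes this approximate)
+ new_tokens = (out.shape[1] - batch["input_ids"].shape[1]) * out.shape[0]
+ print(f"~{new_tokens / elapsed:.1f} tokens/sec")
+ ```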
+
+ ## Key Insights
+
+ 1. **Throughput vs Accuracy Trade-off**: Dhara trades 1.33 percentage points of average accuracy for 3.8x higher generation throughput, making it well suited to batch processing tasks.
+
+ 2. **Superior Factuality**: Dhara outperforms GPT-2-70M on TruthfulQA (+1.67 points), suggesting diffusion models may reduce hallucinations through bidirectional context.
+
+ 3. **Reasoning Advantage**: The +2.56-point gain on ARC-Challenge suggests a relative edge over the autoregressive baseline on this reasoning benchmark.
+
+ 4. **WSD Efficiency**: Converting an AR model to diffusion via WSD uses 10x fewer tokens than training a diffusion model from scratch to equivalent quality.
+
+ 5. **Canon Layers Help**: The depthwise causal convolutions (Canon layers) improve factuality and reasoning with only ~0.13% parameter overhead (see the sketch after this list).
+
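+ The Canon layers are not documented in detail on this card; below is only a sketch of a depthwise causal convolution with a residual connection, using the `canon_kernel=4`, `canon_bias=false`, and `canon_residual=true` values from `config.json`. The placement inside each transformer block and the meaning of `canon_set: "AC"` are assumptions here.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class CanonConv(nn.Module):
+     """Depthwise causal convolution over the sequence dimension, with residual."""
+     def __init__(self, hidden_size=384, kernel_size=4):
+         super().__init__()
+         self.kernel_size = kernel_size
+         # groups=hidden_size makes the convolution depthwise: one small filter per channel
+         self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
+                               groups=hidden_size, bias=False)
+
+     def forward(self, x):                          # x: (batch, seq, hidden)
+         h = x.transpose(1, 2)                      # (batch, hidden, seq)
+         h = F.pad(h, (self.kernel_size - 1, 0))    # left-pad only, so the conv stays causal
+         return x + self.conv(h).transpose(1, 2)    # residual connection
+
+ # Back-of-the-envelope overhead: 384 channels x 4 taps = 1,536 weights per conv.
+ # Two insertion points per block (if "AC" means two positions) x 32 layers is ~98K
+ # parameters, i.e. roughly 0.1% of the ~71M total -- consistent with the ~0.13% quoted above.
+ ```
+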
+ ## When to Use Dhara
+
+ **Choose Dhara when:**
+ - Batch generation throughput matters
+ - Factual accuracy is critical
+ - You have an existing AR checkpoint to convert
+
+ **Choose AR models when:**
+ - Interactive latency is critical
+ - Sequential reasoning is important (math, coding)
+ - Memory is constrained
+
+ ## Limitations
+
+ - Lower performance on sequential reasoning tasks (GSM8K: 0.00%)
+ - Higher memory usage due to bidirectional attention
+ - Slightly higher time-to-first-token latency
+ - Best suited for batch rather than interactive use cases
+
+ ## Citation
+
+ ```bibtex
+ @article{sharma2025dhara,
+   title={Dhara: Optimal Architecture for Efficient Diffusion Language Models},
+   author={Sharma, Asankhaya},
+   year={2025},
+   url={https://huggingface.co/codelion/dhara-70m}
+ }
+ ```
+
+ ## Related Work
+
+ - [Width vs Depth: The Optimal Architecture for Small Language Models](https://huggingface.co/blog/codelion/optimal-architecture) - Blog post describing this work
+ - [The 1 Billion Token Challenge: Optimal Dataset Mixing](https://huggingface.co/blog/codelion/optimal-dataset-mixing) - Our previous work on optimal pretraining data
+ - [GPT-2-70M](https://huggingface.co/codelion/gpt-2-70m) - Our previous model from optimal pretraining experiments
+
+ ## Contact
+
+ For questions or feedback, please open an issue on the [Hugging Face model page](https://huggingface.co/codelion/dhara-70m).
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "architectures": [
+     "DharaCanonForMaskedDiffusion"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "canon_activation": false,
+   "canon_bias": false,
+   "canon_kernel": 4,
+   "canon_residual": true,
+   "canon_set": "AC",
+   "eos_token_id": 2,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1024,
+   "mask_epsilon": 0.001,
+   "mask_token_id": 50256,
+   "max_position_embeddings": 2048,
+   "model_type": "dhara_canon",
+   "num_attention_heads": 6,
+   "num_diffusion_steps": 1000,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 6,
+   "pad_token_id": 0,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.2",
+   "use_cache": false,
+   "use_flash_attention": false,
+   "use_xformers": false,
+   "vocab_size": 50257
+ }
generation_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "transformers_version": "4.55.2",
+   "use_cache": false
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:138820db3e8e59ed037924f14f9739ca9667e406465fc236fa9765691386f5fc
+ size 304219496
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "50256": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": {},
+   "model_max_length": 1024,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff