codelion commited on
Commit
95af59a
·
verified ·
1 Parent(s): d20925d

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +240 -372
README.md CHANGED
@@ -1,410 +1,278 @@
1
- # Dhara: Masked Diffusion Language Models
2
-
3
- [![Python 3.8+](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/downloads/release/python-380/)
4
- [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
5
- [![HuggingFace](https://img.shields.io/badge/🤗-Transformers-yellow.svg)](https://huggingface.co/transformers/)
6
- [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
7
-
8
- **Dhara** is a state-of-the-art implementation of Masked Diffusion Models (MDM) for language generation, based on the paper ["Diffusion Beats Autoregressive in Data-Constrained Settings"](https://arxiv.org/abs/2410.16686). This implementation provides full **HuggingFace integration**, multiple model sizes, and paper-accurate training procedures.
9
-
10
- ## 🌟 Features
11
-
12
- - 🚀 **Full HuggingFace Integration**: Works directly with `AutoModel.from_pretrained()`
13
- - 🎯 **Multiple Model Sizes**: Support for 135M and 600M parameter models
14
- - 🔄 **Bidirectional Attention**: Enables parallel token generation
15
- - **Optimized Inference**: Multiple generation strategies with configurable steps
16
- - 📊 **Comprehensive Evaluation**: Built-in benchmarking on 9 standard tasks
17
- - 🧠 **Paper-Accurate Implementation**: Exact replication of the original research
18
- - 🛠 **Production Ready**: Modular, extensible, and well-documented codebase
19
-
20
- ## 📋 Quick Start
21
-
22
- ### Installation
23
-
24
- #### For Training and Inference
25
- ```bash
26
- pip install -r requirements.txt
27
- ```
28
-
29
- #### For Evaluation (includes lm-eval harness)
30
- ```bash
31
- pip install -r requirements-eval.txt
32
- ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
 
34
- ### Direct HuggingFace Usage
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
35
 
36
  ```python
37
- from transformers import AutoModel, AutoTokenizer
 
38
 
39
  # Load model and tokenizer
40
- model = AutoModel.from_pretrained(
41
- "your-org/dhara-135m",
42
- trust_remote_code=True
43
- )
44
- tokenizer = AutoTokenizer.from_pretrained(
45
- "your-org/dhara-135m",
46
- trust_remote_code=True
47
  )
48
 
49
- # Generate text
50
- inputs = tokenizer("The future of AI is", return_tensors="pt")
51
- outputs = model.generate(**inputs, max_new_tokens=50, num_diffusion_steps=10)
52
- print(tokenizer.decode(outputs[0]))
53
- ```
54
-
55
- ### Custom Loading
56
-
57
- ```python
58
- from dhara import DharaForMaskedDiffusion, DharaTokenizer, DharaConfig
59
-
60
- # Load with custom config
61
- config = DharaConfig(model_size="dhara-135m")
62
- model = DharaForMaskedDiffusion(config)
63
- tokenizer = DharaTokenizer(model_size="dhara-135m")
64
-
65
- # Generate with diffusion
66
- text = "The future of artificial intelligence"
67
- inputs = tokenizer(text, return_tensors="pt")
68
- outputs = model.generate(**inputs, max_new_tokens=100, num_diffusion_steps=20)
69
- generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
70
- print(generated_text)
71
- ```
72
 
73
- ## 🏗️ Model Architecture
74
-
75
- ### Available Models
76
-
77
- | Model | Parameters | Base Architecture | Vocabulary | Max Length | Context |
78
- |-------|------------|-------------------|------------|------------|---------|
79
- | **dhara-135m** | 135M | SmolLM2-135M | 49,153 | 8,192 | Compact, efficient |
80
- | **dhara-600m** | 600M | Qwen3-0.6B | 151,937 | 32,768 | High capacity |
81
-
82
- ### Key Architectural Features
83
-
84
- - **Bidirectional Attention**: Unlike causal models, Dhara can attend to future tokens
85
- - **Masked Diffusion**: Uses `[MASK]` tokens instead of random noise for training
86
- - **Unified Implementation**: Single codebase supports multiple model sizes
87
- - **HF Compatible**: Standard transformer architecture with diffusion adaptations
88
-
89
- ### Technical Specifications
90
-
91
- #### Dhara-135M
92
- - **Architecture**: Based on SmolLM2-135M
93
- - **Layers**: 30 transformer layers
94
- - **Attention Heads**: 9 (with 3 key-value heads using GQA)
95
- - **Hidden Size**: 576
96
- - **Intermediate Size**: 1,536
97
- - **Position Embeddings**: RoPE (θ=10,000)
98
- - **Normalization**: RMSNorm (ε=1e-5)
99
- - **Activation**: SiLU (Swish)
100
-
101
- #### Dhara-600M
102
- - **Architecture**: Based on Qwen3-0.6B
103
- - **Layers**: 28 transformer layers
104
- - **Attention Heads**: 16 (with 8 key-value heads using GQA)
105
- - **Hidden Size**: 1,024
106
- - **Intermediate Size**: 3,072
107
- - **Position Embeddings**: RoPE (θ=1,000,000)
108
- - **Normalization**: RMSNorm (ε=1e-6)
109
- - **Activation**: SiLU (Swish)
110
-
111
- ## 🚂 Training
112
-
113
- ### Quick Training
114
-
115
- Train Dhara-135M:
116
- ```bash
117
- python train_dhara.py \
118
- --model_size dhara-135m \
119
- --dataset_name codelion/dclm-baseline-100M \
120
- --num_epochs 100 \
121
- --batch_size 8 \
122
- --gradient_accumulation_steps 16 \
123
- --learning_rate 2e-4 \
124
- --use_flash_attention \
125
- --bf16 \
126
- --save_every_epoch \
127
- --use_wandb
128
- ```
129
-
130
- Train Dhara-600M:
131
- ```bash
132
- python train_dhara.py \
133
- --model_size dhara-600m \
134
- --dataset_name codelion/dclm-baseline-100M \
135
- --num_epochs 100 \
136
- --batch_size 4 \
137
- --gradient_accumulation_steps 32 \
138
- --learning_rate 2e-4 \
139
- --gradient_checkpointing \
140
- --bf16
141
- ```
142
-
143
- ### Advanced Training Options
144
-
145
- ```bash
146
- python train_dhara.py \
147
- --model_size dhara-135m \
148
- --dataset_name your_dataset \
149
- --num_epochs 50 \
150
- --batch_size 8 \
151
- --gradient_accumulation_steps 16 \
152
- --max_length 4096 \
153
- --learning_rate 2e-4 \
154
- --warmup_steps 5000 \
155
- --weight_decay 0.01 \
156
- --use_flash_attention \
157
- --gradient_checkpointing \
158
- --use_8bit_adam \
159
- --bf16 \
160
- --tf32 \
161
- --save_every_epoch \
162
- --eval_epochs 5 \
163
- --auto_resume \
164
- --use_wandb \
165
- --run_name my-dhara-experiment \
166
- --output_dir ./my_dhara_model
167
  ```
168
 
169
- ### Training Parameters
170
-
171
- | Parameter | Description | Default | Recommended |
172
- |-----------|-------------|---------|-------------|
173
- | `--model_size` | Model size to train | `dhara-135m` | `dhara-135m` or `dhara-600m` |
174
- | `--dataset_name` | HuggingFace dataset | `codelion/dclm-baseline-100M` | Any text dataset |
175
- | `--num_epochs` | Training epochs | 50 | 50-100 |
176
- | `--learning_rate` | Learning rate | 2e-4 | 2e-4 (optimal) |
177
- | `--batch_size` | Batch size per GPU | 8 | 4-16 depending on GPU |
178
- | `--gradient_accumulation_steps` | Gradient accumulation | 16 | 16-32 |
179
- | `--use_flash_attention` | Use Flash Attention 2 | False | True (for speed) |
180
- | `--gradient_checkpointing` | Memory optimization | False | True (for 600M) |
181
- | `--bf16` | Use bfloat16 precision | True | True (recommended) |
182
-
183
- ## 📊 Evaluation
184
-
185
- ### Quick Evaluation
186
-
187
- Run all benchmarks:
188
- ```bash
189
- ./benchmark_dhara.sh /path/to/checkpoint dhara-135m ./results
190
  ```
191
-
192
- Run specific benchmark:
193
- ```bash
194
- python eval_dhara.py \
195
- --checkpoint /path/to/checkpoint \
196
- --model_size dhara-135m \
197
- --task hellaswag \
198
- --batch_size 8
199
  ```
200
 
201
- ### Benchmark Tasks
202
-
203
- Dhara is evaluated on 9 standard language modeling benchmarks:
204
-
205
- 1. **HellaSwag** (0-shot) - Common sense reasoning
206
- 2. **ARC-Easy** (0-shot) - Grade school science questions
207
- 3. **ARC-Challenge** (0-shot) - More difficult science questions
208
- 4. **PIQA** (0-shot) - Physical reasoning
209
- 5. **MMLU** (5-shot) - Multitask language understanding
210
- 6. **CommonsenseQA** (0-shot) - Common sense Q&A
211
- 7. **TriviaQA** (5-shot) - Reading comprehension
212
- 8. **Winogrande** (0-shot) - Pronoun resolution
213
- 9. **GSM8K** (5-shot) - Grade school math
214
-
215
- ### Expected Performance
216
-
217
- Performance targets based on the original paper results:
218
-
219
- | Model | HellaSwag | ARC-E | PIQA | Average | Status |
220
- |-------|-----------|-------|------|---------|---------|
221
- | **Random Baseline** | 25.0% | 25.0% | 50.0% | 33.3% | Reference |
222
- | **Paper (100M tokens)** | 30.2% | 37.8% | 60.7% | 42.9% | Target |
223
- | **Dhara-135M** | TBD | TBD | TBD | TBD | In Progress |
224
- | **Dhara-600M** | TBD | TBD | TBD | TBD | Planned |
225
-
226
- ### Success Criteria
227
-
228
- - **🎯 Excellent**: Within 2% of paper's results
229
- - **✅ Good**: Within 5% of paper's results
230
- - **👍 Acceptable**: Beats random baseline by >10 points
231
- - **⚠️ Poor**: Beats random baseline by <5 points
232
-
233
- ## 🔬 Technical Details
234
-
235
- ### Masked Diffusion Process
236
-
237
- Dhara uses a novel **Masked Diffusion** approach instead of traditional autoregressive generation:
238
-
239
- 1. **Training**: Randomly mask tokens with `[MASK]` based on diffusion timestep
240
- 2. **Loss**: Compute cross-entropy only on masked positions with importance weighting
241
- 3. **Inference**: Iteratively unmask tokens based on model confidence
242
-
243
- ### Key Differences from Standard LLMs
244
-
245
- | Aspect | Autoregressive | Dhara (MDM) |
246
- |--------|---------------|-------------|
247
- | **Training Objective** | Next-token prediction | Masked token reconstruction |
248
- | **Attention** | Causal (left-to-right) | Bidirectional (all positions) |
249
- | **Generation** | Sequential | Parallel (configurable steps) |
250
- | **Context** | Left context only | Full bidirectional context |
251
- | **Speed** | Fixed (1 token/step) | Variable (multiple tokens/step) |
252
-
253
- ### Generation Strategies
254
-
255
- Dhara supports multiple generation strategies:
256
-
257
- - **MDM Parallel**: Update all masked tokens simultaneously (fastest)
258
- - **Confidence-based**: Update most confident tokens first (highest quality)
259
- - **Hybrid**: Combine parallel and confidence-based approaches
260
-
261
- ### Performance Optimizations
262
-
263
- - **Flash Attention 2**: 2-4x speedup on modern GPUs
264
- - **Gradient Checkpointing**: Reduce memory usage for large models
265
- - **Mixed Precision**: BF16/FP16 training support
266
- - **8-bit Optimizers**: Reduce optimizer memory usage
267
- - **Torch Compile**: JIT compilation for inference speedup
268
-
269
- ## 📁 File Structure
270
-
271
- ```
272
- dhara/
273
- ├── configuration_dhara.py # Model configurations
274
- ├── modeling_dhara.py # Core model implementation
275
- ├── tokenization_dhara.py # Custom tokenizer with [MASK]
276
- ├── train_dhara.py # Training script
277
- ├── eval_dhara.py # Evaluation wrapper
278
- ├── dhara_inference.py # Inference utilities
279
- ├── benchmark_dhara.sh # Benchmark script
280
- ├── __init__.py # HuggingFace registration
281
- ├── README.md # This file
282
- ├── requirements.txt # Core dependencies
283
- ├── requirements-eval.txt # Evaluation dependencies
284
- └── examples/ # Usage examples
285
- ```
286
-
287
- ## 🔧 Advanced Usage
288
-
289
- ### Custom Model Configuration
290
 
291
  ```python
292
- from dhara import DharaConfig, DharaForMaskedDiffusion
293
-
294
- # Create custom configuration
295
- config = DharaConfig(
296
- model_size="custom",
297
- vocab_size=50000,
298
- hidden_size=768,
299
- num_hidden_layers=12,
300
- num_attention_heads=12,
301
- max_position_embeddings=2048,
302
- use_flash_attention=True,
 
 
 
 
 
 
 
 
303
  )
304
 
305
- model = DharaForMaskedDiffusion(config)
306
- ```
307
-
308
- ### Fine-tuning
309
-
310
- ```python
311
- # Load pre-trained model
312
- model = DharaForMaskedDiffusion.from_pretrained("your-org/dhara-135m")
313
-
314
- # Fine-tune on your dataset
315
- trainer = Trainer(
316
- model=model,
317
- train_dataset=your_dataset,
318
- tokenizer=tokenizer,
319
- # ... other training arguments
320
- )
321
- trainer.train()
322
  ```
323
 
324
- ### Custom Generation
325
 
326
- ```python
327
- from dhara import DharaInference
328
-
329
- # Initialize inference engine
330
- inference = DharaInference("path/to/checkpoint")
331
-
332
- # Generate with custom parameters
333
- text = inference.generate(
334
- prompt="The future of AI",
335
- max_new_tokens=100,
336
- num_diffusion_steps=20,
337
- temperature=0.8,
338
- strategy="confidence" # or "parallel"
339
- )
340
- ```
341
 
342
- ### Distributed Training
343
 
344
- ```bash
345
- torchrun --nproc_per_node=4 train_dhara.py \
346
- --model_size dhara-600m \
347
- --dataset_name your_dataset \
348
- --batch_size 2 \
349
- --gradient_accumulation_steps 64
350
- ```
351
 
352
- ## 🤝 Contributing
353
 
354
- We welcome contributions! Please see our [contributing guidelines](CONTRIBUTING.md) for details.
355
 
356
- ### Development Setup
357
 
358
- ```bash
359
- git clone https://github.com/your-org/dhara.git
360
- cd dhara
361
- pip install -e .
362
- pip install -r requirements-dev.txt
363
- ```
364
 
365
- ### Running Tests
 
 
 
366
 
367
- ```bash
368
- pytest tests/
369
- python -m dhara # Test HF integration
370
- ```
371
 
372
- ## 📖 Citation
 
 
 
373
 
374
- If you use Dhara in your research, please cite the original paper:
375
 
376
  ```bibtex
377
- @article{ghosal2024diffusion,
378
- title={Diffusion Beats Autoregressive in Data-Constrained Settings},
379
- author={Ghosal, Samarth and others},
380
- journal={arXiv preprint arXiv:2410.16686},
381
- year={2024}
382
  }
383
  ```
384
 
385
- ## 📄 License
386
-
387
- This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
388
-
389
- ## 🙏 Acknowledgments
390
-
391
- - Original paper authors for the MDM methodology
392
- - HuggingFace team for the transformers library
393
- - SmolLM2 and Qwen3 teams for the base architectures
394
- - The open source community for valuable feedback
395
-
396
- ## 📞 Support
397
-
398
- - **Issues**: [GitHub Issues](https://github.com/your-org/dhara/issues)
399
- - **Discussions**: [GitHub Discussions](https://github.com/your-org/dhara/discussions)
400
- - **Email**: support@your-org.com
401
-
402
- ---
403
-
404
- <div align="center">
405
 
406
- **🌟 Star us on GitHub if you find Dhara useful! 🌟**
 
 
407
 
408
- Made with ❤️ by the Dhara team
409
 
410
- </div>
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ tags:
6
+ - text-generation
7
+ - diffusion
8
+ - language-model
9
+ - causal-lm
10
+ datasets:
11
+ - HuggingFaceFW/fineweb-edu
12
+ - allenai/dolma
13
+ - mlfoundations/dclm-baseline-1.0
14
+ model-index:
15
+ - name: dhara-70m
16
+ results:
17
+ - task:
18
+ type: text-generation
19
+ dataset:
20
+ name: HellaSwag
21
+ type: hellaswag
22
+ metrics:
23
+ - name: Accuracy
24
+ type: accuracy
25
+ value: 25.58
26
+ - task:
27
+ type: text-generation
28
+ dataset:
29
+ name: PIQA
30
+ type: piqa
31
+ metrics:
32
+ - name: Accuracy
33
+ type: accuracy
34
+ value: 51.58
35
+ - task:
36
+ type: text-generation
37
+ dataset:
38
+ name: WinoGrande
39
+ type: winogrande
40
+ metrics:
41
+ - name: Accuracy
42
+ type: accuracy
43
+ value: 49.64
44
+ - task:
45
+ type: text-generation
46
+ dataset:
47
+ name: ARC-Challenge
48
+ type: arc_challenge
49
+ metrics:
50
+ - name: Accuracy
51
+ type: accuracy
52
+ value: 24.83
53
+ - task:
54
+ type: text-generation
55
+ dataset:
56
+ name: MMLU
57
+ type: mmlu
58
+ metrics:
59
+ - name: Accuracy
60
+ type: accuracy
61
+ value: 23.85
62
+ - task:
63
+ type: text-generation
64
+ dataset:
65
+ name: TruthfulQA
66
+ type: truthfulqa_mc2
67
+ metrics:
68
+ - name: Accuracy
69
+ type: accuracy
70
+ value: 47.50
71
+ ---
72
 
73
+ # Dhara-70M
74
+
75
+ A 70M parameter diffusion language model optimized for high-throughput text generation with superior factuality.
76
+
77
+ ## Table of Contents
78
+ - [Model Description](#model-description)
79
+ - [Training Data](#training-data)
80
+ - [Training Details](#training-details)
81
+ - [Benchmark Results](#benchmark-results)
82
+ - [Usage](#usage)
83
+ - [Key Insights](#key-insights)
84
+ - [Limitations](#limitations)
85
+ - [Citation](#citation)
86
+
87
+ ## Model Description
88
+
89
+ Dhara-70M is a novel diffusion language model that achieves:
90
+ - **3.8x higher throughput** than autoregressive models
91
+ - **Best-in-class factuality** on TruthfulQA (47.50%)
92
+ - **10x training efficiency** via WSD (Warmup-Stable-Decay) conversion
93
+
94
+ ### Architecture
95
+
96
+ | Specification | Value |
97
+ |--------------|-------|
98
+ | **Parameters** | 71.34M |
99
+ | **Layers** | 32 |
100
+ | **Hidden Size** | 384 |
101
+ | **FF Dimension** | 1024 |
102
+ | **Attention Heads** | 8 |
103
+ | **KV Heads** | 4 (GQA) |
104
+ | **Context Length** | 2048 tokens |
105
+ | **Position Encoding** | RoPE |
106
+ | **Normalization** | RMSNorm |
107
+ | **Special Layers** | Canon (depthwise causal convolutions) |
108
+ | **Generation Type** | Diffusion (parallel token generation) |
109
+
110
+ ## Training Data
111
+
112
+ Dhara was trained in two stages:
113
+
114
+ **Stage 1: AR Pretraining (1B tokens)**
115
+ - 40% FinePDFs (400M tokens)
116
+ - 30% DCLM Baseline (300M tokens)
117
+ - 30% FineWeb-Edu (300M tokens)
118
+
119
+ **Stage 2: WSD Conversion (100M tokens)**
120
+ - Progressive block size warmup (1→4→32→64→1024)
121
+ - MDLM diffusion objective
122
+
123
+ ## Training Details
124
+
125
+ | Parameter | Value |
126
+ |-----------|-------|
127
+ | **AR Training Tokens** | 1 billion |
128
+ | **WSD Conversion Tokens** | 100 million |
129
+ | **Batch Size** | 128 effective (8 × 16 gradient accumulation) |
130
+ | **Learning Rate** | 5e-4 (AR) / 5e-5 (WSD) |
131
+ | **Optimizer** | AdamW |
132
+ | **Schedule** | Cosine decay with 2% warmup |
133
+ | **Precision** | BF16 |
134
+ | **Hardware** | Single NVIDIA A40 GPU |
135
+ | **Total Training Time** | ~20 hours |
136
+
137
+ ## Benchmark Results
138
+
139
+ | Benchmark | Dhara-70M | GPT-2-70M | vs GPT-2 |
140
+ |-----------|-----------|-----------|----------|
141
+ | HellaSwag (0-shot) | 25.58% | 26.46% | -0.88% |
142
+ | PIQA (0-shot) | 51.58% | 58.05% | -6.47% |
143
+ | WinoGrande (0-shot) | 49.64% | 52.64% | -3.00% |
144
+ | ARC-Challenge (0-shot) | **24.83%** | 22.27% | **+2.56%** |
145
+ | MMLU (5-shot) | 23.85% | 25.77% | -1.92% |
146
+ | TruthfulQA (0-shot) | **47.50%** | 45.83% | **+1.67%** |
147
+ | GSM8K (5-shot) | 0.00% | 1.21% | -1.21% |
148
+ | **Average** | **31.85%** | **33.18%** | -1.33% |
149
+
150
+ ### Inference Performance
151
+
152
+ | Metric | Dhara-70M | GPT-2-70M | Advantage |
153
+ |--------|-----------|-----------|-----------|
154
+ | Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
155
+ | Throughput | 183.5 tok/s | ~48 tok/s | **3.8x faster** |
156
+ | Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
157
+
158
+ ## Usage
159
 
160
  ```python
161
+ import torch
162
+ from transformers import AutoTokenizer, AutoModelForCausalLM
163
 
164
  # Load model and tokenizer
165
+ tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
166
+ model = AutoModelForCausalLM.from_pretrained(
167
+ "codelion/dhara-70m",
168
+ trust_remote_code=True,
169
+ torch_dtype=torch.bfloat16
 
 
170
  )
171
 
172
+ # Move to GPU if available
173
+ device = "cuda" if torch.cuda.is_available() else "cpu"
174
+ model = model.to(device)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
 
176
+ # Generate text
177
+ prompt = "The future of artificial intelligence is"
178
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
179
+ outputs = model.generate(
180
+ inputs.input_ids,
181
+ max_new_tokens=50,
182
+ temperature=0.1,
183
+ top_p=0.5,
184
+ top_k=5,
185
+ repetition_penalty=1.8,
186
+ do_sample=True,
187
+ pad_token_id=0
188
+ )
189
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
190
  ```
191
 
192
+ **Example Output:**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
193
  ```
194
+ The future of artificial intelligence is a big challenge.
195
+ This world has the potential to improve, but this time we have no other than "theworld."
196
+ The next generation will be more exciting and its very much important for our society's
197
+ abilityto develop its
 
 
 
 
198
  ```
199
 
200
+ ### Batch Generation (High Throughput)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
201
 
202
  ```python
203
+ # For batch generation, use larger batch sizes
204
+ prompts = [
205
+ "The future of artificial intelligence is",
206
+ "The human brain is capable of",
207
+ "Science has shown that",
208
+ "Technology continues to evolve"
209
+ ]
210
+
211
+ inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
212
+ outputs = model.generate(
213
+ inputs.input_ids,
214
+ attention_mask=inputs.attention_mask,
215
+ max_new_tokens=50,
216
+ temperature=0.1,
217
+ top_p=0.5,
218
+ top_k=5,
219
+ repetition_penalty=1.8,
220
+ do_sample=True,
221
+ pad_token_id=0
222
  )
223
 
224
+ for i, output in enumerate(outputs):
225
+ print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
226
  ```
227
 
228
+ ## Key Insights
229
 
230
+ 1. **Throughput vs Accuracy Trade-off**: Dhara trades 1.33% average accuracy for 3.8x higher throughput, making it ideal for batch processing tasks.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
231
 
232
+ 2. **Superior Factuality**: Dhara excels on TruthfulQA (+1.67% vs GPT-2), suggesting diffusion models may reduce hallucinations through bidirectional context.
233
 
234
+ 3. **Reasoning Advantage**: ARC-Challenge +2.56% indicates strong performance on reasoning tasks.
 
 
 
 
 
 
235
 
236
+ 4. **WSD Efficiency**: Converting an AR model to diffusion via WSD uses 10x fewer tokens than training from scratch with equivalent quality.
237
 
238
+ 5. **Canon Layers Help**: The depthwise causal convolutions (Canon layers) improve factuality and reasoning with only 0.13% parameter overhead.
239
 
240
+ ## When to Use Dhara
241
 
242
+ **Choose Dhara when:**
243
+ - Batch generation throughput matters
244
+ - Factual accuracy is critical
245
+ - You have an existing AR checkpoint to convert
 
 
246
 
247
+ **Choose AR models when:**
248
+ - Interactive latency is critical
249
+ - Sequential reasoning is important (math, coding)
250
+ - Memory is constrained
251
 
252
+ ## Limitations
 
 
 
253
 
254
+ - Lower performance on sequential reasoning tasks (GSM8K: 0.00%)
255
+ - Higher memory usage due to bidirectional attention
256
+ - Slightly higher time-to-first-token latency
257
+ - Best suited for batch rather than interactive use cases
258
 
259
+ ## Citation
260
 
261
  ```bibtex
262
+ @article{sharma2025optimal,
263
+ title={The Optimal Architecture for Small Language Models},
264
+ author={Sharma, Asankhaya},
265
+ year={2025},
266
+ url={https://huggingface.co/blog/codelion/optimal-model-architecture}
267
  }
268
  ```
269
 
270
+ ## Related Work
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
271
 
272
+ - [The Optimal Architecture for Small Language Models](https://huggingface.co/blog/codelion/optimal-model-architecture) - Blog post describing this work
273
+ - [The 1 Billion Token Challenge: Optimal Dataset Mixing](https://huggingface.co/blog/codelion/optimal-dataset-mixing) - Our previous work on optimal pretraining data
274
+ - [GPT-2-70M](https://huggingface.co/codelion/gpt-2-70m) - Our previous model from optimal pretraining experiments
275
 
276
+ ## Contact
277
 
278
+ For questions or feedback, please open a discussion on the [Hugging Face discussions page](https://huggingface.co/codelion/dhara-70m/discussions).