---
license: apache-2.0
language:
- en
tags:
- text-generation
- diffusion
- language-model
- causal-lm
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: dhara-70m
  results:
  - task:
      type: text-generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - name: Accuracy
      type: accuracy
      value: 25.58
  - task:
      type: text-generation
    dataset:
      name: PIQA
      type: piqa
    metrics:
    - name: Accuracy
      type: accuracy
      value: 51.58
  - task:
      type: text-generation
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - name: Accuracy
      type: accuracy
      value: 49.64
  - task:
      type: text-generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - name: Accuracy
      type: accuracy
      value: 24.83
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.85
  - task:
      type: text-generation
    dataset:
      name: TruthfulQA
      type: truthfulqa_mc2
    metrics:
    - name: Accuracy
      type: accuracy
      value: 47.50
  - task:
      type: text-generation
    dataset:
      name: GSM8K
      type: gsm8k
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.00
  - task:
      type: text-generation
    dataset:
      name: Average
      type: average
    metrics:
    - name: Accuracy
      type: accuracy
      value: 31.85
---
|
|
|
|
|
# Dhara-70M |
|
|
|
|
|
A 70M-parameter diffusion language model optimized for high-throughput text generation, with stronger factuality than a size-matched autoregressive baseline.
|
|
|
|
|
## Table of Contents |
|
|
- [Model Description](#model-description) |
|
|
- [Training Data](#training-data) |
|
|
- [Training Details](#training-details) |
|
|
- [Benchmark Results](#benchmark-results) |
|
|
- [Usage](#usage) |
|
|
- [Key Insights](#key-insights) |
|
|
- [Limitations](#limitations) |
|
|
- [Citation](#citation)


- [Related Work](#related-work)


- [Contact](#contact)
|
|
|
|
|
## Model Description |
|
|
|
|
|
Dhara-70M is a novel diffusion language model that achieves:

- **3.8x higher throughput** than a size-matched autoregressive baseline (GPT-2-70M)
- **Stronger factuality** than that baseline on TruthfulQA (47.50% vs 45.83%)
- **~10x training efficiency** via WSD (Warmup-Stable-Decay) conversion of a pretrained AR model
|
|
|
|
|
### Architecture |
|
|
|
|
|
| Specification | Value |
|--------------|-------|
| **Parameters** | 71.34M |
| **Layers** | 32 |
| **Hidden Size** | 384 |
| **FF Dimension** | 1024 |
| **Attention Heads** | 8 |
| **KV Heads** | 4 (GQA) |
| **Context Length** | 2048 tokens |
| **Position Encoding** | RoPE |
| **Normalization** | RMSNorm |
| **Special Layers** | Canon (depthwise causal convolutions) |
| **Generation Type** | Diffusion (parallel token generation) |
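
As a rough cross-check of the parameter count, the dimensions above can be tallied by hand. The sketch below is a back-of-envelope estimate, not the model's actual code: it assumes a GPT-2-sized vocabulary (50,257), tied input/output embeddings, a gated (SwiGLU-style) feed-forward block, and no biases, and it ignores the small Canon-layer overhead.

```python
# Back-of-envelope parameter count from the table above.
# Assumptions (not confirmed by this card): vocab=50257, tied embeddings,
# gated SwiGLU-style FFN, no biases; Canon-layer overhead (~0.13%) ignored.
vocab, d, ff, layers = 50257, 384, 1024, 32
heads, kv_heads = 8, 4
head_dim = d // heads                       # 48
kv_dim = kv_heads * head_dim                # 192 (GQA: half as many K/V heads)

attn = d * d + 2 * d * kv_dim + d * d       # Wq, Wk, Wv, Wo
ffn = 3 * d * ff                            # gate, up, and down projections
norms = 2 * d                               # two RMSNorm weight vectors per layer

total = layers * (attn + ffn + norms) + vocab * d + d  # + tied embedding + final norm
print(f"{total / 1e6:.2f}M parameters")     # ~71.23M, close to the reported 71.34M
```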
|
|
|
|
|
## Training Data |
|
|
|
|
|
Dhara was trained in two stages: |
|
|
|
|
|
**Stage 1: AR Pretraining (1B tokens)** |
|
|
- 40% FinePDFs (400M tokens) |
|
|
- 30% DCLM Baseline (300M tokens) |
|
|
- 30% FineWeb-Edu (300M tokens) |
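
For reference, the Stage 1 mixture can be loaded with the `datasets` library's probabilistic interleaving; this is an illustrative sketch (the split name and seed are assumptions, and the actual training pipeline is not published here):

```python
from datasets import load_dataset, interleave_datasets

# Stream the three corpora and sample them at the Stage 1 ratios.
# split="train" and seed=42 are assumptions, not the published pipeline.
streams = [
    load_dataset("codelion/finepdfs-1B", split="train", streaming=True),
    load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True),
    load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True),
]
mixture = interleave_datasets(streams, probabilities=[0.4, 0.3, 0.3], seed=42)
```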
|
|
|
|
|
**Stage 2: WSD Conversion (100M tokens)** |
|
|
- Progressive block size warmup (1→4→32→64→1024) |
|
|
- MDLM diffusion objective |
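
The conversion code is not reproduced on this card, but both bullets can be sketched: the block size steps through the warmup schedule while the model trains on an MDLM-style masked-token objective (mask a random fraction of each sequence, predict the masked tokens, reweight by the mask rate). The phase lengths and the simplified 1/t loss weighting below are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

BLOCK_SCHEDULE = [1, 4, 32, 64, 1024]  # progressive block-size warmup

def block_size_at(step: int, total_steps: int) -> int:
    # Assumption: equal-length phases; the real schedule may differ.
    phase = min(step * len(BLOCK_SCHEDULE) // total_steps, len(BLOCK_SCHEDULE) - 1)
    return BLOCK_SCHEDULE[phase]

def mdlm_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    # MDLM-style objective, simplified: mask a random fraction t of tokens,
    # predict them, and weight by 1/t so all noise levels contribute.
    t = torch.rand(tokens.size(0), 1, device=tokens.device).clamp(min=1e-3)
    mask = torch.rand(tokens.shape, device=tokens.device) < t
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy).logits
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), tokens.view(-1), reduction="none"
    ).view_as(tokens).float()
    return (ce * mask / t).sum() / mask.sum().clamp(min=1)
```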
|
|
|
|
|
## Training Details |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| **AR Training Tokens** | 1 billion |
| **WSD Conversion Tokens** | 100 million |
| **Batch Size** | 128 effective (8 × 16 gradient accumulation) |
| **Learning Rate** | 5e-4 (AR) / 5e-5 (WSD) |
| **Optimizer** | AdamW |
| **Schedule** | Cosine decay with 2% warmup |
| **Precision** | BF16 |
| **Hardware** | Single NVIDIA A40 GPU |
| **Total Training Time** | ~20 hours |
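
The optimizer and schedule rows correspond to standard components. A minimal sketch using the Transformers scheduler helper (the step count is a placeholder, and `model` is assumed to be already constructed):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 10_000  # placeholder; actual count depends on tokens / batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # 5e-5 for the WSD stage
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.02 * total_steps),  # 2% warmup
    num_training_steps=total_steps,
)
```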
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
| Benchmark | Dhara-70M | GPT-2-70M | vs GPT-2 |
|-----------|-----------|-----------|----------|
| HellaSwag (0-shot) | 25.58% | 26.46% | -0.88% |
| PIQA (0-shot) | 51.58% | 58.05% | -6.47% |
| WinoGrande (0-shot) | 49.64% | 52.64% | -3.00% |
| ARC-Challenge (0-shot) | **24.83%** | 22.27% | **+2.56%** |
| MMLU (5-shot) | 23.85% | 25.77% | -1.92% |
| TruthfulQA (0-shot) | **47.50%** | 45.83% | **+1.67%** |
| GSM8K (5-shot) | 0.00% | 1.21% | -1.21% |
| **Average** | **31.85%** | **33.18%** | -1.33% |
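
This card does not state which evaluation stack produced these numbers; scores in this format are typically generated with EleutherAI's lm-evaluation-harness. A hedged sketch for the 0-shot rows (the 5-shot MMLU and GSM8K rows would use `num_fewshot=5`):

```python
import lm_eval

# Assumption: lm-evaluation-harness v0.4+; the exact harness and settings
# used for the table above are not confirmed by this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=codelion/dhara-70m,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "piqa", "winogrande", "arc_challenge", "truthfulqa_mc2"],
    num_fewshot=0,
)
print(results["results"])
```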
|
|
|
|
|
### Inference Performance |
|
|
|
|
|
| Metric | Dhara-70M | GPT-2-70M | vs GPT-2 |
|--------|-----------|-----------|----------|
| Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
| Throughput | 183.5 tok/s | ~48 tok/s | **3.8x faster** |
| Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
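
Throughput and latency depend heavily on batch size, sequence length, and hardware. A minimal timing sketch, with `model` and `tokenizer` as loaded in the Usage section below (the measurement protocol here is an assumption, not necessarily the one behind the table):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=256):
    # Rough single-prompt throughput on GPU; the table's protocol may differ.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(inputs.input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (out.shape[1] - inputs.input_ids.shape[1]) / elapsed

print(f"{tokens_per_second(model, tokenizer, 'The future of AI is'):.1f} tok/s")
```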
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-70m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
**Example Output:** |
|
|
```
The future of artificial intelligence is a big challenge.
This world has the potential to improve, but this time we have no other than "theworld."
The next generation will be more exciting and its very much important for our society's
abilityto develop its
```
|
|
|
|
|
### Batch Generation (High Throughput) |
|
|
|
|
|
```python
# For batch generation, use larger batch sizes
prompts = [
    "The future of artificial intelligence is",
    "The human brain is capable of",
    "Science has shown that",
    "Technology continues to evolve"
]

# padding=True requires a pad token; fall back to EOS if the tokenizer lacks one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)

for i, output in enumerate(outputs):
    print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
```
|
|
|
|
|
## Key Insights |
|
|
|
|
|
1. **Throughput vs Accuracy Trade-off**: Dhara trades 1.33 percentage points of average accuracy for 3.8x higher throughput, making it well suited to batch processing workloads.
|
|
|
|
|
2. **Superior Factuality**: Dhara excels on TruthfulQA (+1.67% vs GPT-2), suggesting diffusion models may reduce hallucinations through bidirectional context. |
|
|
|
|
|
3. **Reasoning Advantage**: A +2.56-point gain on ARC-Challenge suggests a relative edge over the GPT-2 baseline on multiple-choice reasoning, though both models remain near the 25% chance level on this task.
|
|
|
|
|
4. **WSD Efficiency**: Converting an AR model to diffusion via WSD uses 10x fewer tokens than training from scratch with equivalent quality. |
|
|
|
|
|
5. **Canon Layers Help**: The depthwise causal convolutions (Canon layers) improve factuality and reasoning with only 0.13% parameter overhead. |
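
This card describes Canon layers only as depthwise causal convolutions. The sketch below shows that general construction (a short, left-padded depthwise `Conv1d` mixing each position with its recent predecessors); the kernel size and placement within the block are illustrative assumptions, not Dhara's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseCausalConv(nn.Module):
    """Illustrative Canon-style layer; kernel size and placement are assumptions."""

    def __init__(self, hidden_size: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(
            hidden_size, hidden_size,
            kernel_size=kernel_size,
            groups=hidden_size,  # depthwise: one filter per channel
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); left-pad so position t only sees positions <= t
        x = x.transpose(1, 2)                    # (batch, hidden, seq)
        x = F.pad(x, (self.kernel_size - 1, 0))  # causal left padding
        return self.conv(x).transpose(1, 2)      # back to (batch, seq, hidden)
```

With hidden size 384 and a short kernel, such a layer adds only a few thousand parameters per block, consistent with the sub-percent overhead reported above.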
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Lower performance on sequential reasoning tasks (GSM8K: 0.00%) |
|
|
- Higher memory usage due to bidirectional attention |
|
|
- Slightly higher time-to-first-token latency |
|
|
- Best suited for batch rather than interactive use cases |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{sharma2025optimal,
  title={The Optimal Architecture for Small Language Models},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-model-architecture}
}
```
|
|
|
|
|
## Related Work |
|
|
|
|
|
- [The Optimal Architecture for Small Language Models](https://huggingface.co/blog/codelion/optimal-model-architecture) - Blog post describing this work |
|
|
- [The 1 Billion Token Challenge: Optimal Dataset Mixing](https://huggingface.co/blog/codelion/optimal-dataset-mixing) - Our previous work on optimal pretraining data |
|
|
- [GPT-2-70M](https://huggingface.co/codelion/gpt-2-70m) - Our previous model from optimal pretraining experiments |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or feedback, please open a discussion on the [Hugging Face discussions page](https://huggingface.co/codelion/dhara-70m/discussions). |
|
|
|