---
license: apache-2.0
language:
  - en
tags:
  - text-generation
  - diffusion
  - language-model
  - causal-lm
datasets:
  - codelion/finepdfs-1B
  - codelion/dclm-baseline-1B
  - codelion/fineweb-edu-1B
model-index:
  - name: dhara-70m
    results:
      - task:
          type: text-generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - name: Accuracy
            type: accuracy
            value: 25.58
      - task:
          type: text-generation
        dataset:
          name: PIQA
          type: piqa
        metrics:
          - name: Accuracy
            type: accuracy
            value: 51.58
      - task:
          type: text-generation
        dataset:
          name: WinoGrande
          type: winogrande
        metrics:
          - name: Accuracy
            type: accuracy
            value: 49.64
      - task:
          type: text-generation
        dataset:
          name: ARC-Challenge
          type: arc_challenge
        metrics:
          - name: Accuracy
            type: accuracy
            value: 24.83
      - task:
          type: text-generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - name: Accuracy
            type: accuracy
            value: 23.85
      - task:
          type: text-generation
        dataset:
          name: TruthfulQA
          type: truthfulqa_mc2
        metrics:
          - name: Accuracy
            type: accuracy
            value: 47.5
      - task:
          type: text-generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0
      - task:
          type: text-generation
        dataset:
          name: Average
          type: average
        metrics:
          - name: Accuracy
            type: accuracy
            value: 31.85
---

# Dhara-70M

A 70M parameter diffusion language model optimized for high-throughput text generation with superior factuality.

## Model Description

Dhara-70M is a diffusion language model that achieves:

- 3.8x higher throughput than a size-matched autoregressive GPT-2 baseline (183.5 vs ~48 tok/s)
- Higher factuality than GPT-2-70M on TruthfulQA (47.50% vs 45.83%)
- 10x training efficiency via WSD (Warmup-Stable-Decay) conversion of a pretrained AR model

## Architecture

| Specification | Value |
|---|---|
| Parameters | 71.34M |
| Layers | 32 |
| Hidden Size | 384 |
| FF Dimension | 1024 |
| Attention Heads | 8 |
| KV Heads | 4 (GQA) |
| Context Length | 2048 tokens |
| Position Encoding | RoPE |
| Normalization | RMSNorm |
| Special Layers | Canon (depthwise causal convolutions) |
| Generation Type | Diffusion (parallel token generation) |
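
With 8 query heads over 4 KV heads, two query heads share each key/value head. The snippet below is a minimal shape-level sketch of grouped-query attention at these dimensions; it is illustrative only and not the model's actual attention code.

```python
# Hypothetical GQA shape sketch at Dhara's dimensions (8 Q heads, 4 KV heads).
import torch
import torch.nn.functional as F

batch, seq_len = 2, 16
hidden, n_heads, n_kv_heads = 384, 8, 4
head_dim = hidden // n_heads  # 48

q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Each KV head serves n_heads // n_kv_heads = 2 query heads.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16, 48])
```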

## Training Data

Dhara was trained in two stages:

### Stage 1: AR Pretraining (1B tokens)

- 40% FinePDFs (400M tokens)
- 30% DCLM Baseline (300M tokens)
- 30% FineWeb-Edu (300M tokens)
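
A hedged sketch of assembling this mixture with the `datasets` library; only the dataset IDs and the 40/30/30 proportions come from the card, while the streaming and interleaving setup is an assumption:

```python
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

# Sample from the three corpora with the card's 40/30/30 proportions.
mixture = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.4, 0.3, 0.3],
    seed=42,
)
```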

### Stage 2: WSD Conversion (100M tokens)

- Progressive block size warmup (1→4→32→64→1024)
- MDLM diffusion objective
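
A minimal sketch of what such a progressive warmup schedule could look like; only the block sizes come from the card, and the equal-width stage boundaries over the 100M conversion tokens are an assumption:

```python
# Hypothetical block-size schedule for the WSD conversion stage.
BLOCK_SIZES = [1, 4, 32, 64, 1024]
TOTAL_TOKENS = 100_000_000

def block_size_at(tokens_seen: int) -> int:
    """Return the diffusion block size to use after `tokens_seen` tokens."""
    stage = min(
        int(tokens_seen / TOTAL_TOKENS * len(BLOCK_SIZES)),
        len(BLOCK_SIZES) - 1,
    )
    return BLOCK_SIZES[stage]

assert block_size_at(0) == 1
assert block_size_at(50_000_000) == 32
assert block_size_at(99_999_999) == 1024
```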

## Training Details

| Parameter | Value |
|---|---|
| AR Training Tokens | 1 billion |
| WSD Conversion Tokens | 100 million |
| Batch Size | 128 effective (8 × 16 gradient accumulation) |
| Learning Rate | 5e-4 (AR) / 5e-5 (WSD) |
| Optimizer | AdamW |
| Schedule | Cosine decay with 2% warmup |
| Precision | BF16 |
| Hardware | Single NVIDIA A40 GPU |
| Total Training Time | ~20 hours |
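
A minimal sketch of the warmup-plus-cosine schedule from the table, using the Stage 1 peak LR of 5e-4; the total step count and the final LR floor of 0 are illustrative assumptions:

```python
import math

PEAK_LR = 5e-4
TOTAL_STEPS = 10_000                      # illustrative; not stated on the card
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)    # 2% warmup

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS              # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(100), lr_at(TOTAL_STEPS // 2), lr_at(TOTAL_STEPS - 1))
```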

## Benchmark Results

| Benchmark | Dhara-70M | GPT-2-70M | vs GPT-2 |
|---|---|---|---|
| HellaSwag (0-shot) | 25.58% | 26.46% | -0.88% |
| PIQA (0-shot) | 51.58% | 58.05% | -6.47% |
| WinoGrande (0-shot) | 49.64% | 52.64% | -3.00% |
| ARC-Challenge (0-shot) | 24.83% | 22.27% | +2.56% |
| MMLU (5-shot) | 23.85% | 25.77% | -1.92% |
| TruthfulQA (0-shot) | 47.50% | 45.83% | +1.67% |
| GSM8K (5-shot) | 0.00% | 1.21% | -1.21% |
| Average | 31.85% | 33.18% | -1.33% |
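
A hedged sketch of reproducing the 0-shot rows with EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the harness version and exact settings behind the reported numbers are assumptions, and the 5-shot tasks (MMLU, GSM8K) would need `num_fewshot=5` in separate runs:

```python
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=codelion/dhara-70m,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "piqa", "winogrande", "arc_challenge", "truthfulqa_mc2"],
    num_fewshot=0,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```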

## Inference Performance

| Metric | Dhara-70M | GPT-2-70M | vs GPT-2 |
|---|---|---|---|
| Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
| Throughput | 183.5 tok/s | ~48 tok/s | 3.8x faster |
| Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
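
For context on the throughput figure, a minimal tokens-per-second measurement sketch follows; the benchmarking setup behind the reported numbers (batch size, prompt and output lengths, decoding settings) is not stated on the card, so these choices are illustrative:

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-70m", trust_remote_code=True, torch_dtype=torch.bfloat16
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("The future of artificial intelligence is",
                   return_tensors="pt").to(device)
if device == "cuda":
    torch.cuda.synchronize()  # make timing accurate on GPU
start = time.perf_counter()
outputs = model.generate(inputs.input_ids, max_new_tokens=128,
                         do_sample=False, pad_token_id=0)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```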

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-70m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Example Output:

```
The future of artificial intelligence is a big challenge.
This world has the potential to improve, but this time we have no other than "theworld."
The next generation will be more exciting and its very much important for our society's
abilityto develop its
```

### Batch Generation (High Throughput)

```python
# For batch generation, use larger batch sizes
prompts = [
    "The future of artificial intelligence is",
    "The human brain is capable of",
    "Science has shown that",
    "Technology continues to evolve"
]

# Padding requires a pad token; fall back to EOS if the tokenizer lacks one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)

for i, output in enumerate(outputs):
    print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
```

## Key Insights

1. Throughput vs. Accuracy Trade-off: Dhara trades 1.33 points of average accuracy for 3.8x higher throughput, making it well suited to batch processing tasks.

2. Superior Factuality: Dhara outperforms GPT-2 on TruthfulQA (+1.67 points), suggesting diffusion models may reduce hallucinations through bidirectional context.

3. Reasoning Advantage: +2.56 points on ARC-Challenge indicates relatively strong multiple-choice reasoning, even though sequential reasoning (GSM8K) remains weak.

4. WSD Efficiency: Converting an AR model to diffusion via WSD uses 10x fewer tokens than training a diffusion model from scratch to equivalent quality.

5. Canon Layers Help: The depthwise causal convolutions (Canon layers) improve factuality and reasoning with only 0.13% parameter overhead; a minimal sketch of such a layer follows this list.
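
The sketch below shows a depthwise causal convolution of the kind the card calls a Canon layer; the kernel size and its placement within the transformer block are assumptions, not the model's actual configuration:

```python
import torch
import torch.nn as nn

class DepthwiseCausalConv(nn.Module):
    def __init__(self, hidden_size: int = 384, kernel_size: int = 4):
        super().__init__()
        # groups=hidden_size makes the convolution depthwise (one filter
        # per channel); left-only padding keeps it causal.
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(
            hidden_size, hidden_size, kernel_size, groups=hidden_size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) -> convolve over the sequence dimension
        x = x.transpose(1, 2)                    # (batch, hidden, seq_len)
        x = nn.functional.pad(x, (self.pad, 0))  # pad only on the left
        return self.conv(x).transpose(1, 2)      # back to (batch, seq, hidden)

out = DepthwiseCausalConv()(torch.randn(2, 16, 384))
print(out.shape)  # torch.Size([2, 16, 384])
```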

## Limitations

- Lower performance on sequential reasoning tasks (GSM8K: 0.00%)
- Higher memory usage due to bidirectional attention
- Slightly higher time-to-first-token latency
- Best suited for batch rather than interactive use cases

## Citation

```bibtex
@article{sharma2025optimal,
  title={The Optimal Architecture for Small Language Models},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-model-architecture}
}
```

## Contact

For questions or feedback, please open a discussion on this model's Hugging Face page.