iLLaDA-8B-Base

iLLaDA is an 8B-parameter, fully bidirectional masked diffusion language model trained from scratch on 12T pre-training tokens. It supports an 8192-token context length and variable-length generation, and uses confidence-based scoring for multiple-choice evaluation. Inference and evaluation code: https://github.com/ML-GSAI/LLaDA.

Architecture

| | iLLaDA 8B | LLaDA 8B |
|---|---|---|
| Layers | 32 | 32 |
| Model dimension | 4096 | 4096 |
| Attention heads | 32 | 32 |
| Key/Value heads | 8 | 32 |
| FFN dimension | 14,336 | 12,288 |
| Vocabulary size | 155,136 | 126,464 |
| Maximum sequence length | 8192 | 4096 |
| Embedding and LM-head | Tied | Untied |
| Total parameters | 7.62B | 8.02B |
| Non-embedding parameters | 6.98B | 6.98B |
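
The parameter counts in the table can be roughly reproduced from the listed hyperparameters. The sketch below is only an estimate under stated assumptions that the table does not confirm: a gated (SwiGLU-style) FFN with three weight matrices, no bias terms, and normalization weights ignored. The names `Config`, `non_embedding_params`, and `total_params` are hypothetical helpers, not part of the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int
    d_ffn: int
    vocab: int
    tied: bool  # True if the embedding and LM head share one matrix


def non_embedding_params(c: Config) -> int:
    head_dim = c.d_model // c.n_heads
    kv_dim = c.n_kv_heads * head_dim
    # Q and output projections are d_model x d_model; K and V are d_model x kv_dim.
    attn = 2 * c.d_model * c.d_model + 2 * c.d_model * kv_dim
    # Assumption: gated FFN with gate, up, and down projections, no biases.
    ffn = 3 * c.d_model * c.d_ffn
    return c.layers * (attn + ffn)


def total_params(c: Config) -> int:
    emb = c.vocab * c.d_model
    # A tied model counts the embedding matrix once, an untied one twice.
    return non_embedding_params(c) + (emb if c.tied else 2 * emb)


ILLADA = Config(32, 4096, 32, 8, 14336, 155136, True)
LLADA = Config(32, 4096, 32, 32, 12288, 126464, False)

for name, cfg in (("iLLaDA 8B", ILLADA), ("LLaDA 8B", LLADA)):
    print(f"{name}: non-embedding {non_embedding_params(cfg) / 1e9:.3f}B, "
          f"total {total_params(cfg) / 1e9:.3f}B")
```

Under these assumptions both models come out at about 6.98B non-embedding parameters: the smaller key/value projections from using 8 key/value heads offset the wider FFN. The estimated totals fall within about 0.01B of the table's figures, and the gap between the two models comes from the tied versus untied embedding and the different vocabulary sizes.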

Benchmark Results of Base Models

| | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
|---|---|---|---|---|
| Model type | Diffusion | Diffusion | Diffusion | AR |
| Training tokens | 12T | 2.3T | 18T + 0.6T | 18T |
| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |
| BBH | 71.3 | 49.7 | 57.9 | 63.9 |
| ARC-C | 60.8 | 45.9 | 59.8 | 51.5 |
| HellaSwag | 76.6 | 70.5 | 73.3 | 79.0 |
| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |
| MATH | 38.4 | 31.4 | 39.6 | 41.1 |
| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |
| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |
| Average | 63.9 | 51.1 | 61.4 | 63.3 |
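
The Average row appears consistent with an unweighted mean of the eight benchmark scores. The following sketch recomputes it from the values in the table. The unweighted-mean reading is an assumption, since the card does not state how the average was calculated, and the names `SCORES`, `REPORTED`, and `mean` are illustrative only.

```python
# Scores in table order: MMLU, BBH, ARC-C, HellaSwag, GSM8K, MATH, HumanEval, MBPP.
SCORES = {
    "iLLaDA 8B": [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B": [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B": [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Averages as reported in the table.
REPORTED = {
    "iLLaDA 8B": 63.9,
    "LLaDA 8B": 51.1,
    "Dream 7B": 61.4,
    "Qwen2.5 7B": 63.3,
}


def mean(values: list[float]) -> float:
    # Assumption: a plain unweighted mean across benchmarks.
    return sum(values) / len(values)


for model, values in SCORES.items():
    print(f"{model}: computed {mean(values):.3f}, reported {REPORTED[model]}")
```

Each recomputed mean agrees with the reported value to within rounding at one decimal place.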