iLLaDA-8B-Base

iLLaDA is an 8B-parameter, fully bidirectional masked diffusion language model trained from scratch on 12T pre-training tokens. It supports an 8192-token context length and variable-length generation, and uses confidence-based scoring for multiple-choice evaluation. Inference and evaluation code: https://github.com/ML-GSAI/LLaDA.

Architecture

	iLLaDA 8B	LLaDA 8B
Layers	32	32
Model dimension	4096	4096
Attention heads	32	32
Key/Value heads	8	32
FFN dimension	14,336	12,288
Vocabulary size	155,136	126,464
Maximum sequence length	8192	4096
Embedding and LM-head	Tied	Untied
Total parameters	7.62B	8.02B
Non-embedding parameters	6.98B	6.98B
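
The parameter counts in the table can be roughly reproduced from the other rows. The sketch below is a back-of-the-envelope estimate, not the official accounting: it assumes a gated (SwiGLU-style) FFN with three projection matrices, no bias terms, and it ignores normalization weights. None of these assumptions are stated in the table. The `Config` class and function names are hypothetical helpers introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Config:
    layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int
    d_ffn: int
    vocab: int
    tied: bool  # True if the embedding and LM-head share one matrix


def non_embedding_params(c: Config) -> int:
    """Estimate attention + FFN weights (assumes no biases, ignores norms)."""
    head_dim = c.d_model // c.n_heads
    kv_dim = head_dim * c.n_kv_heads
    # Q and output projections are d_model x d_model; K and V are d_model x kv_dim.
    attn = 2 * c.d_model * c.d_model + 2 * c.d_model * kv_dim
    # Assumed gated FFN: gate, up, and down projections.
    ffn = 3 * c.d_model * c.d_ffn
    return c.layers * (attn + ffn)


def embedding_params(c: Config) -> int:
    """One vocab x d_model matrix if tied, two if untied."""
    table = c.vocab * c.d_model
    return table if c.tied else 2 * table


# Values copied from the architecture table.
ILLADA = Config(32, 4096, 32, 8, 14336, 155136, tied=True)
LLADA = Config(32, 4096, 32, 32, 12288, 126464, tied=False)

for name, cfg in [("iLLaDA 8B", ILLADA), ("LLaDA 8B", LLADA)]:
    body = non_embedding_params(cfg)
    total = body + embedding_params(cfg)
    print(f"{name}: non-embedding ~{body / 1e9:.3f}B, total ~{total / 1e9:.3f}B")
```

Under these assumptions, both models come out at the same non-embedding size: the smaller key/value projections in iLLaDA offset its wider FFN. The difference in total parameters then comes from the embeddings, where iLLaDA has a larger vocabulary but a single tied matrix. The estimates land within about 0.01B of the figures in the table.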

Benchmark Results of Base Models

	iLLaDA 8B	LLaDA 8B	Dream 7B	Qwen2.5 7B
Model type	Diffusion	Diffusion	Diffusion	AR
Training tokens	12T	2.3T	18T + 0.6T	18T
MMLU	74.8	65.9	69.5	71.9
BBH	71.3	49.7	57.9	63.9
ARC-C	60.8	45.9	59.8	51.5
HellaSwag	76.6	70.5	73.3	79.0
GSM8K	81.9	70.3	77.2	78.9
MATH	38.4	31.4	39.6	41.1
HumanEval	50.0	35.4	57.9	56.7
MBPP	57.8	40.0	56.2	63.6
Average	63.9	51.1	61.4	63.3
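
As a quick consistency check, the Average row appears to be the unweighted mean of the eight benchmark scores above it. That reading is an assumption, since the table does not say how the average is computed. The snippet below recomputes the means from the table values; the variable names are illustrative only.

```python
# Scores copied from the table, in row order:
# MMLU, BBH, ARC-C, HellaSwag, GSM8K, MATH, HumanEval, MBPP.
scores = {
    "iLLaDA 8B": [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B": [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B": [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Averages as reported in the table.
reported = {
    "iLLaDA 8B": 63.9,
    "LLaDA 8B": 51.1,
    "Dream 7B": 61.4,
    "Qwen2.5 7B": 63.3,
}

# Unweighted mean over the eight benchmarks (assumed averaging method).
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

for name, mean in means.items():
    print(f"{name}: recomputed {mean:.3f}, reported {reported[name]}")
```

The recomputed means agree with the reported averages to within rounding at one decimal place.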


Safetensors checkpoint: 8B params, BF16 tensors.
