iLLaDA-8B-Instruct
iLLaDA is an 8B fully bidirectional masked diffusion language model trained from scratch on 12T pre-training tokens. It supports an 8192-token context length and variable-length generation, and uses confidence-based scoring for multiple-choice evaluation. Inference and evaluation code: https://github.com/ML-GSAI/LLaDA.
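
The card does not spell out how confidence-based scoring is computed. As a rough illustration only, one plausible reading is that each answer option receives a confidence score from the model's token probabilities and the highest-scoring option is selected. The sketch below assumes a length-normalised mean log-probability as the score; the helper names and the example probabilities are hypothetical and are not taken from the LLaDA repository.

```python
import math


def score_option(token_probs):
    """Mean log-probability of an option's tokens (an assumed confidence score)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def pick_option(option_token_probs):
    """Return the label with the highest score, plus all scores."""
    scores = {label: score_option(probs) for label, probs in option_token_probs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical per-token probabilities for four answer options.
example = {
    "A": [0.20, 0.30],
    "B": [0.70, 0.60],
    "C": [0.10, 0.05],
    "D": [0.40, 0.10],
}
best, scores = pick_option(example)
print(best)
```

Normalising by length keeps longer options from being penalised purely for having more tokens; the scoring rule in the official evaluation code may differ.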
Architecture
| | iLLaDA 8B | LLaDA 8B |
|---|---|---|
| Layers | 32 | 32 |
| Model dimension | 4096 | 4096 |
| Attention heads | 32 | 32 |
| Key/Value heads | 8 | 32 |
| FFN dimension | 14,336 | 12,288 |
| Vocabulary size | 155,136 | 126,464 |
| Maximum sequence length | 8192 | 4096 |
| Embedding and LM-head | Tied | Untied |
| Total parameters | 7.62B | 8.02B |
| Non-embedding parameters | 6.98B | 6.98B |
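
The parameter totals in the table can be approximately reproduced from the other rows. The sketch below is a back-of-the-envelope check, not the authors' accounting: it assumes a gated three-matrix FFN, no bias terms, and two normalization weight vectors per layer plus one final norm. None of these details are stated in the card. Under those assumptions the counts round to the figures in the table.

```python
from dataclasses import dataclass


@dataclass
class Config:
    layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int
    d_ffn: int
    vocab: int
    tied: bool  # whether the embedding and LM head share weights


def non_embedding_params(c: Config) -> int:
    head_dim = c.d_model // c.n_heads
    kv_dim = head_dim * c.n_kv_heads
    # Q and output projections are d_model x d_model; K and V shrink with fewer KV heads.
    attn = 2 * c.d_model * c.d_model + 2 * c.d_model * kv_dim
    # Assumption: gated FFN with three weight matrices (gate, up, down).
    ffn = 3 * c.d_model * c.d_ffn
    # Assumption: two norm weight vectors per layer, plus one final norm.
    norms = 2 * c.d_model
    return c.layers * (attn + ffn + norms) + c.d_model


def embedding_params(c: Config) -> int:
    copies = 1 if c.tied else 2
    return copies * c.vocab * c.d_model


illada = Config(32, 4096, 32, 8, 14336, 155136, tied=True)
llada = Config(32, 4096, 32, 32, 12288, 126464, tied=False)

for name, cfg in [("iLLaDA 8B", illada), ("LLaDA 8B", llada)]:
    body = non_embedding_params(cfg)
    total = body + embedding_params(cfg)
    print(f"{name}: non-embedding {body / 1e9:.2f}B, total {total / 1e9:.2f}B")
```

The check also shows why the two non-embedding counts match: the parameters saved by using 8 key/value heads instead of 32 are offset by the wider FFN (14,336 versus 12,288), while tying the embedding and LM head accounts for the lower total despite the larger vocabulary.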
Benchmark Results of Instruct Models
| | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
|---|---|---|---|---|
| Model type | Diffusion | Diffusion | Diffusion | AR (autoregressive) |
| MMLU | 71.6 | 65.5 | 67.0 | 76.6 |
| MMLU-Pro | 52.3 | 37.0 | 43.3 | 56.3 |
| MMLU-Redux | 76.4 | 68.9 | 76.3 | 75.7 |
| GSM8K | 89.0 | 77.5 | 81.0 | 91.6 |
| MATH | 56.7 | 42.2 | 39.2 | 75.5 |
| HumanEval | 65.9 | 49.4 | 55.5 | 84.8 |
| MBPP | 58.0 | 41.0 | 58.8 | 79.2 |
| Average | 67.1 | 54.5 | 60.2 | 77.1 |
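
The Average row appears to be the unweighted mean of the seven benchmark scores listed for each model. The card does not state this explicitly, so treat it as an assumption. A minimal sketch that recomputes the averages from the table values:

```python
# Scores copied from the table, in row order:
# MMLU, MMLU-Pro, MMLU-Redux, GSM8K, MATH, HumanEval, MBPP.
scores = {
    "iLLaDA 8B": [71.6, 52.3, 76.4, 89.0, 56.7, 65.9, 58.0],
    "LLaDA 8B": [65.5, 37.0, 68.9, 77.5, 42.2, 49.4, 41.0],
    "Dream 7B": [67.0, 43.3, 76.3, 81.0, 39.2, 55.5, 58.8],
    "Qwen2.5 7B": [76.6, 56.3, 75.7, 91.6, 75.5, 84.8, 79.2],
}

# Unweighted mean, rounded to one decimal place like the table.
averages = {model: round(sum(vals) / len(vals), 1) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```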