iLLaDA-8B-Base
iLLaDA is an 8B-parameter, fully bidirectional masked diffusion language model trained from scratch on 12T pre-training tokens. It supports an 8192-token context length, variable-length generation, and confidence-based scoring for multiple-choice evaluation. Inference and evaluation code: https://github.com/ML-GSAI/LLaDA.
Architecture
| | iLLaDA 8B | LLaDA 8B |
|---|---|---|
| Layers | 32 | 32 |
| Model dimension | 4096 | 4096 |
| Attention heads | 32 | 32 |
| Key/Value heads | 8 | 32 |
| FFN dimension | 14,336 | 12,288 |
| Vocabulary size | 155,136 | 126,464 |
| Maximum sequence length | 8192 | 4096 |
| Embedding and LM-head | Tied | Untied |
| Total parameters | 7.62B | 8.02B |
| Non-embedding parameters | 6.98B | 6.98B |
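The parameter counts in the table can be roughly reproduced from the other rows. The sketch below is a back-of-the-envelope check, not the authors' accounting: it assumes a LLaMA-style block with bias-free Q/K/V/O projections and a three-matrix gated FFN, and it ignores normalization weights. The function names are hypothetical.

```python
def non_embedding_params(layers, d_model, n_heads, n_kv_heads, d_ffn):
    """Approximate transformer-body parameters (assumed layout, no biases or norms)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # Q and O projections are d_model x d_model; K and V shrink with fewer KV heads.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # Assumed gated FFN: gate, up, and down projections.
    ffn = 3 * d_model * d_ffn
    return layers * (attn + ffn)


def embedding_params(vocab, d_model, tied):
    """Input embedding plus LM head; a tied head reuses the embedding matrix."""
    return vocab * d_model * (1 if tied else 2)


# Values taken from the architecture table above.
illada_body = non_embedding_params(32, 4096, 32, 8, 14336)
llada_body = non_embedding_params(32, 4096, 32, 32, 12288)
illada_total = illada_body + embedding_params(155136, 4096, tied=True)
llada_total = llada_body + embedding_params(126464, 4096, tied=False)

for name, body, total in [
    ("iLLaDA 8B", illada_body, illada_total),
    ("LLaDA 8B", llada_body, llada_total),
]:
    print(f"{name}: body {body / 1e9:.3f}B, total {total / 1e9:.3f}B")
```

Under these assumptions both bodies come to 6,979,321,856 parameters: the savings from 8 key/value heads are offset exactly by the wider FFN. The totals land within about 0.01B of the reported figures, with the difference between the two models coming from vocabulary size and embedding tying.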
Benchmark Results of Base Models
| | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
|---|---|---|---|---|
| Model type | Diffusion | Diffusion | Diffusion | AR |
| Training tokens | 12T | 2.3T | 18T + 0.6T | 18T |
| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |
| BBH | 71.3 | 49.7 | 57.9 | 63.9 |
| ARC-C | 60.8 | 45.9 | 59.8 | 51.5 |
| HellaSwag | 76.6 | 70.5 | 73.3 | 79.0 |
| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |
| MATH | 38.4 | 31.4 | 39.6 | 41.1 |
| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |
| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |
| Average | 63.9 | 51.1 | 61.4 | 63.3 |
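The Average row can be recomputed from the table. This is a small sketch that assumes the average is the unweighted mean of the eight benchmark rows (MMLU through MBPP); the card does not state how it was computed.

```python
# Scores copied from the table, in row order:
# MMLU, BBH, ARC-C, HellaSwag, GSM8K, MATH, HumanEval, MBPP.
scores = {
    "iLLaDA 8B": [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B": [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B": [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Unweighted mean per model (assumed definition of the Average row).
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

The recomputed means agree with the reported averages to within rounding of the last digit.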