---
license: apache-2.0
language:
- en
---

# LLaDA-346M: Large Language Diffusion with Masking

## Model Description

This is a **346 million parameter** Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be viable alternatives to autoregressive language models.

### Key Features

- **Architecture**: Masked Diffusion Model (MDM) with a Transformer encoder
- **Parameters**: 346M
- **Sequence Length**: 512 tokens
- **Vocab Size**: 50,257 (GPT-2 tokenizer)
- **Training Data**: 50,000 WikiText-2 samples

## Model Architecture

```
Token Embeddings (50257 × 1024)
        ↓
Position Embeddings (512 × 1024)
        ↓
Time Embeddings (MLP)
        ↓
Transformer Encoder (12 layers, 16 heads)
├─ Self-Attention
└─ Feed-Forward (4096 dim)
        ↓
Output Projection (1024 × 50257)
```

## Training Details

- **Algorithm**: Masked Diffusion Model (MDM)
- **Loss Function**: Cross-entropy on masked positions only
- **Optimizer**: AdamW (lr=3e-5, betas=(0.9, 0.95))
- **Batch Size**: 16 (effective 32 with gradient accumulation)
- **Gradient Checkpointing**: enabled
- **Mixed Precision**: AMP (FP16 where possible, FP32 elsewhere)
- **Epochs**: 4
- **Training Samples**: 50,000
- **GPU**: NVIDIA V100 (22 GB VRAM)
- **Training Time**: ~20 hours

## Performance

| Metric | Value |
|--------|-------|
| Initial Loss | 5.96 |
| Final Loss | 4.94 |
| Loss Reduction | 17.1% |
| Total Parameters | 346M |
| Model Size (FP32) | 1.38 GB |

## Usage

### Installation

```bash
pip install transformers torch
```

### Loading the Model

```python
import torch
from transformers import AutoTokenizer

from your_module import MaskedDiffusionModel  # model class from this repository

# Instantiate the model with the training configuration
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100,
)

# Load the trained weights
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Load the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```

### Text Generation

```python
import torch
from diffusion_sampler import DiffusionSampler  # sampler from this repository

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sampler = DiffusionSampler(model, tokenizer, config, device)  # config: the model configuration used above

# Generate text by iterative denoising
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,
    temperature=0.8,
    top_p=0.9,
)
print(text)
```

## Model Characteristics

### Advantages

✅ **Bidirectional Context**: attends to the full sequence, unlike left-to-right autoregressive models
✅ **Parallel Generation**: can predict multiple tokens per denoising step
✅ **Reversal Invariance**: comparable performance on forward and reversed tasks
✅ **Global Coherence**: less error accumulation than strictly left-to-right decoding

### Limitations

❌ Slower generation (iterative denoising process)
❌ Requires more compute at inference time
❌ Not fine-tuned for specific tasks

## Training Process

### Forward Process

- Gradually masks tokens at random
- At timestep t ∈ [0, 1], each token is masked independently with probability t
- Produces a noisy version of the input

### Reverse Process

- Iteratively predicts and unmasks tokens, starting from a fully masked sequence
- Uses the Transformer to predict tokens at the masked positions
- Trained with cross-entropy loss on masked tokens only

The sketch below makes both processes concrete.
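### Sketch (Illustrative)

The following is a minimal sketch of the forward masking, the training loss, and a reverse-process loop, not the repository's actual code: the `MASK_TOKEN_ID` value, the `model(tokens, t)` call signature, and the confidence-based unmasking schedule in `reverse_generate` are assumptions for demonstration (the GPT-2 vocabulary has no reserved mask token, so the ID below is hypothetical).

```python
import torch
import torch.nn.functional as F

MASK_TOKEN_ID = 50256  # hypothetical: GPT-2 has no reserved [MASK] token


def forward_mask(tokens: torch.Tensor, t: torch.Tensor):
    """Forward process: independently mask each token with probability t."""
    # tokens: (batch, seq) token IDs; t: (batch,) timesteps in [0, 1]
    mask = torch.rand(tokens.shape, device=tokens.device) < t.unsqueeze(1)
    return tokens.masked_fill(mask, MASK_TOKEN_ID), mask


def training_loss(model, tokens: torch.Tensor):
    """One MDM training step: sample t, mask, score masked positions only."""
    t = torch.rand(tokens.size(0), device=tokens.device)  # t ~ U[0, 1]
    noisy, mask = forward_mask(tokens, t)
    logits = model(noisy, t)  # (batch, seq, vocab); signature assumed
    # Cross-entropy over masked positions only (the LLaDA paper
    # additionally weights this term by 1/t)
    return F.cross_entropy(logits[mask], tokens[mask])


@torch.no_grad()
def reverse_generate(model, length: int, steps: int = 40, device: str = "cpu"):
    """Reverse process: start fully masked, unmask the most confident
    predictions step by step until no masks remain."""
    x = torch.full((1, length), MASK_TOKEN_ID, dtype=torch.long, device=device)
    for step in range(steps, 0, -1):
        t = torch.full((1,), step / steps, device=device)
        logits = model(x, t)
        preds = logits.argmax(dim=-1)
        confidence = logits.softmax(dim=-1).max(dim=-1).values
        still_masked = x == MASK_TOKEN_ID
        confidence = confidence.masked_fill(~still_masked, -1.0)  # only unmask masked slots
        # Keep roughly (step - 1) / steps of the sequence masked after this step
        num_to_unmask = int(still_masked.sum()) - int(length * (step - 1) / steps)
        if num_to_unmask > 0:
            idx = confidence.view(-1).topk(num_to_unmask).indices
            x.view(-1)[idx] = preds.view(-1)[idx]
    return x  # (1, length) token IDs, fully unmasked
```

Confidence-ordered unmasking is one common schedule for masked diffusion samplers; the repository's `DiffusionSampler` presumably also applies the temperature and top-p filtering shown in the usage example above when choosing which tokens to commit.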
## Optimization Techniques

- **Gradient Checkpointing**: saves memory during backpropagation
- **Mixed Precision (AMP)**: uses FP16 where possible
- **Gradient Accumulation**: simulates larger batches
- **Layer Norm First (Pre-LN)**: improves training stability

## Citation

If you use this model, please cite:

```bibtex
@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```

## License

Apache 2.0 License - free to use for research and commercial purposes.

## Acknowledgments

- Based on "Large Language Diffusion Models" (Nie et al., 2025)
- Built with PyTorch and Transformers
- Trained on the WikiText-2 dataset
- Inspired by diffusion models for vision (DiT, Genie)

## Contact & Support

For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.