---
license: apache-2.0
language:
- en
---

# LLaDA-346M: Large Language Diffusion with Masking

## Model Description

This is a **346 million parameter** Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be viable alternatives to autoregressive language models.

### Key Features

- **Architecture**: Masked Diffusion Model (MDM) with Transformer encoder
- **Parameters**: 346M
- **Sequence Length**: 512 tokens
- **Vocab Size**: 50,257 (GPT-2)
- **Training Data**: 50,000 WikiText-2 samples

## Model Architecture

```
Token Embeddings (50257 × 1024)
        ↓
Position Embeddings (512 × 1024)
        ↓
Time Embeddings (MLP)
        ↓
Transformer Encoder (12 layers, 16 heads)
  ├─ Self-Attention
  └─ Feed-Forward (4096 dim)
        ↓
Output Projection (1024 × 50257)
```
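
In PyTorch terms, this corresponds roughly to the sketch below. It is a minimal, hedged reconstruction: the class and argument names mirror the `MaskedDiffusionModel` signature used in the Usage section, but the internals (especially the time-embedding MLP) are assumptions, and the exact parameter count may differ from the released weights.

```python
import torch
import torch.nn as nn

class MaskedDiffusionModel(nn.Module):
    """Sketch of the MDM backbone described in the diagram above."""

    def __init__(self, vocab_size=50257, hidden_dim=1024, num_layers=12,
                 num_heads=16, ff_dim=4096, dropout=0.1,
                 max_seq_length=512, num_timesteps=100):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_seq_length, hidden_dim)
        # Time-embedding MLP: one plausible design, not necessarily the original
        self.time_emb = nn.Sequential(
            nn.Embedding(num_timesteps, hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True, norm_first=True,  # pre-LN
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, timesteps):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.token_emb(input_ids) + self.pos_emb(positions)
        h = h + self.time_emb(timesteps).unsqueeze(1)  # broadcast over sequence
        h = self.encoder(h)   # no causal mask: fully bidirectional attention
        return self.out_proj(h)  # per-position logits over the vocabulary
```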

## Training Details

- **Algorithm**: Masked Diffusion Model (MDM)
- **Loss Function**: Cross-entropy on masked positions
- **Optimizer**: AdamW (lr=3e-5, betas=(0.9, 0.95))
- **Batch Size**: 16 (effective 32 with gradient accumulation)
- **Gradient Checkpointing**: Enabled
- **Mixed Precision**: AMP (FP32/FP16)
- **Epochs**: 4
- **Training Samples**: 50,000
- **GPU**: NVIDIA V100 (22 GB VRAM)
- **Training Time**: ~20 hours
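
Combined, these settings give an inner training step roughly like the following hedged sketch. The `mask_tokens` helper is sketched under "Training Process" below; the data loader and device handling are illustrative, not the exact training code.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))
scaler = torch.cuda.amp.GradScaler()  # mixed precision (AMP)
accum_steps = 2                       # batch 16 -> effective batch 32

for step, batch in enumerate(loader):
    input_ids = batch["input_ids"].cuda()            # (B, 512)
    t = torch.rand(input_ids.size(0), device=input_ids.device)  # t ~ U[0, 1)
    noisy_ids, mask = mask_tokens(input_ids, t)      # forward (masking) process

    with torch.cuda.amp.autocast():
        logits = model(noisy_ids, (t * 100).long())  # discretize t into 100 steps
        # Cross-entropy on masked positions only
        loss = F.cross_entropy(logits[mask], input_ids[mask]) / accum_steps

    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                # gradient accumulation
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```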

## Performance

| Metric | Value |
|--------|-------|
| Initial Loss | 5.96 |
| Final Loss | 4.94 |
| Loss Reduction | 17.1% |
| Total Parameters | 346M |
| Model Size (FP32) | 1.38 GB |
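
As a sanity check, the size figure follows directly from the parameter count: 346 × 10⁶ parameters × 4 bytes per FP32 weight ≈ 1.38 GB.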

## Usage

### Installation

```bash
pip install transformers torch
```

### Loading the Model

```python
import torch
from transformers import AutoTokenizer

# The architecture class is not part of `transformers`; import it from the
# module in this repository that defines it.
from your_module import MaskedDiffusionModel

# Instantiate the model with the training configuration
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100
)

# Load weights (map_location avoids requiring a GPU at load time)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# The model reuses the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```

### Text Generation

```python
from diffusion_sampler import DiffusionSampler

# `config` and `device` come from your training setup
sampler = DiffusionSampler(model, tokenizer, config, device)

# Generate text via iterative denoising
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,
    temperature=0.8,
    top_p=0.9
)
print(text)
```

## Model Characteristics

### Advantages

✅ **Bidirectional Context**: Sees the full context, unlike autoregressive models
✅ **Parallel Generation**: Can predict multiple tokens simultaneously
✅ **Reversal Invariance**: Comparable performance on forward and reversed tasks
✅ **Global Coherence**: Reduces error accumulation during generation

### Limitations

❌ Slower generation (iterative denoising process)
❌ Requires more compute for inference
❌ Not fine-tuned for specific tasks

## Training Process

### Forward Process

- Gradually masks tokens at random
- At timestep t ∈ [0,1], each token is masked independently with probability t
- Creates a noisy version of the input (see the sketch below)
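
A minimal sketch of this masking step. The `mask_token_id` is an assumption: GPT-2's vocabulary reserves no mask token, so the real code must append one (and size the embedding accordingly) or repurpose an existing id.

```python
import torch

def mask_tokens(input_ids, t, mask_token_id=50257):
    """Forward process: mask each token independently with probability t.

    input_ids: (B, L) token ids; t: (B,) masking probabilities in [0, 1).
    mask_token_id is assumed to be an id appended beyond GPT-2's 50,257.
    """
    # One Bernoulli(t) draw per token; t broadcasts across the sequence dim
    mask = torch.rand_like(input_ids, dtype=torch.float) < t.unsqueeze(1)
    noisy = input_ids.clone()
    noisy[mask] = mask_token_id
    return noisy, mask
```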

### Reverse Process

- Iteratively predicts and unmasks tokens (sketched below)
- Uses the transformer to predict masked positions
- Trained with cross-entropy loss on masked tokens only
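
At inference time this becomes an iterative unmasking loop. The sketch below uses confidence-based remasking (keep the most confident predictions each step); it assumes batch size 1 and is illustrative rather than the exact sampler shipped with the model.

```python
import torch

@torch.no_grad()
def reverse_sample(model, ids, mask_token_id, num_steps=40):
    """Start from masked ids and progressively fill them in."""
    for step in range(num_steps):
        masked = ids == mask_token_id
        if not masked.any():
            break
        # Timesteps run from high (mostly masked) down toward 0
        t = torch.full((ids.size(0),), num_steps - 1 - step, device=ids.device)
        logits = model(ids, t)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf[~masked] = -1.0                     # only fill masked slots
        # Unmask an even share of the remaining masked tokens this step
        k = max(1, int(masked.sum().item() / (num_steps - step)))
        idx = conf.view(-1).topk(k).indices
        ids.view(-1)[idx] = pred.view(-1)[idx]
    return ids
```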

## Optimization Techniques

- **Gradient Checkpointing**: Saves memory during the backward pass
- **Mixed Precision (AMP)**: Uses FP16 where possible
- **Gradient Accumulation**: Simulates larger batches
- **Layer Norm First** (pre-LN): Improves training stability
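
The first and last of these are essentially one-line changes in PyTorch; a hedged sketch:

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Pre-LN ("layer norm first") via the built-in norm_first flag
layer = nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, dim_feedforward=4096,
    batch_first=True, norm_first=True,
)

# Gradient checkpointing: recompute each layer's activations during the
# backward pass instead of storing them, trading compute for memory
def run_layers(layers, h):
    for layer in layers:
        h = checkpoint(layer, h, use_reentrant=False)
    return h
```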

## Citation

If you use this model, please cite:

```bibtex
@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```

## License

Apache 2.0 (matching the model card metadata) - free to use for research and commercial purposes.

## Acknowledgments

- Based on "Large Language Diffusion Models" (Nie et al., 2025)
- Built with PyTorch and Transformers
- Trained on the WikiText-2 dataset
- Inspired by diffusion models for vision (DiT, Genie)

## Contact & Support

For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.