yasserrmd committed
Commit bf04f18 · verified · Parent: c8f6586

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,168 @@
# LLaDA-346M: Large Language Diffusion with Masking

## Model Description

This is a **346 million parameter** Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be a viable alternative to autoregressive language models.

### Key Features
- **Architecture**: Masked Diffusion Model (MDM) with a Transformer encoder
- **Parameters**: 346M
- **Sequence Length**: 512 tokens
- **Vocab Size**: 50,257 (GPT-2)
- **Training Data**: 50,000 WikiText-2 samples
## Model Architecture

```
Token Embeddings (50257 × 1024)

Position Embeddings (512 × 1024)

Time Embeddings (MLP)

Transformer Encoder (12 layers, 16 heads)
├─ Self-Attention
└─ Feed-Forward (4096 dim)

Output Projection (1024 × 50257)
```
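The diagram maps onto a small amount of PyTorch. The following is a hedged sketch of what such a model could look like, assuming timestep conditioning via a small MLP added to the embeddings; it is an illustration, not the released modeling code (names and details may differ):

```python
import torch
import torch.nn as nn

class MaskedDiffusionModel(nn.Module):
    """Sketch of the described architecture (assumption, not the author's code)."""

    def __init__(self, vocab_size=50257, hidden_dim=1024, num_layers=12,
                 num_heads=16, ff_dim=4096, dropout=0.1,
                 max_seq_length=512, num_timesteps=100):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_seq_length, hidden_dim)
        # Time embedding: scalar timestep -> hidden vector via a small MLP
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            hidden_dim, num_heads, ff_dim, dropout,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids, t):
        # ids: (B, L) token ids; t: (B,) timesteps in [0, 1]
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.tok_emb(ids) + self.pos_emb(pos)
        h = h + self.time_mlp(t[:, None])[:, None, :]   # broadcast over positions
        return self.out(self.encoder(h))                # (B, L, vocab_size)
```

`norm_first=True` matches the pre-LayerNorm choice listed under Optimization Techniques below.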
## Training Details

- **Algorithm**: Masked Diffusion Model (MDM)
- **Loss Function**: Cross-entropy on masked positions only
- **Optimizer**: AdamW (lr=3e-5, betas=(0.9, 0.95))
- **Batch Size**: 16 (effective 32 with gradient accumulation)
- **Gradient Checkpointing**: Enabled
- **Mixed Precision**: AMP (FP32/FP16)
- **Epochs**: 4
- **Training Samples**: 50,000
- **GPU**: NVIDIA V100 (22GB VRAM)
- **Training Time**: ~20 hours
## Performance

| Metric | Value |
|--------|-------|
| Initial Loss | 5.96 |
| Final Loss | 4.94 |
| Loss Reduction | 17.1% |
| Total Parameters | 346M |
| Model Size (FP32) | 1.38 GB |
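As a sanity check, the FP32 size and loss-reduction figures follow directly from the other numbers in the table:

```python
params = 346_000_000

# FP32 stores 4 bytes per parameter; size in decimal gigabytes
size_gb = params * 4 / 1e9
print(round(size_gb, 2))        # 1.38

# Relative loss reduction from initial to final loss
initial, final = 5.96, 4.94
reduction_pct = (initial - final) / initial * 100
print(round(reduction_pct, 1))  # 17.1
```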
## Usage

### Installation

```bash
pip install transformers torch
```
### Loading the Model

```python
import torch
from transformers import AutoTokenizer
from your_module import MaskedDiffusionModel

# Build the model with the training configuration
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100
)

# Load weights (map to CPU so this also works without a GPU)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
### Text Generation

```python
from diffusion_sampler import DiffusionSampler

# config and device come from your training setup
sampler = DiffusionSampler(model, tokenizer, config, device)

# Generate text via iterative unmasking
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,
    temperature=0.8,
    top_p=0.9
)
print(text)
```
## Model Characteristics

### Advantages
✅ **Bidirectional Context**: Attends to the full sequence, unlike left-to-right autoregressive models
✅ **Parallel Generation**: Can predict multiple tokens simultaneously
✅ **Reversal Invariance**: Comparable performance on forward and reversed tasks
✅ **Global Coherence**: Reduces error accumulation across long generations

### Limitations
❌ Slower generation (iterative denoising requires many forward passes)
❌ Higher inference compute
❌ Not fine-tuned for specific tasks
## Training Process

### Forward Process
- Gradually masks tokens at random
- At timestep t ∈ [0,1], each token is masked independently with probability t
- Produces a noisy version of the input

### Reverse Process
- Iteratively predicts and unmasks tokens
- Uses the transformer to predict the masked positions
- Trained with cross-entropy loss on masked tokens only
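The forward masking step fits in a few lines of plain Python. In this sketch `MASK_ID` is a hypothetical id for the mask token (one past the GPT-2 vocabulary); the actual implementation may reserve a different id:

```python
import random

MASK_ID = 50257  # hypothetical [MASK] token id (assumption, for illustration)

def forward_mask(tokens, t, rng=random):
    """Mask each token independently with probability t, for t in [0, 1]."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]

def masked_positions(original, noisy):
    """Indices the reverse process must predict (the loss is computed only here)."""
    return [i for i, tok in enumerate(noisy) if tok == MASK_ID]

tokens = [464, 2003, 286, 9552]          # example GPT-2 token ids
noisy = forward_mask(tokens, t=0.5)      # roughly half the tokens masked
print(masked_positions(tokens, noisy))
```

At t=0 the sequence is untouched and at t=1 it is fully masked; the reverse process walks t back toward 0, filling in predictions at the masked positions.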
## Optimization Techniques

- **Gradient Checkpointing**: Saves memory by recomputing activations during backprop
- **Mixed Precision (AMP)**: Uses FP16 where numerically safe
- **Gradient Accumulation**: Simulates batches larger than fit in memory
- **Pre-LayerNorm (norm-first)**: Improves training stability
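Gradient accumulation is the easiest of these to show in isolation. A minimal sketch with a toy model and random data (not the author's training loop):

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))

accum_steps = 2            # micro-batch 16 x 2 -> effective batch 32
optimizer_steps = 0

for i in range(8):         # 8 micro-batches
    x = torch.randn(16, 8)
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so gradients average, not sum
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)     # 4
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal to what a single large batch would produce.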
## Citation

If you use this model, please cite:

```bibtex
@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```
## License

MIT License. Free to use for research and commercial purposes.

## Acknowledgments

- Based on "Large Language Diffusion Models" (Nie et al., 2025)
- Built with PyTorch and Transformers
- Trained on the WikiText-2 dataset
- Inspired by diffusion models for vision (DiT, Genie)

## Contact & Support

For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.
config.json ADDED
@@ -0,0 +1,22 @@
{
  "architectures": [
    "MaskedDiffusionModel"
  ],
  "model_type": "llada",
  "vocab_size": 50257,
  "hidden_size": 1024,
  "num_hidden_layers": 12,
  "num_attention_heads": 16,
  "intermediate_size": 4096,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "max_position_embeddings": 512,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "pad_token_id": 50256,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "num_timesteps": 100,
  "masking_schedule": "uniform"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1719f3f94a3b14a5a4d7023efdb2bd10158d9acb25833c67d15a209dbd070aec
size 1022902031
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "50256": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
training_info.json ADDED
@@ -0,0 +1,10 @@
{
  "model_name": "LLaDA-346M",
  "parameters": 255709265,
  "training_samples": 23679,
  "training_steps": 5916,
  "final_loss": 1.40234375,
  "initial_loss": 10.90625,
  "training_time_hours": 591.6,
  "config": {}
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff