yasserrmd committed
Commit bf04f18 · verified · Parent: c8f6586

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,168 @@
# LLaDA-346M: Large Language Diffusion with Masking

## Model Description

This is a **346 million parameter** Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be a viable alternative to autoregressive language models.

### Key Features
- **Architecture**: Masked Diffusion Model (MDM) with a Transformer encoder
- **Parameters**: 346M
- **Sequence Length**: 512 tokens
- **Vocab Size**: 50,257 (GPT-2)
- **Training Data**: 50,000 WikiText-2 samples
## Model Architecture

```
Token Embeddings (50257 × 1024)

Position Embeddings (512 × 1024)

Time Embeddings (MLP)

Transformer Encoder (12 layers, 16 heads)
├─ Self-Attention
└─ Feed-Forward (4096 dim)

Output Projection (1024 × 50257)
```
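The diagram maps onto a small amount of PyTorch. The following is a hedged sketch of what such a model could look like, assuming timestep conditioning via a small MLP added to the embeddings; it is an illustration, not the released modeling code (names and details may differ):

```python
import torch
import torch.nn as nn

class MaskedDiffusionModel(nn.Module):
    """Sketch of the described architecture (assumption, not the author's code)."""

    def __init__(self, vocab_size=50257, hidden_dim=1024, num_layers=12,
                 num_heads=16, ff_dim=4096, dropout=0.1,
                 max_seq_length=512, num_timesteps=100):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_seq_length, hidden_dim)
        # Time embedding: scalar timestep -> hidden vector via a small MLP
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            hidden_dim, num_heads, ff_dim, dropout,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids, t):
        # ids: (B, L) token ids; t: (B,) timesteps in [0, 1]
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.tok_emb(ids) + self.pos_emb(pos)
        h = h + self.time_mlp(t[:, None])[:, None, :]   # broadcast over positions
        return self.out(self.encoder(h))                # (B, L, vocab_size)
```

`norm_first=True` matches the pre-LayerNorm choice listed under Optimization Techniques below.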
## Training Details

- **Algorithm**: Masked Diffusion Model (MDM)
- **Loss Function**: Cross-entropy on masked positions only
- **Optimizer**: AdamW (lr=3e-5, betas=(0.9, 0.95))
- **Batch Size**: 16 (effective 32 with gradient accumulation)
- **Gradient Checkpointing**: Enabled
- **Mixed Precision**: AMP (FP32/FP16)
- **Epochs**: 4
- **Training Samples**: 50,000
- **GPU**: NVIDIA V100 (22GB VRAM)
- **Training Time**: ~20 hours
## Performance

| Metric | Value |
|--------|-------|
| Initial Loss | 5.96 |
| Final Loss | 4.94 |
| Loss Reduction | 17.1% |
| Total Parameters | 346M |
| Model Size (FP32) | 1.38 GB |
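As a sanity check, the FP32 size and loss-reduction figures follow directly from the other numbers in the table:

```python
params = 346_000_000

# FP32 stores 4 bytes per parameter; size in decimal gigabytes
size_gb = params * 4 / 1e9
print(round(size_gb, 2))        # 1.38

# Relative loss reduction from initial to final loss
initial, final = 5.96, 4.94
reduction_pct = (initial - final) / initial * 100
print(round(reduction_pct, 1))  # 17.1
```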
## Usage

### Installation

```bash
pip install transformers torch
```
### Loading the Model

```python
import torch
from transformers import AutoTokenizer
from your_module import MaskedDiffusionModel

# Build the model with the training configuration
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100
)

# Load weights (map to CPU so this also works without a GPU)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
### Text Generation

```python
from diffusion_sampler import DiffusionSampler

# config and device come from your training setup
sampler = DiffusionSampler(model, tokenizer, config, device)

# Generate text via iterative unmasking
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,
    temperature=0.8,
    top_p=0.9
)
print(text)
```
## Model Characteristics

### Advantages
✅ **Bidirectional Context**: Attends to the full sequence, unlike left-to-right autoregressive models
✅ **Parallel Generation**: Can predict multiple tokens simultaneously
✅ **Reversal Invariance**: Comparable performance on forward and reversed tasks
✅ **Global Coherence**: Reduces error accumulation across long generations

### Limitations
❌ Slower generation (iterative denoising requires many forward passes)
❌ Higher inference compute
❌ Not fine-tuned for specific tasks
## Training Process

### Forward Process
- Gradually masks tokens at random
- At timestep t ∈ [0,1], each token is masked independently with probability t
- Produces a noisy version of the input

### Reverse Process
- Iteratively predicts and unmasks tokens
- Uses the transformer to predict the masked positions
- Trained with cross-entropy loss on masked tokens only
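The forward masking step fits in a few lines of plain Python. In this sketch `MASK_ID` is a hypothetical id for the mask token (one past the GPT-2 vocabulary); the actual implementation may reserve a different id:

```python
import random

MASK_ID = 50257  # hypothetical [MASK] token id (assumption, for illustration)

def forward_mask(tokens, t, rng=random):
    """Mask each token independently with probability t, for t in [0, 1]."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]

def masked_positions(original, noisy):
    """Indices the reverse process must predict (the loss is computed only here)."""
    return [i for i, tok in enumerate(noisy) if tok == MASK_ID]

tokens = [464, 2003, 286, 9552]          # example GPT-2 token ids
noisy = forward_mask(tokens, t=0.5)      # roughly half the tokens masked
print(masked_positions(tokens, noisy))
```

At t=0 the sequence is untouched and at t=1 it is fully masked; the reverse process walks t back toward 0, filling in predictions at the masked positions.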
## Optimization Techniques

- **Gradient Checkpointing**: Saves memory by recomputing activations during backprop
- **Mixed Precision (AMP)**: Uses FP16 where numerically safe
- **Gradient Accumulation**: Simulates batches larger than fit in memory
- **Pre-LayerNorm (norm-first)**: Improves training stability
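Gradient accumulation is the easiest of these to show in isolation. A minimal sketch with a toy model and random data (not the author's training loop):

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))

accum_steps = 2            # micro-batch 16 x 2 -> effective batch 32
optimizer_steps = 0

for i in range(8):         # 8 micro-batches
    x = torch.randn(16, 8)
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so gradients average, not sum
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)     # 4
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal to what a single large batch would produce.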
## Citation

If you use this model, please cite:

```bibtex
@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```
## License

MIT License. Free to use for research and commercial purposes.

## Acknowledgments

- Based on "Large Language Diffusion Models" (Nie et al., 2025)
- Built with PyTorch and Transformers
- Trained on the WikiText-2 dataset
- Inspired by diffusion models for vision (DiT, Genie)

## Contact & Support

For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.
config.json ADDED
@@ -0,0 +1,22 @@
{
  "architectures": [
    "MaskedDiffusionModel"
  ],
  "model_type": "llada",
  "vocab_size": 50257,
  "hidden_size": 1024,
  "num_hidden_layers": 12,
  "num_attention_heads": 16,
  "intermediate_size": 4096,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "max_position_embeddings": 512,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "pad_token_id": 50256,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "num_timesteps": 100,
  "masking_schedule": "uniform"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1719f3f94a3b14a5a4d7023efdb2bd10158d9acb25833c67d15a209dbd070aec
size 1022902031
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "50256": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
training_info.json ADDED
@@ -0,0 +1,10 @@
{
  "model_name": "LLaDA-346M",
  "parameters": 255709265,
  "training_samples": 23679,
  "training_steps": 5916,
  "final_loss": 1.40234375,
  "initial_loss": 10.90625,
  "training_time_hours": 591.6,
  "config": {}
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff