  # Diffusion LM — TinyStories

  A masked-diffusion language model trained from scratch on the
  [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.

  ## Demo

  ![Diffusion inference](inference.gif)

  ## Architecture

  | Hyperparameter | Value |
  |---|---|
  | Parameters | ~45M |
  | Hidden dim | 512 |
  | Layers | 10 |
  | Heads | 8 |
  | FFN dim | 2048 |
  | Diffusion steps T | 128 |
  | Sequence length | 256 |
  | Vocab size | 26,000 |
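
  The table maps onto a standard bidirectional transformer. Below is a
  minimal PyTorch skeleton with these dimensions; the class name, pre-norm
  layout, learned positional embeddings, and weight tying are assumptions,
  not the repo's actual module (tying the input and output embeddings is
  what lands the count near ~45M).

  ```python
  import torch
  import torch.nn as nn

  class MaskedDiffusionLM(nn.Module):
      """Bidirectional transformer denoiser (illustrative skeleton)."""

      def __init__(self, vocab_size=26_000, d_model=512, n_layers=10,
                   n_heads=8, ffn_dim=2048, max_len=256):
          super().__init__()
          self.tok_emb = nn.Embedding(vocab_size, d_model)
          self.pos_emb = nn.Embedding(max_len, d_model)
          layer = nn.TransformerEncoderLayer(
              d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim,
              batch_first=True, norm_first=True)
          # No causal mask: every position attends to the whole sequence.
          self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
          self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
          self.lm_head.weight = self.tok_emb.weight  # weight tying (assumed)

      def forward(self, ids):
          # ids: (batch, seq_len) -> logits: (batch, seq_len, vocab_size)
          pos = torch.arange(ids.size(1), device=ids.device)
          h = self.tok_emb(ids) + self.pos_emb(pos)
          return self.lm_head(self.encoder(h))
  ```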

  ## How it works

  This is a **masked diffusion** language model. Instead of generating
  tokens left-to-right like a standard LM, it starts with a fully masked
  sequence and progressively unmasks tokens over T diffusion steps.

  At each step the model predicts all masked tokens simultaneously, then
  re-masks the least confident predictions and repeats — gradually
  refining the output until the sequence is fully unmasked.
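
  A sketch of that sampling loop, assuming a linear unmasking schedule and
  greedy argmax fills (the checkpoint's actual schedule and any sampling
  temperature may differ; `mask_id` is the tokenizer's mask-token id):

  ```python
  import torch

  @torch.no_grad()
  def diffusion_sample(model, mask_id, seq_len=256, T=128, device="cpu"):
      """Start fully masked; commit a growing fraction of tokens each step."""
      ids = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
      for step in range(1, T + 1):
          logits = model(ids)                      # (1, seq_len, vocab)
          conf, pred = logits.softmax(-1).max(-1)  # confidence + argmax per slot
          masked = ids.eq(mask_id)
          filled = torch.where(masked, pred, ids)  # tentatively fill every mask
          if step == T:
              return filled                        # final step: accept everything
          # Keep the most confident tokens; re-mask the rest for the next pass.
          n_keep = seq_len * step // T             # linear schedule (assumed)
          conf = conf.masked_fill(~masked, float("inf"))  # committed slots stay
          keep = conf.topk(n_keep, dim=-1).indices
          ids = torch.full_like(ids, mask_id)
          ids.scatter_(1, keep, filled.gather(1, keep))
  ```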

  ## Training

  - Dataset: 1M TinyStories examples
  - Train steps: 60,000
  - Effective batch size: 64 (batch 32 × grad accum 2)
  - Optimizer: AdamW
  - Learning rate: 2e-4 with cosine schedule and 1,000 warmup steps
  - Weight decay: 0.1
  - Mixed precision: bf16
  - Hardware: NVIDIA RTX 3090 (24GB)
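
  In code, one optimization step looks roughly like the sketch below; the
  linear masking-rate schedule and uniform timestep draw are assumptions,
  and gradient accumulation plus bf16 autocast are elided for brevity.

  ```python
  import torch
  import torch.nn.functional as F

  def train_step(model, optimizer, ids, mask_id, T=128):
      """One masked-diffusion step: corrupt, predict, score masked slots only."""
      t = torch.randint(1, T + 1, (ids.size(0), 1), device=ids.device)
      mask_rate = t.float() / T                            # more noise at larger t
      masked = torch.rand(ids.shape, device=ids.device) < mask_rate
      corrupted = torch.where(masked, torch.full_like(ids, mask_id), ids)
      logits = model(corrupted)                            # (B, seq_len, vocab)
      loss = F.cross_entropy(logits[masked], ids[masked])  # masked positions only
      optimizer.zero_grad(set_to_none=True)
      loss.backward()
      optimizer.step()
      return loss.item()
  ```

  The optimizer settings listed above would plug in as
  `torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)`.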

  ## Evaluation

  Validation loss (mean cross-entropy on masked tokens, averaged over 20
  batches of held-out TinyStories):

  | Step | Val Loss |
  |------|----------|
  | 5,000 | 6.0313 |
  | 10,000 | 5.9045 |
  | 15,000 | 5.6092 |
  | 20,000 | 4.4481 |
  | 25,000 | 3.8447 |
  | 30,000 | 3.6634 |
  | 35,000 | 3.5419 |
  | 40,000 | 3.3554 |
  | 45,000 | 3.2779 |
  | 50,000 | 3.1767 |
  | 55,000 | 3.1012 |
  | 60,000 | 3.1067 |

  The sharp loss drop between steps 15,000 and 25,000 reflects the model
  learning basic language structure; the loss then plateaus around 3.10
  from step 55,000 onward.
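
  The numbers above come from a loop along these lines (a sketch; the
  masking draw is assumed to mirror training):

  ```python
  import itertools
  import torch
  import torch.nn.functional as F

  @torch.no_grad()
  def val_loss(model, val_loader, mask_id, T=128, n_batches=20):
      """Mean cross-entropy per masked token over held-out batches."""
      total, count = 0.0, 0
      for ids in itertools.islice(val_loader, n_batches):
          t = torch.randint(1, T + 1, (ids.size(0), 1), device=ids.device)
          masked = torch.rand(ids.shape, device=ids.device) < t.float() / T
          corrupted = torch.where(masked, torch.full_like(ids, mask_id), ids)
          logits = model(corrupted)
          total += F.cross_entropy(logits[masked], ids[masked],
                                   reduction="sum").item()
          count += int(masked.sum())
      return total / count
  ```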

  ## Files

  | File | Description |
  |---|---|
  | `model.pt` | Model weights (PyTorch state dict) |
  | `config.json` | Architecture hyperparameters |
  | `tokenizer/` | Byte-level BPE tokenizer |
  | `val_loss_history.json` | Validation loss curve |
  | `inference.gif` | Visualisation of progressive unmasking |
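
  A minimal loading sketch, assuming `config.json`'s keys match the
  constructor of a module like the skeleton above and that `tokenizer/`
  contains a Hugging Face `tokenizers` JSON file (both the filename and
  the `[MASK]` token name are assumptions about this repo's layout):

  ```python
  import json
  import torch
  from tokenizers import Tokenizer

  with open("config.json") as f:
      cfg = json.load(f)                   # architecture hyperparameters

  model = MaskedDiffusionLM(**cfg)         # skeleton class from above (assumed)
  model.load_state_dict(torch.load("model.pt", map_location="cpu"))
  model.eval()

  tok = Tokenizer.from_file("tokenizer/tokenizer.json")  # filename assumed
  mask_id = tok.token_to_id("[MASK]")                    # token name assumed
  story = tok.decode(diffusion_sample(model, mask_id)[0].tolist())
  ```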