---
library_name: pytorch
tags:
  - diffusion
  - language-model
  - discrete-diffusion
  - absorbing-state
  - text-generation
  - tinystories
  - from-scratch
datasets:
  - roneneldan/TinyStories
pipeline_tag: text-generation
---

# diffusionlm-from-scratch — masked diffusion LM (DiT, 142M)

A masked (absorbing-state) **diffusion language model**, built and trained from
scratch on TinyStories. Instead of generating left-to-right one token at a time,
it starts from a sequence of pure `[MASK]` tokens and **denoises the whole
sequence in parallel** — committing the tokens it is most confident about first,
in whatever order the meaning falls into place.
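
The confidence-first unmasking described above can be illustrated with a toy loop. This is a minimal sketch, not the repo's sampler: `toy_denoiser`, `confidence_first_sample`, the tiny word list, and the even per-step commit budget are all assumptions made for illustration, and the random "confidence" stands in for the probability a trained model would assign.

```python
import random
from typing import Callable, List, Tuple

MASK = "[MASK]"
Prediction = Tuple[str, float]  # (guessed token, confidence in that guess)


def toy_denoiser(tokens: List[str], rng: random.Random) -> List[Prediction]:
    """Hypothetical stand-in for the model: one random guess per position."""
    vocab = ["once", "upon", "a", "time", "there", "was", "cat"]
    return [(rng.choice(vocab), rng.random()) for _ in tokens]


def confidence_first_sample(
    seq_len: int,
    steps: int,
    denoiser: Callable[[List[str], random.Random], List[Prediction]],
    seed: int = 0,
) -> List[str]:
    """Start fully masked and commit the most confident guesses each step."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        # Predict every position at once (the "parallel" part).
        preds = denoiser(tokens, rng)
        # Spread the remaining masked positions evenly over the remaining steps
        # (ceil division, so the last step always commits whatever is left).
        k = -(-len(masked) // (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


print(" ".join(confidence_first_sample(seq_len=8, steps=4, denoiser=toy_denoiser)))
```

Positions are filled in confidence order rather than left to right, so a token near the end of the sequence can be committed before one near the start.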

- **Code, training & sampling:** https://github.com/tchauffi/diffusionlm-from-scratch
- **Course / write-up:** [`RESEARCH.md`](https://github.com/tchauffi/diffusionlm-from-scratch/blob/main/RESEARCH.md) — a from-scratch course on discrete/text diffusion (D3PM → absorbing-state → sampling).
- **Demo site:** animations of real generations are in [`docs/`](https://github.com/tchauffi/diffusionlm-from-scratch/tree/main/docs).

## Model

| | |
|---|---|
| Architecture | DiT (transformer denoiser), bidirectional attention, adaLN-Zero |
| Parameters | ~142M |
| Hidden size / depth / heads | 768 / 12 / 12 |
| MLP ratio | 4.0 |
| Vocab | 8,192 (byte-level BPE, trained on TinyStories) |
| Max sequence length | 256 |
| Diffusion | absorbing-state (masked) discrete diffusion |
| Training data | [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) |
| Eval cross-entropy | **2.18** |

**Key finding:** uniform loss weighting (`w(t) = 1`), *not* the textbook ELBO
weight `1/σ(t)`, is what turned word-salad into coherent stories.
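
To make the two weightings concrete, here is a minimal sketch of how they would scale the same masked-token loss. It is not the repo's training code: `weighted_masked_loss` is a hypothetical helper, `σ(t)` is assumed to be the masking probability at time `t`, and normalizing by sequence length is one common choice rather than a confirmed detail.

```python
from typing import Sequence


def weighted_masked_loss(
    token_nll: Sequence[float],
    is_masked: Sequence[bool],
    sigma_t: float,
    weighting: str = "uniform",
) -> float:
    """Cross-entropy summed over masked positions, scaled by a weight w(t).

    token_nll[i] is the negative log-likelihood of the true token at position i.
    sigma_t is ASSUMED to be the masking probability at diffusion time t.
    """
    if not token_nll or len(token_nll) != len(is_masked):
        raise ValueError("token_nll and is_masked must be non-empty and equal length")
    if weighting == "uniform":
        weight = 1.0  # w(t) = 1
    elif weighting == "elbo":
        weight = 1.0 / sigma_t  # w(t) = 1 / sigma(t)
    else:
        raise ValueError(f"unknown weighting: {weighting!r}")
    # Only masked positions contribute; unmasked tokens are already known.
    masked_sum = sum(nll for nll, masked in zip(token_nll, is_masked) if masked)
    # Normalizing by sequence length is an assumption of this sketch.
    return weight * masked_sum / len(token_nll)


nll = [2.0, 1.0, 3.0, 4.0]
mask = [True, False, True, False]
print(weighted_masked_loss(nll, mask, sigma_t=0.5, weighting="uniform"))
print(weighted_masked_loss(nll, mask, sigma_t=0.5, weighting="elbo"))
```

Under this assumption the ELBO weight grows as `σ(t)` shrinks: at `σ(t) = 0.05` it multiplies the same loss by 20, while uniform weighting leaves it unchanged.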

## Files

- `final.pt` — checkpoint with two state dicts, `model` (EMA, preferred) and
  `raw`, plus the `config` used to build the model.
- `tokenizer.json`, `tokenizer_config.json` — the byte-level BPE tokenizer
  (`PreTrainedTokenizerFast`; special tokens `[PAD]` `[UNK]` `[MASK]`
  `<|endoftext|>`).
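
If you work with `final.pt` directly, choosing between the two state dicts comes down to a key lookup. The sketch below assumes the checkpoint loads as a plain dict with the keys listed above; `select_weights` is a hypothetical helper and is not part of the repo.

```python
from typing import Any, Mapping


def select_weights(ck: Mapping[str, Any], use_ema: bool = True) -> Any:
    """Pick one of the two state dicts stored in the checkpoint dict."""
    key = "model" if use_ema else "raw"  # "model" holds the EMA weights
    if key not in ck:
        raise KeyError(f"checkpoint has no {key!r} entry; found {sorted(ck)}")
    return ck[key]


# Typical use, assuming PyTorch is installed and final.pt has been downloaded:
#   import torch
#   ck = torch.load("final.pt", map_location="cpu")
#   state_dict = select_weights(ck)   # EMA weights (preferred)
#   config = ck["config"]             # settings used to build the model
```

Recent PyTorch releases default `torch.load` to `weights_only=True`. If the stored `config` is not a plain container, loading may need `weights_only=False`, which should only be used with files you trust.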

## Usage

Install the model code from the
[GitHub repo](https://github.com/tchauffi/diffusionlm-from-scratch), then
generate stories in a few lines. `DiffusionLM` bundles the model, tokenizer, and
absorbing-state scheduler:

```python
from diffusionlm_from_scratch import DiffusionLM

lm = DiffusionLM.from_pretrained("tchauffi/diffusionlm-from-scratch")
for story in lm.generate(n=4, seq_len=80, temperature=0.9):
    print(story)
```

`generate` exposes the sampler knobs (`order`, `steps`, `corrector_frac`,
`confidence_threshold`, …). For lower-level access, load just the model:

```python
from diffusionlm_from_scratch.model import DiT

model = DiT.from_pretrained("tchauffi/diffusionlm-from-scratch")  # downloads final.pt
# If you open final.pt directly instead, the checkpoint dict holds "config",
# "model" (EMA weights, preferred), and "raw".
```

See [`scripts/capture_trajectories.py`](https://github.com/tchauffi/diffusionlm-from-scratch/blob/main/scripts/capture_trajectories.py)
in the repo for the full parallel-denoising sampling loop.