xdlm-owt — XDLM adapted from mdlm-owt (60000 steps, k=0.1)
An XDLM (arXiv 2602.01362, Balancing Understanding and Generation in Discrete Diffusion Models) obtained by continued pretraining of kuleshov-group/mdlm-owt (a pure MDLM) into the XDLM formulation on plain OpenWebText.
- XDLM = a stationary mixed noise kernel K = (k/N)·J + μ·M unifying MDLM (k=0, absorbing/mask) and UDLM (k=1, uniform). Mixing ratio k=0.1 (the paper's sweet spot): of each corrupted token's mass, (1−k) goes to [MASK] and k to a uniform real token. Trained with the paper's unified single-posterior NELBO (eq. 15), which reduces exactly to MDLM at k=0 and UDLM at k=1.
- 169.6M vendored Duo DiT backbone, GPT-2 tokenizer, vocab 50258 ([MASK]=50257, pad=eos=50256). time_conditioning=False (sigma=0), matching mdlm-owt.
- Data: plain EER6/openwebtext-coarsetext, packed to L=1024 (first 2048 docs held out for validation).
- Recipe (paper-matched MDLM→XDLM SFT): 60000 steps, lr 2e-5 constant (warmup 100), AdamW(0.9, 0.999), wd 0, bf16, global batch 288, EMA 0.9999, k=0.1.
- These are the EMA weights of checkpoint-60000 (DiT backbone state_dict; flat model.safetensors at repo root, same layout as mdlm-owt).
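The mixing ratio described above can be illustrated with a small forward-corruption sketch. This is not the project's code: the function name `corrupt`, the per-token keep probability `alpha_t`, and the choice of ids 0..50256 as the "real" tokens are assumptions made for illustration. Only the [MASK] id and the (1−k)/k split come from this card.

```python
import random

MASK_ID = 50257   # [MASK] id, from this model card
NUM_REAL = 50257  # assumption: real tokens are ids 0..50256 (GPT-2 vocab)

def corrupt(tokens, alpha_t, k=0.1, rng=random):
    """Hypothetical sketch of sampling a noised sequence under a mixed kernel.

    Each token is kept with probability alpha_t (assumed schedule value).
    A corrupted token becomes a uniformly drawn real token with
    probability k, and [MASK] with probability 1 - k.
    """
    out = []
    for tok in tokens:
        if rng.random() < alpha_t:
            out.append(tok)                      # token survives unchanged
        elif rng.random() < k:
            out.append(rng.randrange(NUM_REAL))  # uniform channel
        else:
            out.append(MASK_ID)                  # absorbing (mask) channel
    return out
```

With k=0 every corrupted token is masked, as in MDLM; with k=1 every corrupted token is replaced by a uniform real token, as in UDLM. At k=0.1 roughly nine in ten corrupted tokens are masked.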
Results: held-out validation NELBO 3.537 (ppl 34.4) on the k=0.1 XDLM objective. Correctness: the training loss reduces bit-identically to the MDLM loss at k=0 and to the UDLM ELBO at k=1; a k=0 generation control reproduces the mdlm-owt gen-PPL (~59), so the XDLM sampler and pipeline are validated. Generation uses the exact XDLM reverse posterior (eq. 11) over {real}∪{mask}; it is not commit-once, because the uniform channel can revise tokens.
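The quoted perplexity is consistent with exponentiating the NELBO, assuming the NELBO is reported in nats per token (the card does not state the unit). The helper name `nelbo_to_ppl` is hypothetical:

```python
import math

def nelbo_to_ppl(nelbo_nats):
    """Convert a per-token NELBO in nats (assumed unit) to a perplexity bound."""
    return math.exp(nelbo_nats)

# Held-out validation NELBO from this card.
print(f"{nelbo_to_ppl(3.537):.1f}")  # prints 34.4
```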
Load (project code): run sampling/sample_xdlm.py --model EER6/xdlm-owt --k_uniform 0.1, or call duo_core.load_model("EER6/xdlm-owt", 1024, 50258, device). Adapt further with train/adapt_to_xdlm.py --init_ckpt EER6/xdlm-owt.
Reference companions: EER6/mdlm-owt-diff1,
EER6/mdlm-owt-trash.