# EDLM External Init Checkpoints

External MDLM and AR baseline checkpoints used to seed EDLM-Soft warmup runs in the draft-refine project, plus the best-trained checkpoint produced by the inference and training improvements documented below.
## Files

| File | Size | Arch | Steps | Notes |
|---|---|---|---|---|
| `mdlm.ckpt` | 2.7 GB | DiT (custom; `models/dit.py`) with `sigma_map.mlp.*` + `norm1` per block | 1,112,551 (epoch 67) | Drive Lightning ckpt; full state. NOT the same arch as `kuleshov-group/mdlm-owt`, which uses `adaLN_modulation`. |
| `ar.ckpt` | 2.6 GB | AR baseline (no `sigma_map`) | 1,060,000 (epoch 72) | Drive Lightning ckpt. 101 backbone keys. Partial init when loaded into a DiT model via `strict=False` (see the sketch below). |
| `edlm-step2k-pure-mdlm.ckpt` | 2.7 GB | DiT (HF `kuleshov-group/mdlm-owt`) + `tiny_queries` scorer head | 2,000 (post-pretrained) | Best ckpt: 2,000 steps of pure MDLM CE training on top of `kuleshov-group/mdlm-owt`. Achieves PPL = 10.76 with the `uncommitted_soft` sampler at 256 steps (paper-aligned 128-sample eval). |
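A minimal sketch of the `strict=False` partial init mentioned for `ar.ckpt`, assuming the Lightning state dict prefixes backbone weights with `backbone.` (that prefix, and the helper name, are assumptions rather than the project's exact trainer code):

```python
import torch

def partial_init_from_ar(model: torch.nn.Module, ckpt_path: str = "ar.ckpt") -> torch.nn.Module:
    """Load the AR baseline weights into an already-built DiT, skipping keys it lacks."""
    state_dict = torch.load(ckpt_path, map_location="cpu")["state_dict"]  # Lightning ckpt layout
    # Keep only backbone weights and strip the assumed "backbone." prefix.
    backbone_sd = {
        k.split(".", 1)[1]: v
        for k, v in state_dict.items()
        if k.startswith("backbone.")
    }
    missing, unexpected = model.load_state_dict(backbone_sd, strict=False)
    # The AR ckpt has no sigma_map.* weights, so those appear under `missing`.
    print(f"missing={len(missing)} unexpected={len(unexpected)}")
    return model
```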
## Headline result

Vanilla `kuleshov-group/mdlm-owt` MDLM-1M backbone, paper-aligned 128-sample gen-PPL eval under a `gpt2-large` evaluator (matches the EDLM paper's eval protocol; a measurement sketch follows the table):

| Setup | Sampler | Steps | Gen-PPL | Factor below paper baseline (61) |
|---|---|---|---|---|
| Vanilla MDLM (no train) | `ddpm_cache` (paper) | 1000 | 60.28 | replicates paper baseline |
| Vanilla MDLM (no train) | `uncommitted_soft` (ours) | 256 | 17.04 | 3.5× |
| `edlm-step2k-pure-mdlm.ckpt` | `uncommitted_soft` (ours) | 256 | 10.76 | 5.7× |
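For reference, a minimal sketch of the gen-PPL measurement assumed above: decode the 128 generated samples to text and score them with a frozen `gpt2-large`, averaging per-token NLL. The project's eval harness may batch and detokenize differently; this is illustrative only.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
evaluator = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")

@torch.no_grad()
def gen_ppl(samples: list[str]) -> float:
    """samples: decoded text of the 128 generated sequences."""
    total_nll, total_tokens = 0.0, 0
    for text in samples:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        # labels=ids makes the model return mean next-token cross-entropy
        loss = evaluator(ids, labels=ids).loss
        n = ids.shape[1] - 1  # number of next-token targets
        total_nll += loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```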
## Winning training recipe

```yaml
backbone: dit                # (overridden by EBM hardcode → hf_dit)
ebm_backbone: tiny_queries
training:
  k_max_final: 1                     # NO rollouts (standard MDLM)
  soft_alpha: False                  # NO soft-α blending
  loss_on_all_positions: True
  threshold_commit_train: 0.9        # not used when k_max=1
  force_commit_strategy_train: none
sampling:
  predictor: uncommitted_soft
  steps: 256
  threshold: 0.9
  commit_sampling: True
  commit_temperature: 1.0
  force_commit_strategy: uniform
```
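With `k_max_final: 1` and `soft_alpha: False`, training reduces to the standard MDLM masked cross-entropy. The sketch below illustrates that objective under a linear masking schedule; the function name, the `model(xt, t)` call signature, and the handling of `loss_on_all_positions` are assumptions, not the project's exact trainer code.

```python
import torch
import torch.nn.functional as F

def mdlm_ce_loss(model, x0, mask_token_id, eps=1e-3):
    """x0: (B, L) clean token ids; model(xt, t) -> (B, L, V) logits."""
    B, L = x0.shape
    t = torch.rand(B, device=x0.device).clamp(min=eps)                   # masking level per sequence
    is_masked = torch.rand(B, L, device=x0.device) < t[:, None]          # Bernoulli masking at rate t
    xt = torch.where(is_masked, torch.full_like(x0, mask_token_id), x0)

    logits = model(xt, t)                                                 # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")    # (B, L)

    # Linear-schedule ELBO weight 1/t on masked positions; with
    # loss_on_all_positions: True the unmasked positions' CE would be added too.
    loss = ((1.0 / t)[:, None] * ce * is_masked).sum() / is_masked.sum().clamp(min=1)
    return loss
```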
## Negative results (avoid)

- `k_max_final > 1`: training rollouts cause monotonic regression to PPL ~200 within 5k steps.
- `soft_alpha: True`: training-time soft-α blending drifts the model away from the OWT distribution.
- `force_commit_strategy_train: uniform`: similar regression; not additive.
- `predictor: ddpm_cache` at `steps: 256`: too few steps; gives PPL = 87.58 instead of 60.28.
- `commit_temperature < 1.0`: mode collapse (PPL drops but text becomes repetitive).
- `steps > 512` with `threshold < 0.9`: degenerate samples (low entropy, repetitive lists).
## Compatibility notes

- `mdlm.ckpt` and `ar.ckpt` (Drive originals) match the local `models/dit.py` arch (`sigma_map.mlp.*`) but do NOT load cleanly into the standard EDLM EBM model; generation produces garbage at PPL = 109,788. Different layer ordering or normalization details in their training script.
- `edlm-step2k-pure-mdlm.ckpt` is a Lightning ckpt produced by training on top of `kuleshov-group/mdlm-owt` with the recipe above. It loads cleanly via `Lightning.load_from_checkpoint` (sketch below) or `eval.partial_load_ckpt: True` in the EDLM trainer.
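A hypothetical sketch of the `load_from_checkpoint` path for standalone eval. The `Diffusion` LightningModule name, its module path, and the constructor kwargs follow the upstream MDLM codebase and are assumptions; substitute the project's actual module and config.

```python
import transformers
from diffusion import Diffusion  # assumed module path, as in the upstream MDLM repo

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")  # OWT models use the GPT-2 vocab
model = Diffusion.load_from_checkpoint(
    "edlm-step2k-pure-mdlm.ckpt",
    tokenizer=tokenizer,
    # pass config=... as well if hparams were not saved inside the ckpt
).eval()
```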