EDLM External Init Checkpoints

External MDLM and AR baseline checkpoints used to seed EDLM-Soft warmup runs in the draft-refine project, plus the best-performing checkpoint produced by the inference and training improvements documented below.

Files

| File | Size | Arch | Steps | Notes |
|---|---|---|---|---|
| mdlm.ckpt | 2.7 GB | DiT (custom; models/dit.py) with sigma_map.mlp.* + norm1 per block | 1,112,551 (epoch 67) | Drive Lightning ckpt; full state. NOT the same arch as kuleshov-group/mdlm-owt, which uses adaLN_modulation. |
| ar.ckpt | 2.6 GB | AR baseline (no sigma_map) | 1,060,000 (epoch 72) | Drive Lightning ckpt; 101 backbone keys. Partial init when loaded into a DiT model via strict=False. |
| edlm-step2k-pure-mdlm.ckpt | 2.7 GB | DiT (HF kuleshov-group/mdlm-owt) + tiny_queries scorer head | 2,000 (post-pretrained) | Best ckpt: 2,000 steps of pure MDLM CE training on top of kuleshov-group/mdlm-owt. Achieves PPL=10.76 with the uncommitted_soft sampler at 256 steps (paper-aligned 128-sample eval). |
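The arch mismatch and the strict=False partial init can be checked directly on a checkpoint's state dict. A minimal sketch, assuming the local DiT class lives at models/dit.py as noted above (the constructor arguments and any key prefixes are repo-specific and not verified here):

```python
import torch

ckpt = torch.load("ar.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # Lightning ckpts nest weights under "state_dict"

# Arch fingerprint: the Drive ckpts carry sigma_map.mlp.* keys, while
# kuleshov-group/mdlm-owt uses adaLN_modulation instead.
print("sigma_map keys:", any("sigma_map.mlp" in k for k in state))
print("adaLN keys:   ", any("adaLN_modulation" in k for k in state))

# Partial init into a DiT model: strict=False skips keys the AR baseline
# lacks (e.g. sigma_map) and reports what was missing/unexpected.
# Keys may carry a module prefix (e.g. "backbone."); strip as needed.
from models.dit import DiT  # assumed local class; constructor args are hypothetical

model = DiT()
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"missing={len(missing)} unexpected={len(unexpected)}")
```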

Headline result

Baseline backbone: vanilla kuleshov-group/mdlm-owt (MDLM-1M). All rows use the paper-aligned 128-sample gen-PPL eval under a gpt2-large evaluator, matching the EDLM paper's eval protocol:

| Setup | Sampler | Steps | gen-PPL | Factor below paper's 61 |
|---|---|---|---|---|
| Vanilla MDLM (no train) | ddpm_cache (paper) | 1000 | 60.28 | replicates paper baseline |
| Vanilla MDLM (no train) | uncommitted_soft (ours) | 256 | 17.04 | 3.5× |
| edlm-step2k-pure-mdlm.ckpt | uncommitted_soft (ours) | 256 | 10.76 | 5.7× |
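For reference, a minimal sketch of the gen-PPL metric used above: score each decoded sample under gpt2-large and exponentiate the token-averaged NLL. This is illustrative, not the project's eval code; it assumes samples are already decoded to strings.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

@torch.no_grad()
def gen_ppl(samples, device="cuda"):
    """Generative perplexity of decoded text samples under gpt2-large."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
    lm = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()
    total_nll, total_tokens = 0.0, 0
    for text in samples:  # e.g. the 128 paper-aligned generations
        ids = tok(text, return_tensors="pt").input_ids.to(device)
        out = lm(input_ids=ids, labels=ids)  # mean CE over the len-1 shifted positions
        n = ids.shape[1] - 1
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```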

Winning training recipe

backbone: dit                          # (overridden by EBM hardcode → hf_dit)
ebm_backbone: tiny_queries
training:
  k_max_final: 1                       # NO rollouts (standard MDLM)
  soft_alpha: False                    # NO soft-α blending
  loss_on_all_positions: True
  threshold_commit_train: 0.9          # not used when k_max=1
  force_commit_strategy_train: none
sampling:
  predictor: uncommitted_soft
  steps: 256
  threshold: 0.9
  commit_sampling: True
  commit_temperature: 1.0
  force_commit_strategy: uniform
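To make the sampling block concrete, here is a minimal sketch of one threshold-commit step, under the assumption that uncommitted_soft commits a masked position once its predicted confidence clears threshold, samples committed tokens at commit_temperature, and leaves the rest masked. Function and tensor names are hypothetical, not the repo's API; force_commit_strategy: uniform would additionally force-commit leftover positions across the remaining steps.

```python
import torch
import torch.nn.functional as F

MASK_ID = 50257  # hypothetical [MASK] token id

def threshold_commit_step(logits, tokens, threshold=0.9, commit_temperature=1.0):
    """One sketched uncommitted_soft step: commit confident positions, keep the rest masked."""
    probs = F.softmax(logits, dim=-1)              # (B, L, V)
    conf, _ = probs.max(dim=-1)                    # per-position confidence (B, L)
    still_masked = tokens == MASK_ID
    commit = still_masked & (conf >= threshold)    # positions to commit this step
    # Sample committed tokens at commit_temperature; 1.0 is untempered, and
    # per the negative results below, <1.0 sharpens toward mode collapse.
    sampled = torch.distributions.Categorical(logits=logits / commit_temperature).sample()
    new_tokens = torch.where(commit, sampled, tokens)
    return new_tokens, commit
```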

Negative results (avoid)

  • k_max_final > 1: training rollouts cause monotonic regression to PPL ~200 within 5k steps
  • soft_alpha: True: training-time soft-α blending drifts the model away from the OWT distribution
  • force_commit_strategy_train: uniform: similar regression; not additive
  • predictor: ddpm_cache at steps: 256: too few steps; gives PPL=87.58 instead of 60.28
  • commit_temperature < 1.0: mode collapse (PPL drops but text becomes repetitive)
  • steps > 512 with threshold < 0.9: degenerate samples (low entropy, repetitive lists)

Compatibility notes

  • mdlm.ckpt and ar.ckpt (Drive originals) match the local models/dit.py arch (sigma_map.mlp.*) but DO NOT load cleanly into the standard EDLM EBM model: generation produces garbage at PPL=109,788, likely due to different layer ordering or normalization details in their training script.
  • edlm-step2k-pure-mdlm.ckpt is a Lightning ckpt produced by training on top of kuleshov-group/mdlm-owt with the recipe above. It loads cleanly via Lightning's load_from_checkpoint or with eval.partial_load_ckpt: True in the EDLM trainer; see the sketch below.
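A load sketch for the best ckpt; the import path and class name are placeholders for the EDLM trainer's actual LightningModule:

```python
# Substitute the trainer's real LightningModule; load_from_checkpoint
# restores both hyperparameters and weights from the Lightning ckpt.
from edlm.trainer import EDLMModule  # hypothetical import path

model = EDLMModule.load_from_checkpoint("edlm-step2k-pure-mdlm.ckpt")
model.eval()
```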