EDLM External Init Checkpoints

External MDLM and AR baseline checkpoints used to seed EDLM-Soft warmup runs in the draft-refine project, plus the best-performing checkpoint produced by the inference and training improvements documented below.

Files

| File | Size | Arch | Steps | Notes |
|---|---|---|---|---|
| mdlm.ckpt | 2.7 GB | DiT (custom; models/dit.py) with sigma_map.mlp.* + norm1 per block | 1,112,551 (epoch 67) | Drive Lightning ckpt; full state. NOT the same arch as kuleshov-group/mdlm-owt, which uses adaLN_modulation. |
| ar.ckpt | 2.6 GB | AR baseline (no sigma_map) | 1,060,000 (epoch 72) | Drive Lightning ckpt; 101 backbone keys. Partial init when loaded into a DiT model via strict=False. |
| edlm-step2k-pure-mdlm.ckpt | 2.7 GB | DiT (HF kuleshov-group/mdlm-owt) + tiny_queries scorer head | 2,000 (post-pretrained) | Best ckpt: 2,000 steps of pure MDLM CE training on top of kuleshov-group/mdlm-owt. Achieves PPL=10.76 with the uncommitted_soft sampler at 256 steps (paper-aligned 128-sample eval). |
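The arch mismatch and the strict=False partial init can be checked directly on a checkpoint's state dict. A minimal sketch, assuming the local DiT class lives at models/dit.py as noted above (the constructor arguments and any key prefixes are repo-specific and not verified here):

```python
import torch

ckpt = torch.load("ar.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # Lightning ckpts nest weights under "state_dict"

# Arch fingerprint: the Drive ckpts carry sigma_map.mlp.* keys, while
# kuleshov-group/mdlm-owt uses adaLN_modulation instead.
print("sigma_map keys:", any("sigma_map.mlp" in k for k in state))
print("adaLN keys:   ", any("adaLN_modulation" in k for k in state))

# Partial init into a DiT model: strict=False skips keys the AR baseline
# lacks (e.g. sigma_map) and reports what was missing/unexpected.
# Keys may carry a module prefix (e.g. "backbone."); strip as needed.
from models.dit import DiT  # assumed local class; constructor args are hypothetical

model = DiT()
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"missing={len(missing)} unexpected={len(unexpected)}")
```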

Headline result

Baseline backbone: vanilla kuleshov-group/mdlm-owt (MDLM-1M). All rows use the paper-aligned 128-sample gen-PPL eval under a gpt2-large evaluator, matching the EDLM paper's eval protocol:

| Setup | Sampler | Steps | gen-PPL | Factor below paper's 61 |
|---|---|---|---|---|
| Vanilla MDLM (no train) | ddpm_cache (paper) | 1000 | 60.28 | replicates paper baseline |
| Vanilla MDLM (no train) | uncommitted_soft (ours) | 256 | 17.04 | 3.5× |
| edlm-step2k-pure-mdlm.ckpt | uncommitted_soft (ours) | 256 | 10.76 | 5.7× |
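For reference, a minimal sketch of the gen-PPL metric used above: score each decoded sample under gpt2-large and exponentiate the token-averaged NLL. This is illustrative, not the project's eval code; it assumes samples are already decoded to strings.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

@torch.no_grad()
def gen_ppl(samples, device="cuda"):
    """Generative perplexity of decoded text samples under gpt2-large."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
    lm = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()
    total_nll, total_tokens = 0.0, 0
    for text in samples:  # e.g. the 128 paper-aligned generations
        ids = tok(text, return_tensors="pt").input_ids.to(device)
        out = lm(input_ids=ids, labels=ids)  # mean CE over the len-1 shifted positions
        n = ids.shape[1] - 1
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```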

Winning training recipe

backbone: dit                          # (overridden by EBM hardcode → hf_dit)
ebm_backbone: tiny_queries
training:
  k_max_final: 1                       # NO rollouts (standard MDLM)
  soft_alpha: False                    # NO soft-α blending
  loss_on_all_positions: True
  threshold_commit_train: 0.9          # not used when k_max=1
  force_commit_strategy_train: none
sampling:
  predictor: uncommitted_soft
  steps: 256
  threshold: 0.9
  commit_sampling: True
  commit_temperature: 1.0
  force_commit_strategy: uniform
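To make the sampling block concrete, here is a minimal sketch of one threshold-commit step, under the assumption that uncommitted_soft commits a masked position once its predicted confidence clears threshold, samples committed tokens at commit_temperature, and leaves the rest masked. Function and tensor names are hypothetical, not the repo's API; force_commit_strategy: uniform would additionally force-commit leftover positions across the remaining steps.

```python
import torch
import torch.nn.functional as F

MASK_ID = 50257  # hypothetical [MASK] token id

def threshold_commit_step(logits, tokens, threshold=0.9, commit_temperature=1.0):
    """One sketched uncommitted_soft step: commit confident positions, keep the rest masked."""
    probs = F.softmax(logits, dim=-1)              # (B, L, V)
    conf, _ = probs.max(dim=-1)                    # per-position confidence (B, L)
    still_masked = tokens == MASK_ID
    commit = still_masked & (conf >= threshold)    # positions to commit this step
    # Sample committed tokens at commit_temperature; 1.0 is untempered, and
    # per the negative results below, <1.0 sharpens toward mode collapse.
    sampled = torch.distributions.Categorical(logits=logits / commit_temperature).sample()
    new_tokens = torch.where(commit, sampled, tokens)
    return new_tokens, commit
```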

Negative results (avoid)

  • k_max_final > 1: training rollouts cause monotonic regression to PPL ~200 within 5k steps
  • soft_alpha: True: training-time soft-α blending drifts the model away from the OWT distribution
  • force_commit_strategy_train: uniform: similar regression; not additive
  • predictor: ddpm_cache at steps: 256: too few steps; gives PPL=87.58 instead of 60.28
  • commit_temperature < 1.0: mode collapse (PPL drops but text becomes repetitive)
  • steps > 512 with threshold < 0.9: degenerate samples (low entropy, repetitive lists)

Compatibility notes

  • mdlm.ckpt and ar.ckpt (Drive originals) match the local models/dit.py arch (sigma_map.mlp.*) but DO NOT load cleanly into the standard EDLM EBM model: generation produces garbage at PPL=109,788, likely due to different layer ordering or normalization details in their training script.
  • edlm-step2k-pure-mdlm.ckpt is a Lightning ckpt produced by training on top of kuleshov-group/mdlm-owt with the recipe above. It loads cleanly via Lightning's load_from_checkpoint or with eval.partial_load_ckpt: True in the EDLM trainer; see the sketch below.
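A load sketch for the best ckpt; the import path and class name are placeholders for the EDLM trainer's actual LightningModule:

```python
# Substitute the trainer's real LightningModule; load_from_checkpoint
# restores both hyperparameters and weights from the Lightning ckpt.
from edlm.trainer import EDLMModule  # hypothetical import path

model = EDLMModule.load_from_checkpoint("edlm-step2k-pure-mdlm.ckpt")
model.eval()
```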