metadata
license: other
library_name: pytorch
pipeline_tag: text-generation
tags:
  - discrete-diffusion
  - diffusion-language-model
  - self-correction
  - scdd
  - icml-2026
datasets:
  - openwebtext

SCDD

This repository contains the released checkpoints for Generalized Discrete Diffusion with Self-Correction, accepted at ICML 2026.

SCDD is a self-correcting discrete diffusion language model. It is designed to preserve the parallel generation advantage of masked diffusion models while allowing already visible tokens to be revised directly during the denoising process.

Introduction

Autoregressive language models generate text one token at a time. Masked diffusion language models instead use an order-agnostic denoising process, which can generate many positions in parallel and can reduce inference latency for long sequences. In practice, however, mainstream masked diffusion language models often decode only a limited number of tokens per step; decoding too many tokens can disrupt token dependencies and degrade generation quality.

Self-correction is a simple way to improve parallel generation: a model should be able to repair low-quality tokens from earlier denoising steps. Prior work has studied self-correction at inference time or through post-training, and GIDD studies pretraining-based self-correction with a multi-step BERT-style uniform-absorbing objective. The paper argues that GIDD's interpolation-based pipeline creates opaque interactions between uniform transitions and absorbing masks, and that its reverse process still retains remasking behavior.

SCDD reformulates pretraining-based self-correction in discrete time with explicit state transitions. The forward process combines absorbing-mask corruption and uniform token corruption. The backward process is derived from Bayes' rule and can revise visible tokens without sending them back to [MASK]. In the paper's formulation, SCDD also simplifies the training noise schedule, removes a redundant remasking step, and relies on uniform transitions to learn self-correction.

The paper reports experiments at GPT-2 scale on LM1B and OpenWebText. In these settings, SCDD improves few-step parallel generation quality and shows stronger self-correction behavior while preserving sample diversity as measured by unigram entropy.

Method Summary

For a clean token x, SCDD uses a marginal forward distribution of the form

q(z_t | x) = Cat(z_t; gamma_t (rho_t x + (1 - rho_t) u) + (1 - gamma_t) m),

where:

  • m is the [MASK] token.
  • u is the uniform distribution over non-[MASK] tokens.
  • gamma_t is the probability that z_t is not [MASK].
  • rho_t is the fraction of the non-[MASK] mass placed on the clean token x; the remaining 1 - rho_t is spread uniformly over non-[MASK] tokens.

The two parameters separate the absorbing-mask signal-to-noise ratio from the uniform-transition signal-to-noise ratio. This decoupling gives separate control over masking and token corruption while keeping the marginal distribution explicit.
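As a concrete illustration, the marginal above can be written out as an explicit probability vector over a toy vocabulary. This is a minimal sketch, not the repository's implementation: the function names `forward_marginal` and `sample_z_t`, the convention of placing [MASK] at the last index, and the example values of gamma_t and rho_t are assumptions made for illustration.

```python
import random


def forward_marginal(x, vocab_size, gamma_t, rho_t):
    """Return q(z_t | x) as a list of length vocab_size + 1.

    Indices 0..vocab_size-1 are ordinary tokens; index vocab_size is [MASK]
    (an illustrative convention, not necessarily the released code's).
    """
    if not 0 <= x < vocab_size:
        raise ValueError("x must be a non-[MASK] token id")
    # Uniform part: (1 - rho_t) of the non-[MASK] mass, spread evenly.
    uniform_mass = gamma_t * (1.0 - rho_t) / vocab_size
    probs = [uniform_mass] * vocab_size
    # Clean-token part: rho_t of the non-[MASK] mass stays on x.
    probs[x] += gamma_t * rho_t
    # The remaining mass goes to [MASK].
    probs.append(1.0 - gamma_t)
    return probs


def sample_z_t(x, vocab_size, gamma_t, rho_t, rng=random):
    """Draw one noisy token z_t ~ q(z_t | x)."""
    probs = forward_marginal(x, vocab_size, gamma_t, rho_t)
    return rng.choices(range(vocab_size + 1), weights=probs, k=1)[0]


# Toy example: 4 ordinary tokens, clean token id 2.
q = forward_marginal(x=2, vocab_size=4, gamma_t=0.6, rho_t=0.9)
```

With these toy values the [MASK] entry is 1 - gamma_t = 0.4, and the clean token receives gamma_t (rho_t + (1 - rho_t) / 4) = 0.555, because u also places mass on x.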

Under the monotone schedules used in the paper, the [MASK] state is absorbing in the forward process. This choice removes remasking from the reverse generation process: during sampling, visible tokens may transition directly to other visible tokens, and masked tokens continue to denoise in parallel.

What Is Released Here

This repository releases two OpenWebText SCDD checkpoints. They share the same architecture and training setup, and differ in the maximum uniform noise ratio p_u.

| File | Config | Model | Uniform noise ratio |
| --- | --- | --- | --- |
| checkpoints/scdd_pu_0.1.ckpt | configs/scdd_pu_0.1.yaml | SCDD (0.1) | p_u = 0.1 |
| checkpoints/scdd_pu_0.2.ckpt | configs/scdd_pu_0.2.yaml | SCDD (0.2) | p_u = 0.2 |

The checkpoint filenames intentionally use scdd naming for the public release.
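The file layout in the table can be captured in a small lookup helper. This is a sketch only: `RELEASED`, `release_files`, and `load_state` are hypothetical names, and the internal structure of the `.ckpt` files is not described here, so `load_state` simply returns whatever `torch.load` yields. It assumes the files have already been downloaded into a local directory.

```python
from pathlib import Path

# Paths copied from the release table, keyed by the uniform noise ratio p_u.
RELEASED = {
    "0.1": ("checkpoints/scdd_pu_0.1.ckpt", "configs/scdd_pu_0.1.yaml"),
    "0.2": ("checkpoints/scdd_pu_0.2.ckpt", "configs/scdd_pu_0.2.yaml"),
}


def release_files(p_u, root="."):
    """Return (checkpoint_path, config_path) for a released p_u value."""
    key = str(p_u)
    if key not in RELEASED:
        raise KeyError(f"no released checkpoint for p_u={p_u}")
    ckpt, cfg = RELEASED[key]
    return Path(root) / ckpt, Path(root) / cfg


def load_state(p_u, root="."):
    """Load the raw checkpoint object on CPU (requires PyTorch)."""
    import torch  # imported lazily so path lookup works without PyTorch

    ckpt_path, _ = release_files(p_u, root)
    # Assumption: the file is a standard torch-serialized object.
    # Recent PyTorch releases default to weights_only=True, which may reject
    # checkpoints that store non-tensor objects; that case is not handled here.
    return torch.load(ckpt_path, map_location="cpu")
```

Keeping the lookup separate from the load step means the paths can be checked without PyTorch installed.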

Model Configuration

Both checkpoints use the same GPT-2 scale DiT backbone and differ only in the SCDD uniform-noise ratio.

| Setting | Value |
| --- | --- |
| Backbone | DiT / ddit |
| Parameterization | scdd |
| Dataset | OpenWebText |
| Tokenizer | GPT-2 |
| Context length | 512 |
| Hidden size | 768 |
| Number of blocks | 12 |
| Number of attention heads | 12 |
| Conditional dimension | 128 |
| Dropout | 0.0 |
| Diffusion steps used in training grid | 1000 |
| Forward process | mix |
| gamma schedule-shape parameter | 1 |
| Uniform-noise peak time | t_peak = 0.5 |
| EMA | 0.9999 |
| Optimizer | Adam-style optimizer, lr 5e-4, weight decay 0.02 |
| Precision | bfloat16 |

See configs/scdd_pu_0.1.yaml and configs/scdd_pu_0.2.yaml for sanitized public configuration files.
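For quick reference in code, the shared settings can be mirrored in a small dataclass. The field names below are illustrative and are not taken from the released YAML files; only the values come from the configuration table. The per-head dimension is derived as hidden size divided by number of heads, which is the usual multi-head attention convention and an assumption about this backbone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SCDDConfig:
    # Values from the configuration table; field names are hypothetical.
    backbone: str = "ddit"
    parameterization: str = "scdd"
    dataset: str = "OpenWebText"
    tokenizer: str = "GPT-2"
    context_length: int = 512
    hidden_size: int = 768
    n_blocks: int = 12
    n_heads: int = 12
    cond_dim: int = 128
    dropout: float = 0.0
    training_grid_steps: int = 1000
    forward_process: str = "mix"
    gamma_shape: float = 1.0
    t_peak: float = 0.5
    ema: float = 0.9999
    lr: float = 5e-4
    weight_decay: float = 0.02
    precision: str = "bfloat16"
    p_u: float = 0.1  # the only setting that differs between checkpoints

    @property
    def head_dim(self) -> int:
        # Assumes the hidden size is split evenly across attention heads.
        if self.hidden_size % self.n_heads != 0:
            raise ValueError("hidden_size must be divisible by n_heads")
        return self.hidden_size // self.n_heads


scdd_01 = SCDDConfig(p_u=0.1)
scdd_02 = SCDDConfig(p_u=0.2)
```

Under that assumption, both checkpoints have a per-head dimension of 768 / 12 = 64.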

Reported Evaluation Context

The paper evaluates SCDD against MDLM, ReMDM, and GIDD+ baselines. It reports generative perplexity on LM1B and OpenWebText across multiple sampling-step budgets, and also reports unigram entropy as a sanity check against repetitive text. In the paper's Table 3, SCDD (p_u = 0.2) obtains the best generative perplexity in every reported LM1B and OpenWebText sampling-step column.
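Unigram entropy can be computed from token counts alone. The sketch below uses the Shannon entropy of the empirical token distribution of a single sequence, in nats. Whether the paper uses this exact normalization, log base, or averaging over samples is an assumption.

```python
import math
from collections import Counter


def unigram_entropy(token_ids):
    """Shannon entropy (nats) of the empirical unigram distribution."""
    if not token_ids:
        raise ValueError("token_ids must be non-empty")
    counts = Counter(token_ids)
    total = len(token_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


repetitive = [7, 7, 7, 7, 7, 7, 7, 7]  # one token repeated
diverse = [0, 1, 2, 3, 4, 5, 6, 7]     # eight distinct tokens
```

A fully repetitive sequence has entropy 0, while eight distinct tokens give log 8, about 2.08 nats, so a collapse toward repetition shows up as a drop in this value.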

The paper also studies correction behavior directly. In a controlled corruption-recovery experiment on OpenWebText validation sequences, SCDD modifies nearly all intentionally corrupted tokens and exactly recovers a large fraction of them after one denoising step. These experiments are meant to test whether token edits are meaningful corrections rather than frequent but unhelpful revisions.
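The two quantities in this experiment can be stated precisely as rates over the intentionally corrupted positions. This is a minimal sketch, assuming token sequences are given as equal-length lists of ids; the function name and the exact metric definitions are illustrative and may differ from the paper's evaluation scripts.

```python
def correction_rates(clean, corrupted, denoised):
    """Return (modified_rate, exact_recovery_rate) over corrupted positions.

    A position counts as corrupted when corrupted[i] != clean[i].
    modified: the model changed the corrupted token to anything else.
    recovered: the model restored the original clean token.
    """
    if not (len(clean) == len(corrupted) == len(denoised)):
        raise ValueError("all sequences must have the same length")
    positions = [i for i, (c, z) in enumerate(zip(clean, corrupted)) if c != z]
    if not positions:
        raise ValueError("no corrupted positions to evaluate")
    modified = sum(denoised[i] != corrupted[i] for i in positions)
    recovered = sum(denoised[i] == clean[i] for i in positions)
    return modified / len(positions), recovered / len(positions)


clean = [10, 11, 12, 13, 14]
corrupted = [10, 99, 12, 98, 97]  # positions 1, 3, 4 are corrupted
denoised = [10, 11, 12, 50, 97]   # 1 recovered, 3 changed but wrong, 4 untouched
modified_rate, recovery_rate = correction_rates(clean, corrupted, denoised)
```

In this toy case two of the three corrupted tokens are modified and one of three is exactly recovered. A gap between the two rates indicates edits that change a token without restoring it.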

The paper notes that standard zero-shot likelihood benchmarks do not explicitly measure the self-correction ability studied in the generation and correction experiments.

Code

Code, project page, and evaluation scripts are available at:

https://github.com/laaaarrywang/Self-Correcting-Discrete-Diffusion

Citation

@article{wang2026generalized,
  title={Generalized Discrete Diffusion with Self-Correction},
  author={Wang, Linxuan and Wang, Ziyi and Bai, Yikun and Deng, Wei and Lin, Guang and Song, Qifan},
  journal={arXiv preprint arXiv:2603.02230},
  year={2026}
}