---
license: other
library_name: pytorch
pipeline_tag: text-generation
tags:
- discrete-diffusion
- diffusion-language-model
- self-correction
- scdd
- icml-2026
datasets:
- openwebtext
---
# SCDD
This repository contains the released checkpoints for **Generalized Discrete Diffusion with Self-Correction**, accepted at ICML 2026.
SCDD is a self-correcting discrete diffusion language model. It is designed to preserve the parallel generation advantage of masked diffusion models while allowing already visible tokens to be revised directly during the denoising process.
## Introduction
Autoregressive language models generate text one token at a time. Masked diffusion language models instead use an order-agnostic denoising process that can generate many positions in parallel, which can reduce inference latency for long sequences. In practice, however, mainstream masked diffusion language models often decode only a limited number of tokens per step, because decoding too many tokens at once can disrupt token dependencies and degrade generation quality.
Self-correction offers a direct remedy: the model repairs low-quality tokens produced at earlier denoising steps, so more tokens can be decoded per step. Prior work has studied self-correction at inference time or through post-training, and GIDD studies pretraining-based self-correction with a multi-step BERT-style uniform-absorbing objective. The paper argues that GIDD's interpolation-based pipeline creates opaque interactions between uniform transitions and absorbing masks, and that its reverse process still retains remasking behavior.
SCDD reformulates pretraining-based self-correction in discrete time with explicit state transitions. The forward process combines absorbing-mask corruption and uniform token corruption. The backward process is derived from Bayes' rule and can revise visible tokens without sending them back to `[MASK]`. In the paper's formulation, SCDD also simplifies the training noise schedule, removes a redundant remasking step, and relies on uniform transitions to learn self-correction.
The paper reports experiments at GPT-2 scale on LM1B and OpenWebText. In these settings, SCDD improves few-step parallel generation quality and shows stronger self-correction behavior while preserving sample diversity as measured by unigram entropy.
## Method Summary
For a clean token `x`, SCDD uses a marginal forward distribution of the form
```text
q(z_t | x) = Cat(z_t; gamma_t (rho_t x + (1 - rho_t) u) + (1 - gamma_t) m),
```
where:
- `m` is the `[MASK]` token.
- `u` is the uniform distribution over non-`[MASK]` tokens.
- `gamma_t` is the probability that `z_t` is not `[MASK]`.
- `rho_t` is the share of the non-`[MASK]` mass placed on the clean token `x`; the remaining `1 - rho_t` is spread uniformly via `u`.
The two parameters separate the absorbing-mask signal-to-noise ratio from the uniform-transition signal-to-noise ratio. This decoupling gives separate control over masking and token corruption while keeping the marginal distribution explicit.
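The marginal above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the released training code: the function names `forward_marginal_probs` and `sample_forward`, the convention that `[MASK]` takes index `vocab_size`, and the use of scalar `gamma_t` and `rho_t` values are all assumptions made for the example.

```python
import random


def forward_marginal_probs(x, gamma_t, rho_t, vocab_size):
    """Return q(z_t | x) as a list of length vocab_size + 1.

    Indices 0..vocab_size-1 are ordinary tokens; index vocab_size is [MASK]
    (an indexing convention assumed for this sketch).
    """
    # Uniform component u: spread gamma_t * (1 - rho_t) over non-[MASK] tokens.
    probs = [gamma_t * (1.0 - rho_t) / vocab_size] * vocab_size
    # Clean-token component: gamma_t * rho_t on x.
    probs[x] += gamma_t * rho_t
    # Mask component: the remaining 1 - gamma_t.
    probs.append(1.0 - gamma_t)
    return probs


def sample_forward(tokens, gamma_t, rho_t, vocab_size, rng):
    """Draw z_t independently per position from the marginal above."""
    mask_id = vocab_size
    noisy = []
    for x in tokens:
        if rng.random() >= gamma_t:
            noisy.append(mask_id)  # masked with probability 1 - gamma_t
        elif rng.random() < rho_t:
            noisy.append(x)  # clean token kept
        else:
            noisy.append(rng.randrange(vocab_size))  # uniform non-[MASK] token
    return noisy


probs = forward_marginal_probs(x=2, gamma_t=0.5, rho_t=0.8, vocab_size=4)
z_t = sample_forward([0, 1, 2, 3], gamma_t=0.5, rho_t=0.8, vocab_size=4,
                     rng=random.Random(0))
```

Because the uniform branch can redraw the clean token, the total probability of observing `x` in this sketch is `gamma_t * (rho_t + (1 - rho_t) / vocab_size)`, slightly above `gamma_t * rho_t`.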
Under the monotone schedules used in the paper, the `[MASK]` state is absorbing in the forward process. This choice removes remasking from the reverse generation process: during sampling, visible tokens may transition directly to other visible tokens, and masked tokens continue to denoise in parallel.
## What Is Released Here
This repository releases two OpenWebText SCDD checkpoints. They share the same architecture and training setup, and differ in the maximum uniform noise ratio `p_u`.
| File | Config | Model | Uniform noise ratio |
| --- | --- | --- | --- |
| `checkpoints/scdd_pu_0.1.ckpt` | `configs/scdd_pu_0.1.yaml` | SCDD (0.1) | `p_u = 0.1` |
| `checkpoints/scdd_pu_0.2.ckpt` | `configs/scdd_pu_0.2.yaml` | SCDD (0.2) | `p_u = 0.2` |
The checkpoint filenames intentionally use `scdd` naming for the public release.
## Model Configuration
Both checkpoints use the same GPT-2-scale DiT backbone and differ only in the SCDD uniform-noise ratio `p_u`.
| Setting | Value |
| --- | --- |
| Backbone | DiT / `ddit` |
| Parameterization | `scdd` |
| Dataset | OpenWebText |
| Tokenizer | GPT-2 |
| Context length | 512 |
| Hidden size | 768 |
| Number of blocks | 12 |
| Number of attention heads | 12 |
| Conditional dimension | 128 |
| Dropout | 0.0 |
| Diffusion steps used in training grid | 1000 |
| Forward process | `mix` |
| `gamma` schedule-shape parameter | 1 |
| Uniform-noise peak time | `t_peak = 0.5` |
| EMA | 0.9999 |
| Optimizer | Adam-style optimizer, lr `5e-4`, weight decay `0.02` |
| Precision | bfloat16 |
See `configs/scdd_pu_0.1.yaml` and `configs/scdd_pu_0.2.yaml` for sanitized public configuration files.
## Reported Evaluation Context
The paper evaluates SCDD against MDLM, ReMDM, and GIDD+ baselines. It reports generative perplexity on LM1B and OpenWebText across multiple sampling-step budgets, and also reports unigram entropy as a sanity check against repetitive text. In the paper's Table 3, `SCDD (p_u = 0.2)` obtains the best generative perplexity in every reported LM1B and OWT sampling-step column.
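To make the unigram-entropy sanity check concrete, the following is a minimal sketch of one common way to compute it. It assumes the entropy is taken over the empirical token frequencies of a single generated sequence and uses the natural logarithm; the paper's exact definition (log base, per-sample versus corpus-level averaging) is not stated here, and the function name `unigram_entropy` is hypothetical.

```python
import math
from collections import Counter


def unigram_entropy(token_ids):
    """Entropy (in nats) of the empirical token distribution of one sequence."""
    if not token_ids:
        raise ValueError("token_ids must be non-empty")
    counts = Counter(token_ids)
    total = len(token_ids)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


# A degenerate, fully repetitive sample versus a sample with no repeats.
repetitive = unigram_entropy([7, 7, 7, 7])
varied = unigram_entropy([1, 2, 3, 4])
```

A sample that collapses onto a few repeated tokens yields low entropy, which is why a low generative perplexity should be read alongside this value.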
The paper also studies correction behavior directly. In a controlled corruption-recovery experiment on OpenWebText validation sequences, SCDD modifies nearly all intentionally corrupted tokens and exactly recovers a large fraction of them after one denoising step. These experiments are meant to test whether token edits are meaningful corrections rather than frequent but unhelpful revisions.
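The two quantities described above, how many corrupted tokens get modified and how many are exactly recovered, can be sketched as follows. This is an assumed formulation for illustration only: the function name `correction_metrics`, the choice to identify corrupted positions by comparing the clean and corrupted sequences, and the normalization by the number of corrupted positions are not taken from the paper's evaluation scripts.

```python
def correction_metrics(clean, corrupted, denoised):
    """Return (modified_fraction, exact_recovery_fraction) over corrupted positions.

    clean, corrupted, and denoised are equal-length sequences of token ids.
    """
    if not (len(clean) == len(corrupted) == len(denoised)):
        raise ValueError("sequences must have equal length")
    # Positions where the corruption actually changed the token.
    positions = [i for i, (c, z) in enumerate(zip(clean, corrupted)) if c != z]
    if not positions:
        raise ValueError("no corrupted positions to evaluate")
    # Modified: the model changed the corrupted token to anything else.
    modified = sum(denoised[i] != corrupted[i] for i in positions)
    # Exactly recovered: the model restored the original clean token.
    recovered = sum(denoised[i] == clean[i] for i in positions)
    return modified / len(positions), recovered / len(positions)


# Toy example: positions 1 and 2 are corrupted; the model edits both
# but restores the clean token only at position 1.
modified_frac, recovered_frac = correction_metrics(
    clean=[1, 2, 3, 4],
    corrupted=[1, 9, 8, 4],
    denoised=[1, 2, 7, 4],
)
```

Reporting both fractions separates frequent edits from useful ones: a model can modify every corrupted token while recovering few of them.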
The paper notes that standard zero-shot likelihood benchmarks do not explicitly measure the self-correction ability studied in the generation and correction experiments.
## Code
Code, project page, and evaluation scripts are available at:
<https://github.com/laaaarrywang/Self-Correcting-Discrete-Diffusion>
## Citation
```bibtex
@article{wang2026generalized,
  title={Generalized Discrete Diffusion with Self-Correction},
  author={Wang, Linxuan and Wang, Ziyi and Bai, Yikun and Deng, Wei and Lin, Guang and Song, Qifan},
  journal={arXiv preprint arXiv:2603.02230},
  year={2026}
}
```