Update README.md
README.md
CHANGED
@@ -1,7 +1,90 @@
- A PyTorch implementation of FS-DFM with custom solvers for efficient text generation and discrete sequence modeling. This software project accompanies the research paper, [FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models](https://arxiv.org/abs/2509.20624). [GitHub repository](https://github.com/apple/ml-fs-dfm)

---
language:
- en
tags:
- diffusion
- discrete-flow-matching
- flow-matching
- ctmc
- text-generation
- language-modeling
- pytorch
library_name: pytorch
pipeline_tag: text-generation
license: other
---

# FS-DFM (Few-Step Discrete Flow-Matching)

This repository provides **FS-DFM checkpoints** from the paper:

**FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model**
Amin Karimi Monsefi, Nikhil Bhendawade, Manuel R. Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova (Jan 9, 2026)
arXiv: [2509.20624](https://arxiv.org/abs/2509.20624)

FS-DFM is a **token-space diffusion / flow-matching language model** designed for **fast long-text generation** by explicitly training for a **user-specified step budget** (e.g., 1–8 steps), while preserving a CTMC-based discrete flow formulation.

## What’s in this repo

### Checkpoint files
- `FS_DFM_checkpoint.pth` — **FS-DFM 1.3B**, uniform source, **RK4 teacher distilled**
- `DFM_checkpoint.pth` — **DFM 1.3B**, uniform source, DFM pretrained initialization
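
The published `.pth` files are standalone PyTorch checkpoint files rather than `transformers` artifacts, so a quick sanity check is to open one with `torch.load` and look at what it contains. A minimal sketch; the internal layout (flat `state_dict` vs. nested `'model'`/`'ema'` entries) is an assumption, not something documented here:

```python
import torch

# Inspect a downloaded checkpoint. The structure of the saved object is an
# assumption, so we only print top-level keys and tensor shapes.
# (Depending on your PyTorch version you may need weights_only=False.)
ckpt = torch.load("FS_DFM_checkpoint.pth", map_location="cpu")

if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value).__name__)
else:
    print(type(ckpt))
```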

---

## Model summary

**Core idea (high level):**
- Condition the model on a **target inference step size/budget** and train it so that **one big step matches many small steps**.
- Use a **cumulative scalar** update to make large steps stable on the probability simplex.
- Use **student–teacher distillation** (Runge–Kutta shortcut teachers, EMA stabilization) to improve few-step fidelity.

**Formulation:** discrete flow-matching over a **CTMC** on token sequences; sampling uses custom solvers (e.g., `mixture_euler_with_cumulative_scalar`).
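
For intuition, here is an illustrative few-step sampling loop with explicit step-size conditioning. It is a generic Euler-style mixture update for a uniform-source discrete flow, not the repository's `mixture_euler_with_cumulative_scalar` solver; the `model(x_t, t, dt)` signature and the jump-probability schedule are assumptions:

```python
import torch

@torch.no_grad()
def few_step_sample(model, vocab_size, seq_len, num_steps=8, device="cpu"):
    """Illustrative few-step sampler (not the repo's solver).

    Assumes `model(x_t, t, dt)` returns logits of shape (batch, seq_len, vocab)
    for the predicted clean tokens, conditioned on both the current time t and
    the step size dt.
    """
    x = torch.randint(vocab_size, (1, seq_len), device=device)  # uniform source
    dt = 1.0 / num_steps

    for i in range(num_steps):
        t = i * dt
        t_vec = torch.full((1,), t, device=device)
        dt_vec = torch.full((1,), dt, device=device)
        logits = model(x, t_vec, dt_vec)
        pred = torch.distributions.Categorical(logits=logits).sample()

        # Euler-style mixture update: each position jumps to its predicted
        # token with probability dt / (1 - t), clamped to 1 on the last step,
        # and otherwise keeps its current value.
        jump_prob = min(dt / max(1.0 - t, 1e-8), 1.0)
        jump = torch.rand(x.shape, device=device) < jump_prob
        x = torch.where(jump, pred, x)

    return x
```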

---

## Architecture

From the paper’s implementation details:
- Backbone is a **DiT-style transformer** with **rotary attention**
- **Adaptive LayerNorm conditioning** in each block
- Conditioning includes **continuous time embedding** + **step-size embedding** (see the sketch below)
- Final linear head produces logits; conversion from logits to a CTMC generator + stepping happens in the solver
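
A rough sketch of how that conditioning could be wired into a DiT-style block. The module layout, the shift/scale/gate modulation, and the use of `nn.MultiheadAttention` in place of rotary attention are assumptions rather than the repository's implementation:

```python
import torch
import torch.nn as nn

class StepConditioning(nn.Module):
    """Embeds continuous time t and step size dt into one conditioning vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.t_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.dt_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t, dt):  # t, dt: (batch,)
        return self.t_mlp(t[:, None]) + self.dt_mlp(dt[:, None])

class AdaLNBlock(nn.Module):
    """DiT-style block with adaptive LayerNorm conditioning."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        # LayerNorm has no learned affine: scale/shift come from the conditioning.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)  # shift/scale/gate for attn and MLP

    def forward(self, x, cond):  # x: (batch, seq, dim), cond: (batch, dim)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = (
            self.ada(cond)[:, None, :].chunk(6, dim=-1)
        )
        h = self.norm(x) * (1 + scale_a) + shift_a
        x = x + gate_a * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm(x) * (1 + scale_m) + shift_m
        return x + gate_m * self.mlp(h)
```

The final linear head and the logits-to-generator conversion inside the solver are omitted here.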

- Tokenizer: **GPT-2 tokenizer**
- Training/eval packing: documents packed into **1024-token** blocks (EOS appended, then packed/concatenated).
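
A minimal sketch of that packing step, assuming the Hugging Face `GPT2TokenizerFast`; the actual pipeline (and how it handles the trailing partial block) may differ:

```python
from transformers import GPT2TokenizerFast

def pack_documents(docs, block_size=1024):
    """Tokenize each document, append EOS, concatenate, then cut into
    fixed-size blocks; the trailing partial block is dropped here."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    ids = []
    for doc in docs:
        ids.extend(tok.encode(doc))
        ids.append(tok.eos_token_id)
    return [ids[i : i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

blocks = pack_documents(["first document text ...", "second document text ..."])
```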

---

## Training data & evaluation data
- Training: **FineWeb-Edu**
- Evaluation: **WikiText-103**

(See the paper for details and the exact preprocessing pipeline.)
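
To mirror that data setup, both corpora are available on the Hugging Face Hub. The Hub IDs and configs below are common choices and an assumption, not necessarily the exact snapshots used in the paper:

```python
from datasets import load_dataset

# Assumed Hub IDs/configs; the paper may use different snapshots or splits.
train_stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
eval_set = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")

print(next(iter(train_stream))["text"][:200])
print(eval_set[0]["text"][:200])
```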

---

## Reported behavior (paper)
FS-DFM targets **long-horizon language modeling**. In the paper, **8-step sampling** is reported to reach **perplexity parity** with a **1024-step** discrete-flow baseline for **1024-token generation**, yielding up to **128× fewer model evaluations**.
> For exact numbers/plots and ablations, refer to the paper.

---

## How to use (recommended)
FS-DFM uses custom discrete solvers and is not a drop-in `transformers` model. The intended usage is via the official training/evaluation scripts.
### 1) Install the official code
```bash
git clone https://github.com/apple/ml-fs-dfm
cd ml-fs-dfm

conda env create -f fsdfm_environment.yml
conda activate FSDFM

pip install -e .