---
language:
- en
tags:
- diffusion
- discrete-flow-matching
- flow-matching
- ctmc
- text-generation
- language-modeling
- pytorch
library_name: pytorch
pipeline_tag: text-generation
license: other
---
# FS-DFM (Few-Step Discrete Flow-Matching)
**FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model**

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel R. Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova (Jan 9, 2026)

arXiv: [2509.20624](https://arxiv.org/abs/2509.20624) · [GitHub](https://github.com/apple/ml-fs-dfm/tree/main)
FS-DFM is a **token-space diffusion / flow-matching language model** designed for **fast long-text generation** by explicitly training for a **user-specified step budget** (e.g., 1–8 steps), while preserving a CTMC-based discrete flow formulation.
## What’s in this repo
### Checkpoint files
- [`FS_DFM_checkpoint.pth`](FS_DFM_checkpoint.pth) — **FS-DFM 1.3B**, uniform source, **RK4 teacher distilled**
- [`DFM_checkpoint.pth`](DFM_checkpoint.pth) — **DFM 1.3B**, uniform source, DFM pretrained initialization
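
The exact contents of these files depend on how the official training code saved them. As a minimal sketch for inspecting one, assuming standard `torch.load` usage (the key layout shown in the comments is hypothetical):

```python
import torch

# Load on CPU without needing the model class. weights_only=True refuses to
# unpickle arbitrary objects; drop it if the checkpoint stores more than tensors.
ckpt = torch.load("FS_DFM_checkpoint.pth", map_location="cpu", weights_only=True)

# The top-level layout (e.g., a raw state_dict vs. {"model": ..., "ema": ...})
# is determined by the official training code -- inspect before assuming.
keys = list(ckpt.keys()) if isinstance(ckpt, dict) else []
print(keys[:10])
```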
---
## Model summary
**Core idea (high level):**
- Condition the model on a **target inference step size/budget** and train it so that **one big step matches many small steps**.
- Use a **cumulative scalar** update to make large steps stable on the probability simplex.
- Use **student–teacher distillation** (Runge–Kutta shortcut teachers, EMA stabilization) to improve few-step fidelity.
**Formulation:** discrete flow-matching over a **CTMC** on token sequences; sampling uses custom solvers (e.g., `mixture_euler_with_cumulative_scalar`).
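
To make the bullets above concrete, here is a toy sampler for a mixture probability path `p_t(x | x1) = (1 - kappa(t)) * Uniform + kappa(t) * delta_{x1}`. This is a generic discrete-flow Euler step with a cumulative jump probability, **not** the official `mixture_euler_with_cumulative_scalar` solver; the `model(x, t, h)` signature and the `kappa` schedule are assumptions made for illustration.

```python
import torch

def few_step_sample(model, kappa, vocab_size, seq_len, num_steps, device="cpu"):
    """Toy few-step sampler on a CTMC mixture path (illustrative only).

    model(x_t, t, h): logits over x_1 per position, shape (B, L, V) -- assumed API.
    kappa(t): scheduler on [0, 1] with kappa(0) = 0 and kappa(1) = 1.
    """
    # Start from the uniform source distribution over tokens.
    x = torch.randint(vocab_size, (1, seq_len), device=device)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0  # the step size the model is conditioned on
        logits = model(x, t0.expand(1), h.expand(1))
        x1_hat = torch.distributions.Categorical(logits=logits).sample()
        # Cumulative jump probability for the whole step t0 -> t1: the chance a
        # token commits to x_1 by t1 given it had not by t0. Accumulating this
        # over one big step keeps the update on the probability simplex instead
        # of compounding many small Euler errors.
        k0, k1 = float(kappa(t0)), float(kappa(t1))
        jump = (k1 - k0) / max(1.0 - k0, 1e-8)
        mask = torch.rand(x.shape, device=device) < jump
        x = torch.where(mask, x1_hat, x)
    return x
```

With `kappa = lambda t: t` this reduces to a linear mixture schedule; conditioning the model on `h` is what lets one large step imitate many small ones.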
## Comparison of Methods
*(Figure: side-by-side generation demos for ARM, DFM, and FS-DFM (Ours); the animated images are not reproduced here.)*
---
## Architecture
From the paper’s implementation details:
- Backbone: a **DiT-style transformer** with **rotary position embeddings (RoPE)** in attention
- **Adaptive LayerNorm (adaLN) conditioning** in each block
- Conditioning combines a **continuous time embedding** with a **step-size embedding**
- A final linear head produces token logits; converting logits into a CTMC generator and stepping it happens in the solver (see the sketch after this list)
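
A minimal sketch of the adaLN conditioning pattern described above, with hypothetical module names and dimensions (RoPE and the MLP sub-layer are omitted for brevity; the real architecture is in the official repo):

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """One DiT-style block: LayerNorm whose scale/shift come from conditioning."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Maps the conditioning vector (time + step-size embeddings) to a
        # per-block scale and shift for the normalized activations.
        self.ada = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.ada(cond).chunk(2, dim=-1)        # (B, D) each
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # bidirectional
        return x + attn_out

dim, n_heads = 512, 8
block = AdaLNBlock(dim, n_heads)
x = torch.randn(2, 128, dim)   # token hidden states
t_emb = torch.randn(2, dim)    # continuous time embedding (stand-in)
h_emb = torch.randn(2, dim)    # step-size embedding (stand-in)
out = block(x, t_emb + h_emb)  # condition on both signals
```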
Tokenizer: **GPT-2 tokenizer**
Training/eval packing: each document gets an **EOS** token appended, documents are concatenated, and the stream is chunked into **1024-token** blocks.
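
A minimal packing sketch using the GPT-2 tokenizer from `transformers`; the paper's exact preprocessing may differ (e.g., how the final partial block is handled), so treat this as illustrative:

```python
from transformers import GPT2TokenizerFast

BLOCK = 1024
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def pack_documents(docs):
    """Tokenize each document, append EOS, concatenate, and chunk into
    fixed 1024-token blocks (a trailing partial block is dropped here)."""
    stream = []
    for doc in docs:
        stream.extend(tok(doc)["input_ids"])
        stream.append(tok.eos_token_id)
    return [stream[i:i + BLOCK] for i in range(0, len(stream) - BLOCK + 1, BLOCK)]

blocks = pack_documents(["First document.", "Second document."])
```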
---
## Training data & evaluation data
- Training: **FineWeb-Edu**
- Evaluation: **WikiText-103**
(See the paper for details and the exact preprocessing pipeline.)
---
## How to use
FS-DFM uses custom discrete solvers and is not a drop-in `transformers` model. The intended usage is via the official training/evaluation scripts.
> Please see [our official GitHub](https://github.com/apple/ml-fs-dfm/tree/main) for runnable training and sampling scripts.
### 1) Install the official code
```bash
git clone https://github.com/apple/ml-fs-dfm
cd ml-fs-dfm
conda env create -f fsdfm_environment.yml
conda activate FSDFM
pip install -e .
```