---
language:
- en
tags:
- diffusion
- discrete-flow-matching
- flow-matching
- ctmc
- text-generation
- language-modeling
- pytorch
library_name: pytorch
pipeline_tag: text-generation
license: other
---

# FS-DFM (Few-Step Discrete Flow-Matching)

**FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model**

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel R. Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova (Jan 9, 2026)

ArXiv: 2509.20624

[GitHub Link](https://github.com/apple/ml-fs-dfm/tree/main)

[Paper Link](https://arxiv.org/abs/2509.20624)

FS-DFM is a **token-space diffusion / flow-matching language model** designed for **fast long-text generation**: it is explicitly trained for a **user-specified step budget** (e.g., 1–8 steps) while preserving a CTMC-based discrete flow formulation.

## What’s in this repo

### Checkpoint files

- [`FS_DFM_checkpoint.pth`](FS_DFM_checkpoint.pth) — **FS-DFM 1.3B**, uniform source, **RK4 teacher distilled**
- [`DFM_checkpoint.pth`](DFM_checkpoint.pth) — **DFM 1.3B**, uniform source, DFM pretrained initialization
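
Both files are regular PyTorch checkpoints. Below is a minimal, illustrative sketch of loading and inspecting one with `torch.load`; the wrapping key (`"model"`) and everything downstream are assumptions made for this example, so use the official ml-fs-dfm loading code for actual inference.

```python
import torch

# Load the raw checkpoint on CPU; the file is a regular PyTorch pickle.
# (Key names below are illustrative; inspect the file to see the real layout,
# and use the official ml-fs-dfm code to build and load the model.)
ckpt = torch.load("FS_DFM_checkpoint.pth", map_location="cpu")

# A .pth file may hold either a bare state_dict or a dict wrapping one.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

# Print a few parameter names/shapes to sanity-check what was saved.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))
```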

---

## Model summary

**Core idea (high level):**

- Condition the model on a **target inference step size/budget** and train it so that **one big step matches many small steps**.
- Use a **cumulative scalar** update to make large steps stable on the probability simplex.
- Use **student–teacher distillation** (Runge–Kutta shortcut teachers, EMA stabilization) to improve few-step fidelity.

**Formulation:** discrete flow-matching over a **CTMC** on token sequences; sampling uses custom solvers (e.g., `mixture_euler_with_cumulative_scalar`).
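
For intuition only, here is a heavily simplified sketch of what a step-budget-conditioned discrete sampler looks like. It is **not** the official `mixture_euler_with_cumulative_scalar` solver (that lives in the GitHub repo); the model call signature, the uniform-source initialization, and the per-step categorical resampling below are illustrative assumptions.

```python
import torch

@torch.no_grad()
def few_step_sample(model, vocab_size, seq_len, num_steps=8, device="cpu"):
    """Illustrative few-step discrete sampler (NOT the official FS-DFM solver).

    The model is assumed to take (tokens, t, step_size), mirroring the paper's
    time + step-size conditioning, and return per-position logits over the vocab.
    """
    # Uniform source: start from independently uniform random tokens.
    x = torch.randint(vocab_size, (1, seq_len), device=device)
    dt = 1.0 / num_steps

    for i in range(num_steps):
        t = torch.full((1,), i * dt, device=device)   # current time in [0, 1)
        h = torch.full((1,), dt, device=device)       # step-size conditioning
        logits = model(x, t, h)                       # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        # One "big" Euler-style jump: resample every position from the
        # model's predicted distribution for the end of this step.
        x = torch.multinomial(probs.reshape(-1, vocab_size), 1).view(1, seq_len)
    return x
```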

## Comparison of Methods

| ARM | DFM | FS-DFM (Ours) |
|-----|-----|---------------|
|  |  |  |

---

## Architecture

From the paper’s implementation details:

- Backbone is a **DiT-style transformer** with **rotary attention**
- **Adaptive LayerNorm conditioning** in each block
- Conditioning includes **continuous time embedding** + **step-size embedding**
- Final linear head produces logits; conversion from logits to a CTMC generator + stepping happens in the solver
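
As a rough illustration of that conditioning pathway (continuous time plus step size feeding adaptive LayerNorm, in the spirit of DiT blocks), here is a hedged PyTorch sketch; the class, attribute names, and dimensions are invented for this example and do not correspond to the actual classes in ml-fs-dfm.

```python
import torch
import torch.nn as nn

class AdaLNConditioning(nn.Module):
    """Illustrative DiT-style adaLN conditioning on time t and step size h.

    Produces a (shift, scale) pair from the two embeddings; the real FS-DFM
    implementation may differ in structure and naming.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, hidden_dim), nn.SiLU())
        self.step_mlp = nn.Sequential(nn.Linear(1, hidden_dim), nn.SiLU())
        self.to_shift_scale = nn.Linear(hidden_dim, 2 * hidden_dim)

    def forward(self, x, t, h):
        # x: (batch, seq, hidden); t, h: (batch,) continuous scalars.
        cond = self.time_mlp(t[:, None]) + self.step_mlp(h[:, None])
        shift, scale = self.to_shift_scale(cond).chunk(2, dim=-1)
        # Modulate a parameter-free LayerNorm, as in adaLN.
        x = nn.functional.layer_norm(x, x.shape[-1:])
        return x * (1 + scale[:, None, :]) + shift[:, None, :]
```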

Tokenizer: **GPT-2 tokenizer**

Training/eval packing: documents are packed into **1024-token** blocks (EOS appended, then packed/concatenated), as sketched below.
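
A minimal sketch of that packing step, assuming the Hugging Face `GPT2TokenizerFast`: the block size of 1024 and the append-EOS-then-concatenate behavior come from the description above, while everything else is illustrative; the exact preprocessing pipeline is in the official repo.

```python
from transformers import GPT2TokenizerFast

BLOCK_SIZE = 1024  # block length used for training/eval packing
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def pack_documents(docs):
    """Tokenize each document, append EOS, concatenate into one stream,
    and cut the stream into fixed 1024-token blocks (remainder dropped)."""
    stream = []
    for doc in docs:
        stream.extend(tokenizer.encode(doc))
        stream.append(tokenizer.eos_token_id)
    n_blocks = len(stream) // BLOCK_SIZE
    return [stream[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in range(n_blocks)]
```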

---

## Training data & evaluation data

- Training: **FineWeb-Edu**
- Evaluation: **WikiText-103**

(See the paper for details and the exact preprocessing pipeline.)

---

## How to use

FS-DFM uses custom discrete solvers and is not a drop-in `transformers` model. The intended usage is via the official training/evaluation scripts.

> PLEASE SEE [OUR OFFICIAL GITHUB](https://github.com/apple/ml-fs-dfm/tree/main)

### 1) Install the official code

```bash
git clone https://github.com/apple/ml-fs-dfm
cd ml-fs-dfm

conda env create -f fsdfm_environment.yml
conda activate FSDFM

pip install -e .