Update README.md
README.md
CHANGED
@@ -1,7 +1,90 @@
- A PyTorch implementation of FS-DFM with custom solvers for efficient text generation and discrete sequence modeling. This software project accompanies the research paper, [FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models](https://arxiv.org/abs/2509.20624). [GitHub repository](https://github.com/apple/ml-fs-dfm)

---
language:
- en
tags:
- diffusion
- discrete-flow-matching
- flow-matching
- ctmc
- text-generation
- language-modeling
- pytorch
library_name: pytorch
pipeline_tag: text-generation
license: other
---

# FS-DFM (Few-Step Discrete Flow-Matching)

This repository provides **FS-DFM checkpoints** from the paper:

**FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model**
Amin Karimi Monsefi, Nikhil Bhendawade, Manuel R. Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova (Jan 9, 2026)
arXiv: [2509.20624](https://arxiv.org/abs/2509.20624)

FS-DFM is a **token-space diffusion / flow-matching language model** designed for **fast long-text generation** by explicitly training for a **user-specified step budget** (e.g., 1–8 steps), while preserving a CTMC-based discrete flow formulation.

## What’s in this repo

### Checkpoint files
- `FS_DFM_checkpoint.pth` — **FS-DFM 1.3B**, uniform source, **RK4 teacher distilled**
- `DFM_checkpoint.pth` — **DFM 1.3B**, uniform source, DFM pretrained initialization
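
The published `.pth` files are standalone PyTorch checkpoint files rather than `transformers` artifacts, so a quick sanity check is to open one with `torch.load` and look at what it contains. A minimal sketch; the internal layout (flat `state_dict` vs. nested `'model'`/`'ema'` entries) is an assumption, not something documented here:

```python
import torch

# Inspect a downloaded checkpoint. The structure of the saved object is an
# assumption, so we only print top-level keys and tensor shapes.
# (Depending on your PyTorch version you may need weights_only=False.)
ckpt = torch.load("FS_DFM_checkpoint.pth", map_location="cpu")

if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value).__name__)
else:
    print(type(ckpt))
```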

---

## Model summary

**Core idea (high level):**
- Condition the model on a **target inference step size/budget** and train it so that **one big step matches many small steps**.
- Use a **cumulative scalar** update to make large steps stable on the probability simplex.
- Use **student–teacher distillation** (Runge–Kutta shortcut teachers, EMA stabilization) to improve few-step fidelity.

**Formulation:** discrete flow-matching over a **CTMC** on token sequences; sampling uses custom solvers (e.g., `mixture_euler_with_cumulative_scalar`).
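
For intuition, here is an illustrative few-step sampling loop with explicit step-size conditioning. It is a generic Euler-style mixture update for a uniform-source discrete flow, not the repository's `mixture_euler_with_cumulative_scalar` solver; the `model(x_t, t, dt)` signature and the jump-probability schedule are assumptions:

```python
import torch

@torch.no_grad()
def few_step_sample(model, vocab_size, seq_len, num_steps=8, device="cpu"):
    """Illustrative few-step sampler (not the repo's solver).

    Assumes `model(x_t, t, dt)` returns logits of shape (batch, seq_len, vocab)
    for the predicted clean tokens, conditioned on both the current time t and
    the step size dt.
    """
    x = torch.randint(vocab_size, (1, seq_len), device=device)  # uniform source
    dt = 1.0 / num_steps

    for i in range(num_steps):
        t = i * dt
        t_vec = torch.full((1,), t, device=device)
        dt_vec = torch.full((1,), dt, device=device)
        logits = model(x, t_vec, dt_vec)
        pred = torch.distributions.Categorical(logits=logits).sample()

        # Euler-style mixture update: each position jumps to its predicted
        # token with probability dt / (1 - t), clamped to 1 on the last step,
        # and otherwise keeps its current value.
        jump_prob = min(dt / max(1.0 - t, 1e-8), 1.0)
        jump = torch.rand(x.shape, device=device) < jump_prob
        x = torch.where(jump, pred, x)

    return x
```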

---

## Architecture

From the paper’s implementation details:
- Backbone is a **DiT-style transformer** with **rotary attention**
- **Adaptive LayerNorm conditioning** in each block
- Conditioning includes **continuous time embedding** + **step-size embedding** (see the sketch below)
- Final linear head produces logits; conversion from logits to a CTMC generator + stepping happens in the solver
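
A rough sketch of how that conditioning could be wired into a DiT-style block. The module layout, the shift/scale/gate modulation, and the use of `nn.MultiheadAttention` in place of rotary attention are assumptions rather than the repository's implementation:

```python
import torch
import torch.nn as nn

class StepConditioning(nn.Module):
    """Embeds continuous time t and step size dt into one conditioning vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.t_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.dt_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t, dt):  # t, dt: (batch,)
        return self.t_mlp(t[:, None]) + self.dt_mlp(dt[:, None])

class AdaLNBlock(nn.Module):
    """DiT-style block with adaptive LayerNorm conditioning."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        # LayerNorm has no learned affine: scale/shift come from the conditioning.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)  # shift/scale/gate for attn and MLP

    def forward(self, x, cond):  # x: (batch, seq, dim), cond: (batch, dim)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = (
            self.ada(cond)[:, None, :].chunk(6, dim=-1)
        )
        h = self.norm(x) * (1 + scale_a) + shift_a
        x = x + gate_a * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm(x) * (1 + scale_m) + shift_m
        return x + gate_m * self.mlp(h)
```

The final linear head and the logits-to-generator conversion inside the solver are omitted here.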

- Tokenizer: **GPT-2 tokenizer**
- Training/eval packing: documents packed into **1024-token** blocks (EOS appended, then packed/concatenated).
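
A minimal sketch of that packing step, assuming the Hugging Face `GPT2TokenizerFast`; the actual pipeline (and how it handles the trailing partial block) may differ:

```python
from transformers import GPT2TokenizerFast

def pack_documents(docs, block_size=1024):
    """Tokenize each document, append EOS, concatenate, then cut into
    fixed-size blocks; the trailing partial block is dropped here."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    ids = []
    for doc in docs:
        ids.extend(tok.encode(doc))
        ids.append(tok.eos_token_id)
    return [ids[i : i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

blocks = pack_documents(["first document text ...", "second document text ..."])
```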

---

## Training data & evaluation data
- Training: **FineWeb-Edu**
- Evaluation: **WikiText-103**

(See the paper for details and the exact preprocessing pipeline.)
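
To mirror that data setup, both corpora are available on the Hugging Face Hub. The Hub IDs and configs below are common choices and an assumption, not necessarily the exact snapshots used in the paper:

```python
from datasets import load_dataset

# Assumed Hub IDs/configs; the paper may use different snapshots or splits.
train_stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
eval_set = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")

print(next(iter(train_stream))["text"][:200])
print(eval_set[0]["text"][:200])
```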

---

## Reported behavior (paper)
FS-DFM targets **long-horizon language modeling**. In the paper, **8-step sampling** is reported to reach **perplexity parity** with a **1024-step** discrete-flow baseline for **1024-token generation**, yielding up to **128× fewer model evaluations**.
> For exact numbers/plots and ablations, refer to the paper.

---

## How to use (recommended)
FS-DFM uses custom discrete solvers and is not a drop-in `transformers` model. The intended usage is via the official training/evaluation scripts.
### 1) Install the official code
```bash
git clone https://github.com/apple/ml-fs-dfm
cd ml-fs-dfm

conda env create -f fsdfm_environment.yml
conda activate FSDFM

pip install -e .