I-DLM-32B

Introspective Diffusion Language Model (32B) — a diffusion language model converted from Qwen3-32B that matches autoregressive (AR) quality while enabling parallel token generation.

[Project Page] [Paper] [Code]

Highlights

  • Matches Qwen3-32B quality across 15 benchmarks (knowledge, math, code, instruction following)
  • Introspective Strided Decoding (ISD): single-pass generation + verification with p/q acceptance criterion
  • AR-compatible serving via SGLang (paged KV cache, continuous batching, CUDA graphs)
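ISD's p/q acceptance follows the speculative-sampling pattern: each drafted token in a stride is kept with probability min(1, p/q), where p is the verifier's probability for that token and q is the drafter's, and the first rejection ends the accepted prefix. A minimal sketch of that verification step (illustrative only; the function name and exact procedure are assumptions, not the paper's implementation):

```python
import random

def accepted_prefix(p_probs, q_probs, rng=None):
    """Toy ISD-style verification of one drafted stride.

    p_probs[i] / q_probs[i] are the verifier's (p) and drafter's (q)
    probabilities of the i-th drafted token.  Each token is kept with
    probability min(1, p/q); the first rejection ends the stride, and
    the function returns the length of the accepted prefix.
    """
    rng = rng or random.random
    accepted = 0
    for p_x, q_x in zip(p_probs, q_probs):
        # standard speculative acceptance: keep with prob min(1, p/q)
        if q_x > 0 and rng() < min(1.0, p_x / q_x):
            accepted += 1
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

In expectation, tokens the verifier likes at least as much as the drafter (p >= q) are always kept, so well-calibrated drafts accept long prefixes per pass.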

Results

Quality (I-DLM-32B vs baselines)

Benchmark           I-DLM-32B   Qwen3-32B (AR)   LLaDA-2.1-flash (100B)
ARC-C               97.0        96.8             91.0
MMLU                85.8        86.0             72.4
MMLU-Pro            79.3        79.8             -
GPQA-D              68.2        68.7             49.5
GSM8K               97.3        97.3             94.5
MATH-500            96.8        96.6             82.4
AIME-24             85.4        85.0             46.7
AIME-25             72.7        72.0             -
MathBench           93.3        93.5             -
HumanEval           95.7        95.7             81.1
MBPP                93.7        93.7             -
LiveCodeBench-v6    57.2        57.7             39.3
IFEval              87.1        87.1             83.0

Usage

This model uses a custom architecture (SDARForCausalLM) and requires trust_remote_code=True.

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True loads the custom SDARForCausalLM class from the repo
model = AutoModelForCausalLM.from_pretrained(
    "yifanyu/I-DLM-32B",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",  # shard the 32B model across available GPUs (requires accelerate)
)
tokenizer = AutoTokenizer.from_pretrained("yifanyu/I-DLM-32B")

For training code and ISD inference, see the GitHub repo.

Method

I-DLM recovers introspective consistency (AR models' inherent self-agreement) through:

  1. Strict causal masking across both masked and clean tokens
  2. Logit shift (Dream shift): hidden state at position i predicts token i+1
  3. All-masked training: CE loss on both noisy and clean token positions

Training loss: L = CE_noisy + alpha * CE_clean
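The shifted, dual-term objective above can be sketched in numpy as follows (a toy version; the helper name, masking convention, and default alpha are illustrative assumptions, and the real implementation lives in the training code):

```python
import numpy as np

def idlm_loss(logits, input_ids, noisy_mask, alpha=1.0):
    """Toy I-DLM training loss: L = CE_noisy + alpha * CE_clean.

    logits:     (T, V) array; per the Dream shift, the hidden state at
                position i predicts token i+1, so logits[i] is scored
                against input_ids[i + 1].
    noisy_mask: (T,) bool; True where the token was masked (noisy),
                False where it was clean.  CE is taken on both groups
                ("all-masked training"), clean term weighted by alpha.
    """
    # logit shift: logits[:-1] predict input_ids[1:]
    shifted_logits = logits[:-1]
    targets = input_ids[1:]
    tgt_noisy = noisy_mask[1:]

    # numerically stable log-softmax cross-entropy per position
    z = shifted_logits - shifted_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]

    ce_noisy = nll[tgt_noisy].mean() if tgt_noisy.any() else 0.0
    ce_clean = nll[~tgt_noisy].mean() if (~tgt_noisy).any() else 0.0
    return ce_noisy + alpha * ce_clean
```

Setting alpha=0 recovers a masked-only diffusion loss; the clean-token term is what trains the model to also score tokens it can already see.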

Related Models

Model           HuggingFace                   Description
I-DLM-8B        yifanyu/I-DLM-8B              Converted from Qwen3-8B
I-DLM-32B       yifanyu/I-DLM-32B             Converted from Qwen3-32B
I-DLM-8B-LoRA   yifanyu/I-DLM-8B-lora-r128    Gated LoRA adapter (rank=128) for lossless R-ISD

Citation

@article{yu2026introspective,
  title={Introspective Diffusion Language Models},
  author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
          and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
          and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
          and Dao, Tri and Athiwaratkun, Ben and Zou, James
          and Lai, Fan and Xu, Chenfeng},
  journal={arXiv preprint arXiv:2604.11035},
  year={2026}
}