Diffusers documentation

LLaDA2


LLaDA2 is a family of discrete diffusion language models that generate text through block-wise iterative refinement. Instead of autoregressive token-by-token generation, LLaDA2 starts with a fully masked sequence and progressively unmasks tokens by confidence over multiple refinement steps.
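The confidence-based unmasking loop can be illustrated with a toy sketch (a simplification for intuition only, not the actual BlockRefinementScheduler internals):

```python
import torch

# Toy sketch of one refinement step (illustration only): pick a greedy
# candidate per masked position and commit those whose confidence clears
# the threshold; low-confidence positions stay masked for later steps.
def refine_step(logits, block, mask_id, threshold):
    # logits: [block_len, vocab_size]; block: [block_len] token IDs
    probs = torch.softmax(logits, dim=-1)
    conf, candidates = probs.max(dim=-1)  # per-position confidence + candidate
    commit = (block == mask_id) & (conf >= threshold)
    return torch.where(commit, candidates, block), commit

mask_id = 9
block = torch.full((4,), mask_id)  # fully masked block
logits = torch.zeros(4, 10)
logits[0, 3] = 5.0                 # the model is confident only at position 0
block, commit = refine_step(logits, block, mask_id, threshold=0.7)
# position 0 commits token 3; the other positions remain masked
```

Running this over a block for several steps, with fresh logits each time, is the refinement loop the pipeline performs per block.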

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from diffusers import BlockRefinementScheduler, LLaDA2Pipeline

model_id = "inclusionAI/LLaDA2.1-mini"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
scheduler = BlockRefinementScheduler()

pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
output = pipe(
    prompt="Write a short poem about the ocean.",
    gen_length=256,
    block_length=32,
    num_inference_steps=32,
    threshold=0.7,
    editing_threshold=0.5,
    max_post_steps=16,
    temperature=0.0,
)
print(output.texts[0])

Callbacks

Callbacks run after each refinement step. Pass callback_on_step_end_tensor_inputs to select which tensors are included in callback_kwargs. In the current implementation, block_x (the sequence window being refined) and transfer_index (the commit mask marking which positions were filled in that step) are provided; return {"block_x": ...} from the callback to replace the window.

def on_step_end(pipe, step, timestep, callback_kwargs):
    block_x = callback_kwargs["block_x"]
    # Inspect or modify `block_x` here.
    return {"block_x": block_x}

out = pipe(
    prompt="Write a short poem.",
    callback_on_step_end=on_step_end,
    callback_on_step_end_tensor_inputs=["block_x"],
)

Recommended parameters

LLaDA2.1 models support two modes:

Mode      threshold   editing_threshold   max_post_steps
Quality   0.7         0.5                 16
Speed     0.5         None                16

Pass editing_threshold=None, 0.0, or a negative value to turn off post-mask editing.

For LLaDA2.0 models, disable editing by passing editing_threshold=None or 0.0.

For all models: block_length=32, temperature=0.0, num_inference_steps=32.
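These recommendations can be bundled into a small helper (hypothetical, not part of diffusers) so a call site only names the mode:

```python
# Hypothetical helper (not part of diffusers): map a mode name to the
# recommended LLaDA2.1 settings from the table above.
def llada2_mode_kwargs(mode):
    common = {"block_length": 32, "temperature": 0.0, "num_inference_steps": 32}
    modes = {
        "quality": {"threshold": 0.7, "editing_threshold": 0.5, "max_post_steps": 16},
        "speed": {"threshold": 0.5, "editing_threshold": None, "max_post_steps": 16},
    }
    return {**common, **modes[mode]}

# e.g. output = pipe(prompt="...", gen_length=256, **llada2_mode_kwargs("speed"))
```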

LLaDA2Pipeline

class diffusers.LLaDA2Pipeline


( model: Any scheduler: BlockRefinementScheduler tokenizer: Any | None = None )

Pipeline for LLaDA2-style discrete diffusion text generation via block-wise iterative refinement.

This pipeline maintains a template sequence filled with a mask_token_id and refines it in blocks. In each refinement step, it samples candidate tokens for the active block and commits a subset based on confidence.

The model is expected to accept an attention mask and position_ids, and to return logits of shape [batch, seq, vocab_size].
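That contract can be illustrated with a minimal stand-in (for illustration only, and assuming a transformers-style output object carrying a .logits attribute):

```python
import torch

# Minimal stand-in (illustration only) for the interface the pipeline
# expects: accept input_ids plus attention_mask and position_ids, and
# return an output whose .logits has shape [batch, seq, vocab_size].
class DummyOutput:
    def __init__(self, logits):
        self.logits = logits

class DummyModel(torch.nn.Module):
    vocab_size = 16

    def forward(self, input_ids, attention_mask=None, position_ids=None):
        batch, seq = input_ids.shape
        return DummyOutput(torch.zeros(batch, seq, self.vocab_size))

out = DummyModel()(torch.zeros(2, 8, dtype=torch.long))
```

Any model honoring this shape convention, such as the LLaDA2 checkpoints loaded via AutoModelForCausalLM above, can be passed as the pipeline's model.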

__call__


( prompt: str | list[str] | None = None messages: list[dict[str, str]] | None = None input_ids: torch.LongTensor | None = None use_chat_template: bool = True add_generation_prompt: bool = True gen_length: int = 2048 block_length: int = 32 num_inference_steps: int = 32 temperature: float = 0.0 top_p: float | None = None top_k: int | None = None sampling_method: str = 'multinomial' threshold: float = 0.7 editing_threshold: float | None = 0.5 max_post_steps: int = 16 minimal_topk: int = 1 eos_early_stop: bool = True eos_token_id: int | None = None mask_token_id: int | None = None generator: torch.Generator | None = None output_type: str = 'text' return_dict: bool = True callback_on_step_end: Callable[[int, int, dict], None] | PipelineCallback | MultiPipelineCallbacks | None = None callback_on_step_end_tensor_inputs: list[str] | None = None )

Parameters

  • prompt (str or List[str], optional) — Prompt text. When use_chat_template is True (default) and a tokenizer with a chat template is available, the prompt is wrapped in a chat message before tokenization.
  • messages (List[Dict[str, str]], optional) — Chat messages to encode (e.g. [{"role": "user", "content": "Hello"}]). Takes precedence over prompt when provided. Requires a tokenizer with apply_chat_template.
  • input_ids (torch.LongTensor, optional) — Pre-tokenized input IDs. Takes precedence over prompt and messages.
  • use_chat_template (bool, defaults to True) — Whether to wrap the prompt in a chat template.
  • add_generation_prompt (bool, defaults to True) — Whether to add the generation prompt when using chat templates.
  • gen_length (int) — Number of tokens to generate.
  • block_length (int) — Block size for refinement.
  • num_inference_steps (int) — Number of refinement steps per block.
  • temperature (float) — Sampling temperature.
  • top_p (float, optional) — Nucleus sampling cutoff.
  • top_k (int, optional) — Top-k sampling cutoff.
  • sampling_method (str) — Sampling method (auto, greedy, multinomial).
  • threshold (float) — Confidence threshold for committing tokens.
  • editing_threshold (float, optional) — Confidence threshold for editing already-committed (non-mask) tokens. When positive, after all mask tokens in a block are resolved, the pipeline continues refining: if the model predicts a different token with confidence above this threshold, the existing token is replaced. Set to None, 0.0, or a negative value to disable editing. Defaults to 0.5.
  • max_post_steps (int) — Maximum number of additional refinement iterations after all mask tokens in a block are resolved. Only used when editing_threshold is enabled. Defaults to 16.
  • minimal_topk (int) — Minimum number of tokens to commit per step.
  • eos_early_stop (bool) — Whether to stop after committing EOS in a block.
  • eos_token_id (int, optional) — EOS token ID to use for early stopping.
  • mask_token_id (int, optional) — Mask token ID to use for the template.
  • generator (torch.Generator, optional) — RNG for sampling.
  • output_type (str, defaults to "text") — Output format. "text" decodes sequences into strings (requires a tokenizer). "seq" returns raw token ID sequences only.
  • return_dict (bool, optional, defaults to True) — Whether to return a LLaDA2PipelineOutput instead of a tuple.
  • callback_on_step_end (Callable or PipelineCallback, optional) — Callback executed after each refinement step with signature callback_on_step_end(self, step: int, timestep: int, callback_kwargs: Dict).
  • callback_on_step_end_tensor_inputs (List[str], optional) — Tensor keys to pass to the callback. Allowed keys: block_x, x0, x0_p, transfer_index, confidence, active_block.
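The post-mask editing controlled by editing_threshold can be sketched in isolation (a simplified illustration, not the pipeline's actual code):

```python
import torch

# Toy sketch of one post-mask editing pass (illustration only): once a
# block has no masks left, replace a committed token when the model now
# prefers a different token with confidence above editing_threshold.
def edit_step(logits, block, editing_threshold):
    probs = torch.softmax(logits, dim=-1)
    conf, candidates = probs.max(dim=-1)
    edit = (candidates != block) & (conf >= editing_threshold)
    return torch.where(edit, candidates, block), edit

block = torch.tensor([2, 5])  # fully committed block, no masks left
logits = torch.zeros(2, 10)
logits[0, 7] = 5.0  # the model now strongly prefers token 7 at position 0
logits[1, 5] = 5.0  # position 1 agrees with the committed token
block, edited = edit_step(logits, block, editing_threshold=0.5)
# position 0 is rewritten to token 7; position 1 is left alone
```

The pipeline repeats such passes for at most max_post_steps iterations per block.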

Generate text with block-wise refinement.

Examples:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from diffusers import BlockRefinementScheduler, LLaDA2Pipeline

>>> model_id = "inclusionAI/LLaDA2.1-mini"
>>> model = AutoModelForCausalLM.from_pretrained(
...     model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
... )
>>> tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
>>> scheduler = BlockRefinementScheduler()

>>> pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
>>> output = pipe(prompt="What is the meaning of life?", gen_length=256)
>>> print(output.texts[0])

LLaDA2PipelineOutput

class diffusers.LLaDA2PipelineOutput


( sequences: torch.LongTensor texts: list[str] | None = None )
