	# LLaDA2

	[LLaDA2](https://huggingface.co/collections/inclusionAI/llada21) is a family of discrete diffusion language models
	that generate text through block-wise iterative refinement. Instead of autoregressive token-by-token generation,
	LLaDA2 starts with a fully masked sequence and progressively unmasks tokens by confidence over multiple refinement
	steps.
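The confidence-driven unmasking described above can be illustrated with a toy loop. This is only a sketch under stated assumptions: `toy_predict` is a hypothetical stand-in for the model (it returns a placeholder token and a random confidence), and the rule "always commit the most confident candidate" is one simple way to guarantee progress, not a description of the actual pipeline code.

```python
import random

MASK = None  # stand-in for the mask token


def toy_predict(position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    return f"tok{position}", rng.random()


def refine(length=8, threshold=0.7, max_steps=32, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * length  # start from a fully masked sequence
    steps_used = 0
    for _ in range(max_steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break  # nothing left to unmask
        # Predict a candidate token and a confidence for every masked position.
        candidates = {i: toy_predict(i, rng) for i in masked}
        # The most confident candidate is always committed so each step makes progress.
        best = max(masked, key=lambda i: candidates[i][1])
        for i in masked:
            token, confidence = candidates[i]
            if confidence >= threshold or i == best:
                seq[i] = token
        steps_used += 1
    return seq, steps_used


sequence, steps = refine()
print(steps, sequence)
```

Raising `threshold` in this toy makes each step commit fewer tokens, so more steps are needed, which mirrors the quality versus speed trade-off of the real parameter.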

	## Usage

	```py
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	from diffusers import BlockRefinementScheduler, LLaDA2Pipeline

	model_id = "inclusionAI/LLaDA2.1-mini"
	model = AutoModelForCausalLM.from_pretrained(
	    model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
	)
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	scheduler = BlockRefinementScheduler()

	pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
	output = pipe(
	    prompt="Write a short poem about the ocean.",
	    gen_length=256,
	    block_length=32,
	    num_inference_steps=32,
	    threshold=0.7,
	    editing_threshold=0.5,
	    max_post_steps=16,
	    temperature=0.0,
	)
	print(output.texts[0])
	```

	## Callbacks

	Callbacks run after each refinement step. Pass `callback_on_step_end_tensor_inputs` to select which tensors are
	included in `callback_kwargs`. In the current implementation, `block_x` (the sequence window being refined) and
	`transfer_index` (mask-filling commit mask) are provided; return `{"block_x": ...}` from the callback to replace the
	window.

	```py
	def on_step_end(pipe, step, timestep, callback_kwargs):
	    block_x = callback_kwargs["block_x"]
	    # Inspect or modify `block_x` here.
	    return {"block_x": block_x}

	out = pipe(
	    prompt="Write a short poem.",
	    callback_on_step_end=on_step_end,
	    callback_on_step_end_tensor_inputs=["block_x"],
	)
	```
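As a further illustration, a callback can be used to observe how many positions are committed at each step. The sketch below assumes that `transfer_index` is a boolean mask whose `True` entries mark positions committed in the current step, and that it supports `.sum()` as a tensor would; neither detail is confirmed by this page. The names `commit_counts` and `count_commits` are hypothetical.

```python
commit_counts = []  # one entry per refinement step


def count_commits(pipe, step, timestep, callback_kwargs):
    # Assumption: `transfer_index` is a boolean mask; True marks a position
    # committed during this step.
    transfer_index = callback_kwargs["transfer_index"]
    commit_counts.append(int(transfer_index.sum()))
    # Hand the window back unchanged.
    return {"block_x": callback_kwargs["block_x"]}
```

To try it, pass `callback_on_step_end=count_commits` together with `callback_on_step_end_tensor_inputs=["block_x", "transfer_index"]` when calling the pipeline, then inspect `commit_counts` afterwards.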

	## Recommended parameters

	LLaDA2.1 models support two modes:

	| Mode | `threshold` | `editing_threshold` | `max_post_steps` |
	|------|-------------|---------------------|------------------|
	| Quality | 0.7 | 0.5 | 16 |
	| Speed | 0.5 | `None` | 16 |

	Pass `editing_threshold=None`, `0.0`, or a negative value to turn off post-mask editing.

	For LLaDA2.0 models, disable editing by passing `editing_threshold=None` or `0.0`.

	For all models: `block_length=32`, `temperature=0.0`, `num_inference_steps=32`.
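One convenient way to keep these recommendations in code is a small preset table. The sketch below copies the values from this section; the names `COMMON`, `MODES`, and `generation_kwargs` are hypothetical helpers and not part of diffusers.

```python
# Values copied from the recommendations in this section.
# The dict and helper names are hypothetical, not part of diffusers.
COMMON = {"block_length": 32, "temperature": 0.0, "num_inference_steps": 32}

MODES = {
    "quality": {"threshold": 0.7, "editing_threshold": 0.5, "max_post_steps": 16},
    "speed": {"threshold": 0.5, "editing_threshold": None, "max_post_steps": 16},
}


def generation_kwargs(mode="quality", **overrides):
    """Merge the shared settings, the mode preset, and any caller overrides."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    return {**COMMON, **MODES[mode], **overrides}


print(generation_kwargs("speed", gen_length=256))
```

The resulting dictionary can then be unpacked into the pipeline call, for example `pipe(prompt=..., **generation_kwargs("speed"))`.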

	## LLaDA2Pipeline[[diffusers.LLaDA2Pipeline]]
	#### diffusers.LLaDA2Pipeline[[diffusers.LLaDA2Pipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13360/src/diffusers/pipelines/llada2/pipeline_llada2.py#L59)

	Pipeline for LLaDA2-style discrete diffusion text generation via block-wise iterative refinement.

	This pipeline maintains a template sequence filled with a `mask_token_id` and refines it in blocks. In each
	refinement step, it samples candidate tokens for the active block and commits a subset based on confidence.

	The model is expected to accept an attention mask and `position_ids`, and to return logits of shape `[batch, seq,
	vocab_size]`.
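To make the template and block structure concrete, here is a minimal sketch of one plausible layout: the prompt tokens followed by `gen_length` mask tokens, with the generated region split into windows of `block_length`. The helpers `build_template` and `block_windows` are hypothetical, and the real pipeline may align blocks differently (for example, relative to the start of the full sequence rather than the end of the prompt).

```python
def build_template(prompt_ids, gen_length, mask_token_id):
    """Prompt tokens followed by `gen_length` mask tokens."""
    return list(prompt_ids) + [mask_token_id] * gen_length


def block_windows(prompt_len, gen_length, block_length):
    """(start, end) index pairs covering the generated region block by block."""
    total = prompt_len + gen_length
    windows = []
    for start in range(prompt_len, total, block_length):
        windows.append((start, min(start + block_length, total)))
    return windows


template = build_template([5, 6, 7], gen_length=4, mask_token_id=0)
print(template)
print(block_windows(prompt_len=3, gen_length=70, block_length=32))
```

Each window would then be refined in turn, with earlier blocks left fixed once they are complete.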

	#### __call__[[diffusers.LLaDA2Pipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13360/src/diffusers/pipelines/llada2/pipeline_llada2.py#L211)

	- prompt (`str` or `List[str]`, optional) --
	  Prompt text. When `use_chat_template` is `True` (default) and a tokenizer with a chat template is
	  available, the prompt is wrapped in a chat message before tokenization.
	- messages (`List[Dict[str, str]]`, optional) --
	  Chat messages to encode (e.g. `[{"role": "user", "content": "Hello"}]`). Takes precedence over `prompt`
	  when provided. Requires a tokenizer with `apply_chat_template`.
	- input_ids (`torch.LongTensor`, optional) --
	  Pre-tokenized input IDs. Takes precedence over `prompt` and `messages`.
	- use_chat_template (`bool`, defaults to `True`) --
	  Whether to wrap the prompt in a chat template.
	- add_generation_prompt (`bool`, defaults to `True`) --
	  Whether to add the generation prompt when using chat templates.
	- gen_length (`int`, defaults to `2048`) --
	  Number of tokens to generate.
	- block_length (`int`, defaults to `32`) --
	  Block size for refinement.
	- num_inference_steps (`int`, defaults to `32`) --
	  Number of refinement steps per block.
	- temperature (`float`, defaults to `0.0`) --
	  Sampling temperature.
	- top_p (`float`, optional) --
	  Nucleus sampling cutoff.
	- top_k (`int`, optional) --
	  Top-k sampling cutoff.
	- sampling_method (`str`, defaults to `"multinomial"`) --
	  Sampling method (`auto`, `greedy`, `multinomial`).
	- threshold (`float`, defaults to `0.7`) --
	  Confidence threshold for committing tokens.
	- editing_threshold (`float`, optional, defaults to `0.5`) --
	  Confidence threshold for editing already-committed (non-mask) tokens. When positive, after all mask
	  tokens in a block are resolved, the pipeline continues refining: if the model predicts a different
	  token with confidence above this threshold, the existing token is replaced. Set to `None`, `0.0`, or a
	  negative value to disable editing.
	- max_post_steps (`int`, defaults to `16`) --
	  Maximum number of additional refinement iterations after all mask tokens in a block are resolved. Only
	  used when `editing_threshold` is enabled.
	- minimal_topk (`int`, defaults to `1`) --
	  Minimum number of tokens to commit per step.
	- eos_early_stop (`bool`, defaults to `True`) --
	  Whether to stop after committing EOS in a block.
	- eos_token_id (`int`, optional) --
	  EOS token ID to use for early stopping.
	- mask_token_id (`int`, optional) --
	  Mask token ID to use for the template.
	- generator (`torch.Generator`, optional) --
	  RNG for sampling.
	- output_type (`str`, defaults to `"text"`) --
	  Output format. `"text"` decodes sequences into strings (requires a tokenizer). `"seq"` returns raw
	  token ID sequences only.
	- return_dict (`bool`, optional, defaults to `True`) --
	  Whether to return a [LLaDA2PipelineOutput](/docs/diffusers/pr_13360/en/api/pipelines/llada2#diffusers.LLaDA2PipelineOutput) instead of a tuple.
	- callback_on_step_end (`Callable`, `PipelineCallback`, or `MultiPipelineCallbacks`, optional) --
	  Callback executed after each refinement step with signature `callback_on_step_end(self, step: int,
	  timestep: int, callback_kwargs: Dict)`.
	- callback_on_step_end_tensor_inputs (`List[str]`, optional) --
	  Tensor keys to pass to the callback. Allowed keys: `block_x`, `x0`, `x0_p`, `transfer_index`,
	  `confidence`, `active_block`.
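The interaction between `editing_threshold` and `max_post_steps` can be sketched with a toy editing loop. This is an illustration of the behaviour described for those two parameters, not the pipeline's code: `post_edit` and the `predict` callable are hypothetical, and stopping early once no token changes is an assumption made here.

```python
def post_edit(tokens, predict, editing_threshold=0.5, max_post_steps=16):
    """Toy editing pass over a block whose mask tokens are all resolved.

    `predict(tokens)` is a hypothetical model stand-in returning one
    (token, confidence) pair per position.
    """
    tokens = list(tokens)
    if editing_threshold is None or editing_threshold <= 0.0:
        return tokens  # editing disabled
    for _ in range(max_post_steps):
        changed = False
        for i, (new_token, confidence) in enumerate(predict(tokens)):
            # Replace only when the model prefers a different token
            # with confidence above the threshold.
            if new_token != tokens[i] and confidence > editing_threshold:
                tokens[i] = new_token
                changed = True
        if not changed:
            break  # assumption: stop once the block is stable
    return tokens


def toy_predict(tokens):
    # Always proposes the same tokens with fixed confidences.
    return [("the", 0.9), ("cat", 0.4), ("sat", 0.8)]


print(post_edit(["a", "dog", "sat"], toy_predict))
```

In this toy run the first token is replaced because the proposal differs and its confidence exceeds the threshold, the second is kept because the confidence is too low, and the third is kept because the proposal matches the existing token.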

	Generate text with block-wise refinement.

	Examples:
	```python
	>>> import torch
	>>> from transformers import AutoModelForCausalLM, AutoTokenizer
	>>> from diffusers import BlockRefinementScheduler, LLaDA2Pipeline

	>>> model_id = "inclusionAI/LLaDA2.1-mini"
	>>> model = AutoModelForCausalLM.from_pretrained(
	...     model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
	... )
	>>> tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	>>> scheduler = BlockRefinementScheduler()

	>>> pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
	>>> output = pipe(prompt="What is the meaning of life?", gen_length=256)
	>>> print(output.texts[0])
	```

	## LLaDA2PipelineOutput[[diffusers.LLaDA2PipelineOutput]]
	#### diffusers.LLaDA2PipelineOutput[[diffusers.LLaDA2PipelineOutput]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13360/src/diffusers/pipelines/llada2/pipeline_llada2.py#L54)
