# LLaDA2
[LLaDA2](https://huggingface.co/collections/inclusionAI/llada21) is a family of discrete diffusion language models
that generate text through block-wise iterative refinement. Instead of autoregressive token-by-token generation,
LLaDA2 starts with a fully masked sequence and progressively unmasks tokens by confidence over multiple refinement
steps.
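
The idea of confidence-based unmasking can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the pipeline's implementation: `toy_predict` is a hypothetical stand-in for the model that returns random confidences, `MASK` is a placeholder for the mask token, and committing at least one token per step is a choice made here so the loop always makes progress.

```python
import random

MASK = None  # hypothetical placeholder for the mask token


def toy_predict(position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    return f"tok{position}", rng.random()


def refine_block(block_length=8, threshold=0.7, max_steps=32, seed=0):
    """Unmask one block by confidence; returns (tokens, steps used)."""
    rng = random.Random(seed)
    block = [MASK] * block_length  # start from a fully masked block
    steps_used = 0
    while steps_used < max_steps:
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        # Predict a candidate token and a confidence for each masked position.
        candidates = {i: toy_predict(i, rng) for i in masked}
        # Commit every candidate whose confidence clears the threshold.
        chosen = [i for i in masked if candidates[i][1] >= threshold]
        # Assumption of this sketch: always commit the most confident one.
        if not chosen:
            chosen = [max(masked, key=lambda i: candidates[i][1])]
        for i in chosen:
            block[i] = candidates[i][0]
        steps_used += 1
    return block, steps_used


block, steps_used = refine_block()
print(f"resolved {len(block)} tokens in {steps_used} steps")
```

A higher `threshold` in this toy commits fewer tokens per step and therefore needs more steps, which mirrors the quality versus speed trade-off described for the real pipeline.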
## Usage
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from diffusers import BlockRefinementScheduler, LLaDA2Pipeline

model_id = "inclusionAI/LLaDA2.1-mini"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
scheduler = BlockRefinementScheduler()

pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
output = pipe(
    prompt="Write a short poem about the ocean.",
    gen_length=256,
    block_length=32,
    num_inference_steps=32,
    threshold=0.7,
    editing_threshold=0.5,
    max_post_steps=16,
    temperature=0.0,
)
print(output.texts[0])
```
## Callbacks
Callbacks run after each refinement step. Pass `callback_on_step_end_tensor_inputs` to select which tensors are
included in `callback_kwargs`. In the current implementation, `block_x` (the sequence window being refined) and
`transfer_index` (mask-filling commit mask) are provided; return `{"block_x": ...}` from the callback to replace the
window.
```py
def on_step_end(pipe, step, timestep, callback_kwargs):
    block_x = callback_kwargs["block_x"]
    # Inspect or modify `block_x` here.
    return {"block_x": block_x}


out = pipe(
    prompt="Write a short poem.",
    callback_on_step_end=on_step_end,
    callback_on_step_end_tensor_inputs=["block_x"],
)
```
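
A callback can also be a callable object that keeps state between steps. The following is a minimal sketch, assuming the callback signature documented on this page; `StepRecorder` is a hypothetical helper name, and it returns `callback_kwargs` unchanged so that nothing in the window is replaced.

```python
class StepRecorder:
    """Hypothetical helper: records the step index and the keys it received."""

    def __init__(self):
        self.steps = []

    def __call__(self, pipe, step, timestep, callback_kwargs):
        # Store only lightweight metadata, not the tensors themselves.
        self.steps.append((step, sorted(callback_kwargs)))
        # Hand the inputs back unmodified.
        return callback_kwargs


recorder = StepRecorder()
```

To use it, pass `callback_on_step_end=recorder` together with `callback_on_step_end_tensor_inputs=["block_x", "transfer_index"]` when calling the pipeline, then inspect `recorder.steps` afterwards.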
## Recommended parameters
LLaDA2.1 models support two modes:

| Mode | `threshold` | `editing_threshold` | `max_post_steps` |
|------|-------------|---------------------|------------------|
| Quality | 0.7 | 0.5 | 16 |
| Speed | 0.5 | `None` | 16 |

Pass `editing_threshold=None`, `0.0`, or a negative value to turn off post-mask editing.
For LLaDA2.0 models, disable editing by passing `editing_threshold=None` or `0.0`.

For all models: `block_length=32`, `temperature=0.0`, `num_inference_steps=32`.
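
The recommendations above can be collected into a small helper that builds keyword arguments for the pipeline call. This is a convenience sketch, not part of diffusers: `COMMON`, `MODES`, `llada2_kwargs`, and `editing_enabled` are hypothetical names, and `editing_enabled` only restates the rule given on this page (`None`, `0.0`, or a negative value turns editing off).

```python
# Values taken from the table and the "for all models" line above.
COMMON = {"block_length": 32, "temperature": 0.0, "num_inference_steps": 32}
MODES = {
    "quality": {"threshold": 0.7, "editing_threshold": 0.5, "max_post_steps": 16},
    "speed": {"threshold": 0.5, "editing_threshold": None, "max_post_steps": 16},
}


def llada2_kwargs(mode="quality", **overrides):
    """Merge the shared settings, the chosen mode, and any caller overrides."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    return {**COMMON, **MODES[mode], **overrides}


def editing_enabled(editing_threshold):
    """True only for a positive threshold, following the rule stated above."""
    return editing_threshold is not None and editing_threshold > 0.0


speed_kwargs = llada2_kwargs("speed", gen_length=256)
```

The resulting dictionary can be unpacked into the call, for example `pipe(prompt="...", **speed_kwargs)`.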
## LLaDA2Pipeline[[diffusers.LLaDA2Pipeline]]
#### diffusers.LLaDA2Pipeline[[diffusers.LLaDA2Pipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13360/src/diffusers/pipelines/llada2/pipeline_llada2.py#L59)
Pipeline for LLaDA2-style discrete diffusion text generation via block-wise iterative refinement.
This pipeline maintains a template sequence filled with a `mask_token_id` and refines it in blocks. In each
refinement step, it samples candidate tokens for the active block and commits a subset based on confidence.
The model is expected to accept an attention mask and `position_ids`, and to return logits of shape `[batch, seq,
vocab_size]`.
#### __call__[[diffusers.LLaDA2Pipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13360/src/diffusers/pipelines/llada2/pipeline_llada2.py#L211)

Generate text with block-wise refinement.

**Parameters:**

- **prompt** (`str` or `List[str]`, *optional*) --
  Prompt text. When `use_chat_template` is `True` (default) and a tokenizer with a chat template is
  available, the prompt is wrapped in a chat message before tokenization.
- **messages** (`List[Dict[str, str]]`, *optional*) --
  Chat messages to encode (e.g. `[{"role": "user", "content": "Hello"}]`). Takes precedence over `prompt`
  when provided. Requires a tokenizer with `apply_chat_template`.
- **input_ids** (`torch.LongTensor`, *optional*) --
  Pre-tokenized input IDs. Takes precedence over `prompt` and `messages`.
- **use_chat_template** (`bool`, defaults to `True`) --
  Whether to wrap the prompt in a chat template.
- **add_generation_prompt** (`bool`, defaults to `True`) --
  Whether to add the generation prompt when using chat templates.
- **gen_length** (`int`, defaults to `2048`) --
  Number of tokens to generate.
- **block_length** (`int`, defaults to `32`) --
  Block size for refinement.
- **num_inference_steps** (`int`, defaults to `32`) --
  Number of refinement steps per block.
- **temperature** (`float`, defaults to `0.0`) --
  Sampling temperature.
- **top_p** (`float`, *optional*) --
  Nucleus sampling cutoff.
- **top_k** (`int`, *optional*) --
  Top-k sampling cutoff.
- **sampling_method** (`str`, defaults to `"multinomial"`) --
  Sampling method (`auto`, `greedy`, `multinomial`).
- **threshold** (`float`, defaults to `0.7`) --
  Confidence threshold for committing tokens.
- **editing_threshold** (`float`, *optional*) --
  Confidence threshold for editing already-committed (non-mask) tokens. When positive, after all mask
  tokens in a block are resolved, the pipeline continues refining: if the model predicts a different
  token with confidence above this threshold, the existing token is replaced. Set to `None`, `0.0`, or a
  negative value to disable editing. Defaults to `0.5`.
- **max_post_steps** (`int`) --
  Maximum number of additional refinement iterations after all mask tokens in a block are resolved. Only
  used when `editing_threshold` is enabled. Defaults to `16`.
- **minimal_topk** (`int`, defaults to `1`) --
  Minimum number of tokens to commit per step.
- **eos_early_stop** (`bool`, defaults to `True`) --
  Whether to stop after committing EOS in a block.
- **eos_token_id** (`int`, *optional*) --
  EOS token ID to use for early stopping.
- **mask_token_id** (`int`, *optional*) --
  Mask token ID to use for the template.
- **generator** (`torch.Generator`, *optional*) --
  RNG for sampling.
- **output_type** (`str`, defaults to `"text"`) --
  Output format. `"text"` decodes sequences into strings (requires a tokenizer). `"seq"` returns raw
  token ID sequences only.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether to return a [LLaDA2PipelineOutput](/docs/diffusers/pr_13360/en/api/pipelines/llada2#diffusers.LLaDA2PipelineOutput) instead of a tuple.
- **callback_on_step_end** (`Callable`, `PipelineCallback`, or `MultiPipelineCallbacks`, *optional*) --
  Callback executed after each refinement step with signature `callback_on_step_end(self, step: int,
  timestep: int, callback_kwargs: Dict)`.
- **callback_on_step_end_tensor_inputs** (`List[str]`, *optional*) --
  Tensor keys to pass to the callback. Allowed keys: `block_x`, `x0`, `x0_p`, `transfer_index`,
  `confidence`, `active_block`.

Examples:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from diffusers import BlockRefinementScheduler, LLaDA2Pipeline
>>> model_id = "inclusionAI/LLaDA2.1-mini"
>>> model = AutoModelForCausalLM.from_pretrained(
... model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
... )
>>> tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
>>> scheduler = BlockRefinementScheduler()
>>> pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
>>> output = pipe(prompt="What is the meaning of life?", gen_length=256)
>>> print(output.texts[0])
```
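
The documented precedence between `input_ids`, `messages`, and `prompt` can be summarized in a few lines. This is an illustrative sketch of the rule as stated in the parameter descriptions, not the pipeline's own code; `resolve_input_source` is a hypothetical name, and raising an error when nothing is supplied is an assumption of the sketch.

```python
def resolve_input_source(prompt=None, messages=None, input_ids=None):
    """Return the name of the argument that would be used as model input."""
    # `input_ids` takes precedence over both `messages` and `prompt`.
    if input_ids is not None:
        return "input_ids"
    # `messages` takes precedence over `prompt`.
    if messages is not None:
        return "messages"
    if prompt is not None:
        return "prompt"
    # Assumption of this sketch: at least one input must be given.
    raise ValueError("provide one of `prompt`, `messages`, or `input_ids`")


source = resolve_input_source(
    prompt="Hello",
    messages=[{"role": "user", "content": "Hello"}],
)
```

Here `source` is `"messages"`, because chat messages are preferred over a plain prompt when both are passed.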
## LLaDA2PipelineOutput[[diffusers.LLaDA2PipelineOutput]]
#### diffusers.LLaDA2PipelineOutput[[diffusers.LLaDA2PipelineOutput]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13360/src/diffusers/pipelines/llada2/pipeline_llada2.py#L54)
