| # LLaDA2 | |
| [LLaDA2](https://huggingface.co/collections/inclusionAI/llada21) is a family of discrete diffusion language models | |
| that generate text through block-wise iterative refinement. Instead of autoregressive token-by-token generation, | |
| LLaDA2 starts with a fully masked sequence and progressively unmasks tokens by confidence over multiple refinement | |
| steps. | |
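To make the unmasking loop concrete, here is a toy, pure-Python sketch of confidence-based refinement for a single block. It is not the pipeline's implementation: `refine_block`, `toy_predict`, and the confidence formula are hypothetical stand-ins for a real model call, and only `threshold` and `minimal_topk` mirror pipeline arguments.

```python
MASK = None  # stand-in for the mask token (the real pipeline uses `mask_token_id`)
TARGET = ["the", "sea", "is", "wide"]  # what the toy "model" will eventually predict


def toy_predict(block):
    """Hypothetical model stub: returns one (token, confidence) pair per position.

    Confidence grows as more positions are already unmasked.
    """
    known = sum(tok is not MASK for tok in block)
    conf = 0.3 + 0.25 * known
    return [(tok, conf) for tok in TARGET]


def refine_block(predict, block_length, threshold=0.7, minimal_topk=1, max_steps=32):
    """Start fully masked, then commit predictions by confidence step by step."""
    block = [MASK] * block_length
    steps_used = 0
    for _ in range(max_steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break  # every position in the block has been committed
        preds = predict(block)
        # Commit every masked position whose confidence clears the threshold.
        commit = [i for i in masked if preds[i][1] >= threshold]
        if len(commit) < minimal_topk:
            # Too few confident positions: fall back to the most confident ones.
            ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            commit = ranked[:minimal_topk]
        for i in commit:
            block[i] = preds[i][0]
        steps_used += 1
    return block, steps_used


block, steps = refine_block(toy_predict, block_length=4)
print(block, steps)
```

In this toy run the first two steps commit one token each through the `minimal_topk` fallback, and the third step commits the remaining two once the confidence exceeds `threshold`.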
| ## Usage | |
| ```py | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| from diffusers import BlockRefinementScheduler, LLaDA2Pipeline | |
| model_id = "inclusionAI/LLaDA2.1-mini" | |
| model = AutoModelForCausalLM.from_pretrained( | |
| model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto" | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | |
| scheduler = BlockRefinementScheduler() | |
| pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer) | |
| output = pipe( | |
| prompt="Write a short poem about the ocean.", | |
| gen_length=256, | |
| block_length=32, | |
| num_inference_steps=32, | |
| threshold=0.7, | |
| editing_threshold=0.5, | |
| max_post_steps=16, | |
| temperature=0.0, | |
| ) | |
| print(output.texts[0]) | |
| ``` | |
| ## Callbacks | |
| Callbacks run after each refinement step. Pass `callback_on_step_end_tensor_inputs` to select which tensors are | |
| included in `callback_kwargs`. In the current implementation, `block_x` (the sequence window being refined) and | |
| `transfer_index` (mask-filling commit mask) are provided; return `{"block_x": ...}` from the callback to replace the | |
| window. | |
| ```py | |
| def on_step_end(pipe, step, timestep, callback_kwargs): | |
| block_x = callback_kwargs["block_x"] | |
| # Inspect or modify `block_x` here. | |
| return {"block_x": block_x} | |
| out = pipe( | |
| prompt="Write a short poem.", | |
| callback_on_step_end=on_step_end, | |
| callback_on_step_end_tensor_inputs=["block_x"], | |
| ) | |
| ``` | |
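A callback can also be used purely for inspection. The sketch below records which step ran and which tensor keys were passed; `make_step_recorder` is a hypothetical helper, not part of diffusers, and returning `block_x` unchanged follows the example above.

```python
def make_step_recorder():
    """Build a callback plus the list it appends to (hypothetical helper)."""
    history = []

    def on_step_end(pipe, step, timestep, callback_kwargs):
        # Record the step, the timestep, and the names of the tensors received.
        history.append((step, timestep, sorted(callback_kwargs)))
        # Hand `block_x` back unchanged so the refinement is not altered.
        return {"block_x": callback_kwargs["block_x"]}

    return on_step_end, history


on_step_end, history = make_step_recorder()

# Assumed usage with an already constructed `pipe`:
# out = pipe(
#     prompt="Write a short poem.",
#     callback_on_step_end=on_step_end,
#     callback_on_step_end_tensor_inputs=["block_x", "transfer_index"],
# )
```

After the call, `history` holds one entry per refinement step, which is enough to check how many steps each run actually used.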
| ## Recommended parameters | |
| LLaDA2.1 models support two modes: | |
| | Mode | `threshold` | `editing_threshold` | `max_post_steps` | | |
| |------|-------------|---------------------|------------------| | |
| | Quality | 0.7 | 0.5 | 16 | | |
| | Speed | 0.5 | `None` | 16 | | |
| Pass `editing_threshold=None`, `0.0`, or a negative value to turn off post-mask editing. | |
| For LLaDA2.0 models, disable editing by passing `editing_threshold=None` or `0.0`. | |
| For all models: `block_length=32`, `temperature=0.0`, `num_inference_steps=32`. | |
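One way to keep these settings in a single place is a small lookup helper. This is only a convenience sketch: `recommended_kwargs` is a hypothetical name, and the values are copied from the table and the line above.

```python
# Per-mode values from the table above.
RECOMMENDED = {
    "quality": {"threshold": 0.7, "editing_threshold": 0.5, "max_post_steps": 16},
    "speed": {"threshold": 0.5, "editing_threshold": None, "max_post_steps": 16},
}

# Values recommended for all models.
COMMON = {"block_length": 32, "temperature": 0.0, "num_inference_steps": 32}


def recommended_kwargs(mode="quality"):
    """Return keyword arguments for the pipeline call (hypothetical helper)."""
    if mode not in RECOMMENDED:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(RECOMMENDED)}")
    return {**COMMON, **RECOMMENDED[mode]}


speed = recommended_kwargs("speed")
print(speed)

# Assumed usage with an already constructed `pipe`:
# output = pipe(prompt="Write a short poem about the ocean.", gen_length=256, **speed)
```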
| ## LLaDA2Pipeline[[diffusers.LLaDA2Pipeline]] | |
| #### diffusers.LLaDA2Pipeline[[diffusers.LLaDA2Pipeline]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13486/src/diffusers/pipelines/llada2/pipeline_llada2.py#L59) | |
| Pipeline for LLaDA2-style discrete diffusion text generation via block-wise iterative refinement. | |
| This pipeline maintains a template sequence filled with a `mask_token_id` and refines it in blocks. In each | |
| refinement step, it samples candidate tokens for the active block and commits a subset based on confidence. | |
| The model is expected to accept an attention mask and `position_ids`, and to return logits of shape `[batch, seq, | |
| vocab_size]`. | |
| #### __call__[[diffusers.LLaDA2Pipeline.__call__]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13486/src/diffusers/pipelines/llada2/pipeline_llada2.py#L211) | |
| `__call__(prompt: str | list[str] | None = None, messages: list[dict[str, str]] | None = None, input_ids: torch.LongTensor | None = None, use_chat_template: bool = True, add_generation_prompt: bool = True, gen_length: int = 2048, block_length: int = 32, num_inference_steps: int = 32, temperature: float = 0.0, top_p: float | None = None, top_k: int | None = None, sampling_method: str = 'multinomial', threshold: float = 0.7, editing_threshold: float | None = 0.5, max_post_steps: int = 16, minimal_topk: int = 1, eos_early_stop: bool = True, eos_token_id: int | None = None, mask_token_id: int | None = None, generator: torch.Generator | None = None, output_type: str = 'text', return_dict: bool = True, callback_on_step_end: Callable[[int, int, dict], None] | PipelineCallback | MultiPipelineCallbacks | None = None, callback_on_step_end_tensor_inputs: list[str] | None = None)` | |
| - **prompt** (`str` or `List[str]`, *optional*) -- | |
| Prompt text. When `use_chat_template` is `True` (default) and a tokenizer with a chat template is | |
| available, the prompt is wrapped in a chat message before tokenization. | |
| - **messages** (`List[Dict[str, str]]`, *optional*) -- | |
| Chat messages to encode (e.g. `[{"role": "user", "content": "Hello"}]`). Takes precedence over `prompt` | |
| when provided. Requires a tokenizer with `apply_chat_template`. | |
| - **input_ids** (`torch.LongTensor`, *optional*) -- | |
| Pre-tokenized input IDs. Takes precedence over `prompt` and `messages`. | |
| - **use_chat_template** (`bool`, defaults to `True`) -- | |
| Whether to wrap the prompt in a chat template. | |
| - **add_generation_prompt** (`bool`, defaults to `True`) -- | |
| Whether to add the generation prompt when using chat templates. | |
| - **gen_length** (`int`, defaults to `2048`) -- | |
| Number of tokens to generate. | |
| - **block_length** (`int`, defaults to `32`) -- | |
| Number of tokens in each block that is refined together. | |
| - **num_inference_steps** (`int`, defaults to `32`) -- | |
| Number of refinement steps per block. | |
| - **temperature** (`float`, defaults to `0.0`) -- | |
| Sampling temperature. | |
| - **top_p** (`float`, *optional*) -- | |
| Nucleus sampling cutoff. | |
| - **top_k** (`int`, *optional*) -- | |
| Top-k sampling cutoff. | |
| - **sampling_method** (`str`, defaults to `"multinomial"`) -- | |
| Sampling method (`auto`, `greedy`, `multinomial`). | |
| - **threshold** (`float`, defaults to `0.7`) -- | |
| Confidence threshold for committing tokens. | |
| - **editing_threshold** (`float`, *optional*) -- | |
| Confidence threshold for editing already-committed (non-mask) tokens. When positive, after all mask | |
| tokens in a block are resolved, the pipeline continues refining: if the model predicts a different | |
| token with confidence above this threshold, the existing token is replaced. Set to `None`, `0.0`, or a | |
| negative value to disable editing. Defaults to `0.5`. | |
| - **max_post_steps** (`int`) -- | |
| Maximum number of additional refinement iterations after all mask tokens in a block are resolved. Only | |
| used when `editing_threshold` is enabled. Defaults to `16`. | |
| - **minimal_topk** (`int`, defaults to `1`) -- | |
| Minimum number of tokens to commit per step. | |
| - **eos_early_stop** (`bool`, defaults to `True`) -- | |
| Whether to stop after committing EOS in a block. | |
| - **eos_token_id** (`int`, *optional*) -- | |
| EOS token ID to use for early stopping. | |
| - **mask_token_id** (`int`, *optional*) -- | |
| Mask token ID to use for the template. | |
| - **generator** (`torch.Generator`, *optional*) -- | |
| RNG for sampling. | |
| - **output_type** (`str`, defaults to `"text"`) -- | |
| Output format. `"text"` decodes sequences into strings (requires a tokenizer). `"seq"` returns raw | |
| token ID sequences only. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether to return a [LLaDA2PipelineOutput](/docs/diffusers/pr_13486/en/api/pipelines/llada2#diffusers.LLaDA2PipelineOutput) instead of a tuple. | |
| - **callback_on_step_end** (`Callable` or `PipelineCallback`, *optional*) -- | |
| Callback executed after each refinement step with signature `callback_on_step_end(self, step: int, | |
| timestep: int, callback_kwargs: Dict)`. | |
| - **callback_on_step_end_tensor_inputs** (`List[str]`, *optional*) -- | |
| Tensor keys to pass to the callback. Allowed keys: `block_x`, `x0`, `x0_p`, `transfer_index`, | |
| `confidence`, `active_block`. | |
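The precedence among the three input arguments described above can be summarized in a few lines. This is a sketch of the documented ordering only: `select_input` is a hypothetical function, and raising when nothing is supplied is an assumption, not confirmed pipeline behavior.

```python
def select_input(prompt=None, messages=None, input_ids=None):
    """Pick the input source using the documented precedence (hypothetical helper)."""
    if input_ids is not None:
        # Pre-tokenized IDs win over both `messages` and `prompt`.
        return "input_ids", input_ids
    if messages is not None:
        # Chat messages win over a plain prompt.
        return "messages", messages
    if prompt is not None:
        return "prompt", prompt
    # Assumption: treat a call with no input at all as an error.
    raise ValueError("provide one of `prompt`, `messages`, or `input_ids`")


source, value = select_input(
    prompt="Hello",
    messages=[{"role": "user", "content": "Hello"}],
)
print(source)
```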
| Generate text with block-wise refinement. | |
| Examples: | |
| ```python | |
| >>> import torch | |
| >>> from transformers import AutoModelForCausalLM, AutoTokenizer | |
| >>> from diffusers import BlockRefinementScheduler, LLaDA2Pipeline | |
| >>> model_id = "inclusionAI/LLaDA2.1-mini" | |
| >>> model = AutoModelForCausalLM.from_pretrained( | |
| ... model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto" | |
| ... ) | |
| >>> tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | |
| >>> scheduler = BlockRefinementScheduler() | |
| >>> pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer) | |
| >>> output = pipe(prompt="What is the meaning of life?", gen_length=256) | |
| >>> print(output.texts[0]) | |
| ``` | |
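The call above relies on the default `editing_threshold=0.5`. As a rough illustration of what post-mask editing means, the toy function below replaces a committed token only when a different prediction exceeds the threshold. `edit_committed` and the `(token, confidence)` pairs are hypothetical; the actual pipeline operates on tensors and may differ in detail.

```python
def edit_committed(tokens, preds, editing_threshold=0.5):
    """Return (edited_tokens, number_of_changes) for one editing pass (toy sketch)."""
    if editing_threshold is None or editing_threshold <= 0:
        # `None`, `0.0`, or a negative value turns editing off.
        return list(tokens), 0
    edited = list(tokens)
    changes = 0
    for i, (new_tok, conf) in enumerate(preds):
        # Replace only when the prediction differs and is confident enough.
        if new_tok != edited[i] and conf > editing_threshold:
            edited[i] = new_tok
            changes += 1
    return edited, changes


committed = ["the", "see", "is", "wide"]
preds = [("the", 0.9), ("sea", 0.8), ("was", 0.3), ("wide", 0.95)]
edited, changes = edit_committed(committed, preds)
print(edited, changes)
```

Here only the second token is replaced: the third prediction also differs, but its confidence of `0.3` stays below the threshold.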
| ## LLaDA2PipelineOutput[[diffusers.LLaDA2PipelineOutput]] | |
| #### diffusers.LLaDA2PipelineOutput[[diffusers.LLaDA2PipelineOutput]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13486/src/diffusers/pipelines/llada2/pipeline_llada2.py#L54) | |