Buckets:
| # DiffusionGemma | |
| DiffusionGemma is a block-diffusion encoder-decoder language model. A causal encoder reads the clean prompt (and any | |
| previously generated blocks) into a KV cache, and a bidirectional decoder denoises a fixed-size "canvas" of | |
| `canvas_length` tokens by cross-attending to that cache. Generation alternates an outer autoregressive loop over | |
| canvases with an inner denoising loop, where each step samples candidate tokens, commits the most confident ones via | |
| [BlockRefinementScheduler](/docs/diffusers/main/en/api/schedulers/block_refinement#diffusers.BlockRefinementScheduler) in uniform corruption mode, and renoises the rest. The model itself lives in | |
| `transformers` as `DiffusionGemmaForBlockDiffusion`; the released checkpoint is | |
| [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it). | |
| ## Usage | |
| ```py | |
| import torch | |
| from transformers import AutoProcessor, DiffusionGemmaForBlockDiffusion | |
| from diffusers import BlockRefinementScheduler, DiffusionGemmaPipeline | |
| model_id = "google/diffusiongemma-26B-A4B-it" | |
| model = DiffusionGemmaForBlockDiffusion.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto") | |
| processor = AutoProcessor.from_pretrained(model_id) | |
| scheduler = BlockRefinementScheduler() | |
| pipe = DiffusionGemmaPipeline(model=model, scheduler=scheduler, processor=processor) | |
| pipe.model.model.decoder = torch.compile(pipe.model.model.decoder, mode="reduce-overhead", fullgraph=True) | |
| output = pipe( | |
| prompt="Why is the sky blue?", | |
| gen_length=256, | |
| num_inference_steps=48, | |
| cache_implementation="static", | |
| ) | |
| print(output.texts[0]) | |
| ``` | |
| `num_inference_steps` is the number of denoising steps per canvas (48 matches the released checkpoint); fewer steps are | |
| faster but lower quality. `cache_implementation="static"` lets the decoder be `torch.compile`-d with cudagraphs (see | |
| [Static cache and compilation](#static-cache-and-compilation)); drop both for a simpler dynamic-cache run. | |
| For multi-turn or multimodal inputs, pass a raw `messages` conversation instead of `prompt`. It is a list of | |
| `{"role", "content"}` dicts in the usual chat format, which the processor runs through its chat template: | |
| ```py | |
| messages = [ | |
| {"role": "user", "content": "Why is the sky blue?"}, | |
| ] | |
| # or with an image: | |
| messages = [ | |
| { | |
| "role": "user", | |
| "content": [ | |
| {"type": "image", "image": image}, | |
| {"type": "text", "text": "Describe this image."}, | |
| ], | |
| }, | |
| ] | |
| output = pipe(messages=messages, gen_length=256) | |
| ``` | |
| For a single user turn you can skip `messages` and pass an `image` alongside the `prompt`; the processor turns it into | |
| the model's image inputs automatically. | |
| ## Schedulers | |
| The scheduler is the sampler that denoises each canvas, and it is interchangeable: swap it to change the sampling | |
| strategy without touching anything else. Three schedulers are available: | |
| - `BlockRefinementScheduler` (default): commits the most confident tokens each step (above `threshold`, plus an even | |
| per-step quota) and renoises the rest. `editing_threshold` additionally lets it re-edit already committed tokens. | |
| - `DiscreteDDIMScheduler`: samples each position from the exact discrete posterior of the uniform corruption process | |
| (D3PM). It is parameter free, and the final step deterministically commits the predicted tokens. | |
| - `EntropyBoundScheduler`: commits the lowest-entropy positions whose joint entropy stays under `entropy_bound`, so | |
| roughly independent tokens are accepted together. It anneals its sampling temperature from `t_max` (`0.8`) on the | |
| first step down to `t_min` (`0.4`) on the last, matching the released checkpoint's sampler. | |
| ```py | |
| from diffusers import DiscreteDDIMScheduler, EntropyBoundScheduler | |
| pipe.scheduler = DiscreteDDIMScheduler() | |
| # or: pipe.scheduler = EntropyBoundScheduler(entropy_bound=0.1) | |
| output = pipe(prompt="Why is the sky blue?", gen_length=256, num_inference_steps=48) | |
| print(output.texts[0]) | |
| ``` | |
| Scheduler-specific sampling knobs (the block-refinement `threshold`/`top_k`, the entropy bound, ...) are set on the | |
| scheduler config: | |
| ```py | |
| from diffusers import BlockRefinementScheduler | |
| pipe.scheduler = BlockRefinementScheduler.from_config(pipe.scheduler.config, threshold=0.9) | |
| ``` | |
| `EntropyBoundScheduler` anneals its sampling temperature (`t_max`/`t_min`) internally over the denoising steps; | |
| `DiscreteDDIMScheduler` and `BlockRefinementScheduler` use the flat `temperature` passed to the pipeline (`0.0` for | |
| greedy). | |
| ### Predictor-corrector sampling | |
| `DiscreteDDIMScheduler` supports the leave-one-out predictor-corrector of [Reparameterizing Uniform Diffusion Models](https://huggingface.co/papers/2605.22765). It refines the canvas with `corrector_steps` Gibbs sweeps that resample the least-confident positions from the one-coordinate conditional of the noisy marginal, which leaves that marginal invariant and improves generation at no extra training cost. It works directly on the released checkpoint: for uniform diffusion the denoiser and the leave-one-out posterior are interchangeable in closed form, so the corrector recovers the leave-one-out quantities it needs without any retraining. | |
| The corrector sweeps are folded into the `num_inference_steps` budget rather than added on top: the pipeline runs fewer predictor steps and spends the freed forwards on correctors, so the total number of model forwards stays `num_inference_steps` and the predictor-corrector costs the same as plain ancestral sampling. | |
| ```py | |
| from diffusers import DiscreteDDIMScheduler | |
| pipe.scheduler = DiscreteDDIMScheduler(corrector_steps=2, corrector_k=12) | |
| output = pipe(prompt="Why is the sky blue?", gen_length=256, num_inference_steps=48) | |
| print(output.texts[0]) | |
| ``` | |
| ## PEFT adapters | |
| The denoiser is a 🤗 Transformers model, so adapters are loaded through its native [PEFT](https://huggingface.co/docs/peft) integration rather than the diffusers `load_lora_weights` API. Because that integration is adapter-type-agnostic, the same calls load LoRA, DoRA, or any other PEFT adapter (e.g. the output of TRL's `SFTTrainer`). Manage adapters on the model component directly: | |
| ```py | |
| pipe.model.load_adapter("path/to/adapter", adapter_name="sft") # LoRA, DoRA, ... | |
| pipe.model.set_adapter("sft") | |
| output = pipe(prompt="Why is the sky blue?", gen_length=256) | |
| pipe.model.disable_adapters() # run the base model | |
| pipe.model.delete_adapter("sft") | |
| ``` | |
| Adapters stay active and unmerged: DiffusionGemma ties the encoder and decoder base weights, so fusing an adapter into them would corrupt both branches. | |
| ## Static cache and compilation | |
| The pipeline prefills the encoder once per block into a reusable cache (a `DynamicCache` by default). Passing | |
| `cache_implementation="static"` uses a fixed-shape `StaticCache` instead, whose shapes let you `torch.compile` the | |
| decoder with cudagraphs for a further speedup (the pipeline marks each step and clones the logits so cudagraph memory | |
| is not overwritten); this is the setup shown in [Usage](#usage). Drop both the `torch.compile` call and | |
| `cache_implementation="static"` for a simpler dynamic-cache run. | |
| ## Adaptive stopping | |
| A block usually converges before all `num_inference_steps` are spent, so by default the pipeline leaves a block's | |
| denoising loop early once every example's argmax prediction is stable for `stability_threshold` steps and the mean | |
| per-token entropy falls below `confidence_threshold` (`0.005`, the value used by the released checkpoint). This roughly | |
| halves the number of decoder forwards at matched quality and is the largest single throughput lever. Pass | |
| `confidence_threshold=None` to always run the full `num_inference_steps`: | |
| ```py | |
| output = pipe(prompt="Why is the sky blue?", gen_length=256, confidence_threshold=None) # disable adaptive stopping | |
| ``` | |
| ## Callbacks | |
| Callbacks run after each denoising step. Pass `callback_on_step_end_tensor_inputs` to select which tensors are | |
| included in `callback_kwargs`; `canvas` (the current block tokens) and `logits` are available. Return `{"canvas": ...}` | |
| from the callback to replace the canvas. | |
| ```py | |
| def on_step_end(pipe, step, timestep, callback_kwargs): | |
| canvas = callback_kwargs["canvas"] | |
| # Inspect or modify `canvas` here. | |
| return {"canvas": canvas} | |
| out = pipe( | |
| prompt="Why is the sky blue?", | |
| callback_on_step_end=on_step_end, | |
| callback_on_step_end_tensor_inputs=["canvas"], | |
| ) | |
| ``` | |
| ## DiffusionGemmaPipeline[[diffusers.DiffusionGemmaPipeline]] | |
| #### diffusers.DiffusionGemmaPipeline[[diffusers.DiffusionGemmaPipeline]] | |
| [Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/diffusion_gemma/pipeline_diffusion_gemma.py#L53) | |
| Pipeline for DiffusionGemma block-diffusion text generation. | |
| DiffusionGemma is a block-diffusion encoder-decoder model: a causal encoder reads the clean prompt (and any | |
| previously generated blocks) into a KV cache, and a bidirectional decoder denoises a fixed-size "canvas" of | |
| `canvas_length` tokens by cross-attending to that cache. Generation alternates an outer autoregressive loop over | |
| canvases with an inner denoising loop, where each step samples candidate tokens, commits the most confident ones | |
| via [BlockRefinementScheduler](/docs/diffusers/main/en/api/schedulers/block_refinement#diffusers.BlockRefinementScheduler) (uniform corruption mode, `mask_token_id=None`), and renoises the rest. | |
| The model is expected to be a `DiffusionGemmaForBlockDiffusion` instance exposing `forward(input_ids, | |
| decoder_input_ids=..., self_conditioning_logits=..., ...)` and returning logits of shape `[batch, canvas_length, | |
| vocab_size]` over the canvas. See the model card at https://huggingface.co/google/diffusiongemma-26B-A4B-it. | |
| __call__diffusers.DiffusionGemmaPipeline.__call__https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/diffusion_gemma/pipeline_diffusion_gemma.py#L163[{"name": "prompt", "val": ": str | list[str] | None = None"}, {"name": "messages", "val": ": list[dict] | None = None"}, {"name": "image", "val": ": Any | list[Any] | None = None"}, {"name": "add_generation_prompt", "val": ": bool = True"}, {"name": "gen_length", "val": ": int = 256"}, {"name": "num_inference_steps", "val": ": int = 48"}, {"name": "temperature", "val": ": float = 0.0"}, {"name": "cache_implementation", "val": ": str | None = None"}, {"name": "eos_early_stop", "val": ": bool = True"}, {"name": "eos_token_id", "val": ": int | None = None"}, {"name": "stability_threshold", "val": ": int = 1"}, {"name": "confidence_threshold", "val": ": float | None = 0.005"}, {"name": "generator", "val": ": torch.Generator | None = None"}, {"name": "output_type", "val": ": str = 'text'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "callback_on_step_end", "val": ": Callable[[Any, int, int, dict], dict] | PipelineCallback | MultiPipelineCallbacks | None = None"}, {"name": "callback_on_step_end_tensor_inputs", "val": ": list[str] | None = None"}]- **prompt** (`str` or `List[str]`, *optional*) -- | |
| Prompt text, wrapped in a chat template and tokenized by the processor. Provide either this or | |
| `messages`. | |
| - **messages** (`List[Dict]`, *optional*) -- | |
| A raw chat conversation to encode, e.g. `[{"role": "user", "content": "Hello"}]` or a multi-turn / | |
| multimodal conversation. Use this instead of `prompt` for anything beyond a single user turn. | |
| - **image** (`PIL.Image.Image` or `List`, *optional*) -- | |
| Image(s) to pair with `prompt` for multimodal generation; the processor turns them into the model's | |
| image inputs. For richer layouts, put the image content directly in `messages`. | |
| - **add_generation_prompt** (`bool`, defaults to `True`) -- | |
| Whether to add the generation prompt when applying the chat template. | |
| - **gen_length** (`int`, defaults to `256`) -- | |
| Number of tokens to generate, rounded up to a multiple of the model's `canvas_length`. | |
| - **num_inference_steps** (`int`, defaults to `48`) -- | |
| Number of denoising steps per canvas. | |
| - **temperature** (`float`, defaults to `0.0`) -- | |
| Sampling temperature for `DiscreteDDIMScheduler`/`BlockRefinementScheduler` (`0.0` is greedy); | |
| `EntropyBoundScheduler` ignores it and anneals its own temperature. Other sampling knobs (e.g. `top_k`, | |
| `threshold`, `t_min`/`t_max`) are scheduler config; set them on the scheduler, e.g. `pipe.scheduler = | |
| BlockRefinementScheduler.from_config(pipe.scheduler.config, top_k=...)`. | |
| - **cache_implementation** (`str`, *optional*) -- | |
| Set to `"static"` to prefill the encoder once per block into a persistent `StaticCache` and run the | |
| decoder against it with fixed shapes, instead of re-encoding the full sequence on every step. The fixed | |
| shapes also let you compile the decoder, e.g. `pipe.model.model.decoder = | |
| torch.compile(pipe.model.model.decoder, fullgraph=True)`. | |
| - **eos_early_stop** (`bool`, defaults to `True`) -- | |
| Whether to stop generating further canvases once every sequence has emitted EOS. | |
| - **eos_token_id** (`int`, *optional*) -- | |
| EOS token ID for early stopping. Falls back to the processor's tokenizer. | |
| - **stability_threshold** (`int`, defaults to `1`) -- | |
| Number of consecutive steps the argmax prediction must be unchanged for a block to count as stable. | |
| Only used when `confidence_threshold` is set. | |
| - **confidence_threshold** (`float`, *optional*, defaults to `0.005`) -- | |
| Leave a block's denoising loop early once every example is stable (see `stability_threshold`) and the | |
| mean per-token entropy of the prediction is below this value. Speeds up generation at matched quality; | |
| the default matches the released checkpoint. Set to `None` to always run all `num_inference_steps`. | |
| - **generator** (`torch.Generator`, *optional*) -- | |
| RNG for sampling. | |
| - **output_type** (`str`, defaults to `"text"`) -- | |
| `"text"` decodes sequences into strings (requires a processor); `"seq"` returns token IDs only. | |
| - **return_dict** (`bool`, defaults to `True`) -- | |
| Whether to return a [DiffusionGemmaPipelineOutput](/docs/diffusers/main/en/api/pipelines/diffusion_gemma#diffusers.DiffusionGemmaPipelineOutput) instead of a tuple. | |
| - **callback_on_step_end** (`Callable` or `PipelineCallback`, *optional*) -- | |
| Callback run after each denoising step with signature `callback_on_step_end(self, step, timestep, | |
| callback_kwargs)`. Allowed tensor keys: `canvas`, `logits`. | |
| - **callback_on_step_end_tensor_inputs** (`List[str]`, *optional*) -- | |
| Tensor keys to pass to the callback.0[DiffusionGemmaPipelineOutput](/docs/diffusers/main/en/api/pipelines/diffusion_gemma#diffusers.DiffusionGemmaPipelineOutput) or `tuple`The generated token IDs (`sequences`) and, for `output_type="text"`, the decoded `texts`. | |
| Generate text with block diffusion. | |
| Examples: | |
| ```python | |
| >>> import torch | |
| >>> from transformers import AutoProcessor, DiffusionGemmaForBlockDiffusion | |
| >>> from diffusers import BlockRefinementScheduler, DiffusionGemmaPipeline | |
| >>> model_id = "google/diffusiongemma-26B-A4B-it" | |
| >>> model = DiffusionGemmaForBlockDiffusion.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto") | |
| >>> processor = AutoProcessor.from_pretrained(model_id) | |
| >>> scheduler = BlockRefinementScheduler() | |
| >>> pipe = DiffusionGemmaPipeline(model=model, scheduler=scheduler, processor=processor) | |
| >>> output = pipe(prompt="Why is the sky blue?", gen_length=256) | |
| >>> print(output.texts[0]) | |
| ``` | |
| **Parameters:** | |
| model ([DiffusionGemmaForBlockDiffusion](https://huggingface.co/docs/transformers/main/en/model_doc/diffusion_gemma#transformers.DiffusionGemmaForBlockDiffusion)) : The block-diffusion denoiser (causal encoder + bidirectional decoder with tied weights). | |
| scheduler ([BlockRefinementScheduler](/docs/diffusers/main/en/api/schedulers/block_refinement#diffusers.BlockRefinementScheduler), `DiscreteDDIMScheduler` or `EntropyBoundScheduler`) : The sampler that commits and renoises canvas tokens each denoising step. | |
| processor ([ProcessorMixin](https://huggingface.co/docs/transformers/main/en/main_classes/processors#transformers.ProcessorMixin)) : The processor used to apply the chat template and decode the generated tokens. | |
| **Returns:** | |
| `[DiffusionGemmaPipelineOutput](/docs/diffusers/main/en/api/pipelines/diffusion_gemma#diffusers.DiffusionGemmaPipelineOutput) or `tuple`` | |
| The generated token IDs (`sequences`) and, for `output_type="text"`, the decoded `texts`. | |
| ## DiffusionGemmaPipelineOutput[[diffusers.DiffusionGemmaPipelineOutput]] | |
| #### diffusers.DiffusionGemmaPipelineOutput[[diffusers.DiffusionGemmaPipelineOutput]] | |
| [Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/diffusion_gemma/pipeline_output.py#L25) | |
| Output class for DiffusionGemma block-diffusion generation. | |
| **Parameters:** | |
| sequences (`torch.LongTensor` of shape `(batch_size, gen_length)`) : The generated token IDs (the prompt is stripped off). | |
| texts (`list[str]`, *optional*) : The decoded text, one string per sequence. Only set for `output_type="text"`. | |
Xet Storage Details
- Size:
- 16.9 kB
- Xet hash:
- c14d763f392e81e107e3a5627566e8605c4d46e1fa751fa27550bd4a829a2eb4
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.