Ideogram4Transformer2DModel

A transformer for image-like data from Ideogram 4.

class diffusers.Ideogram4Transformer2DModel

The flow-matching transformer backbone used by the Ideogram 4 pipeline.

The transformer operates on a single packed sequence containing both text-conditioning tokens (produced by a multimodal text encoder) and the patchified image latents. Per-token indicators distinguish the two roles, and a block-diagonal attention mask derived from segment_ids restricts each sample to attend only to itself within a packed batch.
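The block-diagonal masking described above can be sketched in a few lines of PyTorch. This is only an illustration, assuming a boolean mask in which True means "may attend"; the helper name `block_diagonal_mask` is hypothetical, and the model's internal implementation may differ.

```python
import torch


def block_diagonal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: True where query and key share a segment id.

    segment_ids: (batch_size, sequence_length) integer tensor.
    Returns a (batch_size, sequence_length, sequence_length) boolean mask.
    """
    # Broadcast queries (dim 1) against keys (dim 2) and compare segment ids.
    return segment_ids[:, :, None] == segment_ids[:, None, :]


# One packed row holding two samples: 3 tokens of sample 0, 2 tokens of sample 1.
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
mask = block_diagonal_mask(segment_ids)
print(mask[0].int())
```

A boolean mask of this form can be given a head dimension (`mask[:, None]`) and passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where True marks positions that take part in attention.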

forward

diffusers.Ideogram4Transformer2DModel.forward(hidden_states: Tensor, timestep: Tensor, encoder_hidden_states: Tensor, position_ids: Tensor, segment_ids: Tensor, indicator: Tensor, attention_kwargs: dict | None = None, return_dict: bool = True)

Source: https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/models/transformers/transformer_ideogram4.py#L373

  • hidden_states (torch.Tensor of shape (batch_size, sequence_length, in_channels)) -- Packed sequence of patchified noisy image tokens. Non-image positions are masked out internally.
  • timestep (torch.Tensor of shape (batch_size,) or (batch_size, sequence_length)) -- Flow-matching time in [0, 1] (0 is pure noise, 1 is clean data).
  • encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, llm_features_dim)) -- Per-token text conditioning features. Non-text positions are masked out internally.
  • position_ids (torch.Tensor of shape (batch_size, sequence_length, 3)) -- (t, h, w) coordinates consumed by the multi-axis RoPE.
  • segment_ids (torch.Tensor of shape (batch_size, sequence_length)) -- Per-token sample id within a packed batch. Positions sharing a segment_id attend to each other.
  • indicator (torch.Tensor of shape (batch_size, sequence_length)) -- Per-token role: LLM_TOKEN_INDICATOR (text) or OUTPUT_IMAGE_INDICATOR (image).
  • attention_kwargs (dict, optional) -- A kwargs dictionary passed along to the attention processor. A "scale" entry scales the LoRA weights (when the PEFT backend is active).
  • return_dict (bool, optional, defaults to True) -- Whether to return a Transformer2DModelOutput instead of a plain tuple.
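To make the per-token arguments concrete, here is a rough sketch of how `indicator`, `segment_ids`, and `position_ids` might line up for a single sample made of a few text tokens followed by a small image grid. Everything in it is an assumption for illustration: the helper `build_layout` is hypothetical, the two indicator values are placeholders for the library's real constants, and the position-id scheme (text tokens reuse their index on all three axes, image tokens sit at t = 0) may not match what the pipeline actually does.

```python
# Placeholder role values for illustration only; the library defines the
# real LLM_TOKEN_INDICATOR and OUTPUT_IMAGE_INDICATOR constants.
LLM_TOKEN_INDICATOR = 0      # assumed value
OUTPUT_IMAGE_INDICATOR = 1   # assumed value


def build_layout(num_text_tokens, grid_h, grid_w, segment_id=0):
    """Hypothetical helper: per-token metadata for one text + image sample."""
    indicator, segment_ids, position_ids = [], [], []
    # Text tokens come first. Assumption: each text token uses its index
    # on all three (t, h, w) axes.
    for i in range(num_text_tokens):
        indicator.append(LLM_TOKEN_INDICATOR)
        segment_ids.append(segment_id)
        position_ids.append((i, i, i))
    # Image tokens follow in row-major order. Assumption: one frame at t = 0.
    for h in range(grid_h):
        for w in range(grid_w):
            indicator.append(OUTPUT_IMAGE_INDICATOR)
            segment_ids.append(segment_id)
            position_ids.append((0, h, w))
    return indicator, segment_ids, position_ids


indicator, segment_ids, position_ids = build_layout(
    num_text_tokens=3, grid_h=2, grid_w=2
)
print(len(indicator), position_ids[-1])
```

Wrapping each list in `torch.tensor([...])` would give tensors of shape (1, 7), (1, 7), and (1, 7, 3), matching the documented shapes for a batch of one.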

Predict the flow-matching velocity for the image-token positions of the packed sequence.
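Because the time convention runs from 0 (pure noise) to 1 (clean data), a sampler integrates the predicted velocity forward in time. The sketch below uses a plain Euler loop with a toy velocity function standing in for the model. It assumes a straight-line path between noise and data and is not the pipeline's scheduler; `euler_sample` and `toy_velocity` are hypothetical names.

```python
def euler_sample(velocity_fn, x, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 (data)."""
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the model. Assumption: on the straight path
# x_t = (1 - t) * noise + t * data, the velocity is (data - x_t) / (1 - t).
data = [1.0, -2.0, 0.5]
noise = [0.3, 0.1, -0.7]


def toy_velocity(x, t):
    return [(d - xi) / (1.0 - t) for d, xi in zip(data, x)]


sample = euler_sample(toy_velocity, noise, num_steps=8)
print(sample)
```

With this toy velocity the loop lands on `data` up to floating-point error. A real model only approximates the velocity, so the number of steps and the choice of solver affect quality.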

Parameters (constructor):

in_channels (int, defaults to 128) : Latent channel count after patchification (ae_channels * patch_size ** 2).

num_layers (int, defaults to 34) : Number of transformer blocks.

attention_head_dim (int, defaults to 256) : Dimension of each attention head; the total hidden size is attention_head_dim * num_attention_heads.

num_attention_heads (int, defaults to 18) : Number of attention heads.

intermediate_size (int, defaults to 12288) : Feed-forward hidden size used by the SwiGLU MLP inside each block.

adaln_dim (int, defaults to 512) : Dimensionality of the conditioning vector consumed by the AdaLN modulations.

llm_features_dim (int, defaults to 53248) : Dimensionality of the per-token text features fed into the model (typically a concatenation of hidden states from several layers of the text encoder).

rope_theta (int, defaults to 5_000_000) : Base used by the multi-axis rotary position embedding.

mrope_section (tuple[int, int, int], defaults to (24, 20, 20)) : Number of frequencies allocated to each of the (t, h, w) axes of MRoPE.

norm_eps (float, defaults to 1e-5) : Epsilon used by the RMSNorm modules inside the transformer blocks.
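A few quantities follow directly from the defaults listed above. The snippet below is plain arithmetic on those documented values and does not call the library.

```python
# Default values copied from the parameter list above.
config = {
    "attention_head_dim": 256,
    "num_attention_heads": 18,
    "intermediate_size": 12288,
    "mrope_section": (24, 20, 20),
}

# Total hidden size, as stated in the attention_head_dim description.
hidden_size = config["attention_head_dim"] * config["num_attention_heads"]

# Ratio of the feed-forward hidden size to the model hidden size.
ffn_ratio = config["intermediate_size"] / hidden_size

# Total number of rotary frequencies across the (t, h, w) axes.
mrope_total = sum(config["mrope_section"])

print(hidden_size, round(ffn_ratio, 3), mrope_total)
```

With the defaults, the hidden size is 4608, the feed-forward layer is 8/3 times as wide as the hidden size, and the three MRoPE sections total 64 frequencies.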

Returns (forward):

Transformer2DModelOutput or a tuple whose first element is a tensor of shape (batch_size, sequence_length, in_channels) in the model's compute dtype. Only positions tagged with OUTPUT_IMAGE_INDICATOR carry meaningful velocity predictions.
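Since only image positions carry meaningful predictions, a caller will typically filter the output with the same `indicator` tensor that was passed in. The sketch below uses a random tensor as a stand-in for the model output and a placeholder value for `OUTPUT_IMAGE_INDICATOR`; both are assumptions for illustration.

```python
import torch

OUTPUT_IMAGE_INDICATOR = 1  # assumed placeholder; use the library's constant

batch_size, sequence_length, in_channels = 1, 5, 4
# Stand-in for the tensor returned by the model.
sample = torch.randn(batch_size, sequence_length, in_channels)
# Two text tokens followed by three image tokens.
indicator = torch.tensor([[0, 0, 1, 1, 1]])

# Boolean mask over (batch_size, sequence_length), True at image positions.
image_mask = indicator == OUTPUT_IMAGE_INDICATOR
# Boolean indexing flattens the selected positions: (num_image_tokens, in_channels).
velocity = sample[image_mask]
print(velocity.shape)
```

With `return_dict=True` the tensor is available as the `sample` attribute of the returned `Transformer2DModelOutput`; with `return_dict=False` it is the first element of the tuple.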
