Ideogram 4
Ideogram 4 is a flow-matching text-to-image model that uses a multimodal text encoder and an asymmetric
classifier-free guidance scheme: a dedicated unconditional_transformer produces the negative branch with zeroed text
features, while the main transformer consumes the full packed text + image sequence.
The pipeline defaults are the recommended settings for best quality, so a plain pipe(prompt) call needs no tuning:
48 flow-matching steps on a logit-normal schedule (mu=0.0, std=1.5), with classifier-free guidance held at 7.0 for
the first 45 steps and dropped to 3.0 for the final 3 "polish" steps.
Key inference-time knobs are exposed via the pipeline call:
- num_inference_steps, mu, and std control the resolution-aware logit-normal flow-matching schedule.
- guidance_scale (or a full per-step guidance_schedule) blends the conditional and unconditional velocities.
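A per-step guidance_schedule must have exactly num_inference_steps entries, so changing the step count means supplying a schedule of matching length. The sketch below builds one in the shape of the default (a main scale followed by a few lower "polish" steps). build_guidance_schedule is a hypothetical helper, not part of diffusers, and the scale values simply mirror the documented defaults.

```python
def build_guidance_schedule(num_inference_steps=48, main_scale=7.0,
                            polish_scale=3.0, polish_steps=3):
    """Return a per-step guidance list: main_scale first, polish_scale last.

    Hypothetical helper (not a diffusers API). The first entry corresponds
    to the first denoising step, i.e. the largest noise level.
    """
    if not 0 <= polish_steps <= num_inference_steps:
        raise ValueError("polish_steps must be between 0 and num_inference_steps")
    main_steps = num_inference_steps - polish_steps
    return [main_scale] * main_steps + [polish_scale] * polish_steps


# Same shape as the documented default: 45 entries of 7.0, then 3 of 3.0.
default_like = build_guidance_schedule()

# A shorter run needs a schedule of the same (shorter) length.
short = build_guidance_schedule(num_inference_steps=24)
```

The result would be passed as pipe(prompt, num_inference_steps=24, guidance_schedule=short). Whether 7.0 and 3.0 remain good choices at other step counts is not stated here.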
Text-to-image
import torch
from diffusers import Ideogram4Pipeline
pipe = Ideogram4Pipeline.from_pretrained("ideogram-ai/ideogram-v4", torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "A photo of a cat holding a sign that says hello world"
# The defaults are the recommended settings for best quality.
image = pipe(prompt, height=1024, width=1024, generator=torch.Generator("cuda").manual_seed(0)).images[0]
image.save("ideogram4.png")
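To replace the recommended schedule with a single constant guidance scale, the parameter documentation says to pass guidance_scale together with guidance_schedule=None. A sketch of that call follows, wrapped in a hypothetical helper named generate_constant_cfg that takes an already-loaded pipeline. The scale of 5.0 is an arbitrary illustration, not a recommended value.

```python
def generate_constant_cfg(pipe, prompt, scale=5.0, **kwargs):
    """Run the pipeline with one guidance scale applied at every step.

    Hypothetical wrapper: `pipe` is assumed to be a loaded Ideogram4Pipeline.
    guidance_schedule=None switches off the default per-step schedule,
    because guidance_scale and guidance_schedule are mutually exclusive.
    """
    result = pipe(prompt, guidance_scale=scale, guidance_schedule=None, **kwargs)
    return result.images[0]
```

Extra keyword arguments such as height, width, or generator are forwarded to the pipeline unchanged.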
Ideogram4Pipeline
diffusers.Ideogram4Pipeline
Text-to-image pipeline for Ideogram4.
Ideogram4 is a flow-matching model trained with asymmetric classifier-free guidance: a transformer consumes
text-conditioned features alongside the image latents, while a separate unconditional_transformer denoises with
zeroed text features. The two velocity predictions are linearly blended each step.
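The linear blend can be illustrated on plain Python lists. This is a minimal sketch of the formula given in the guidance_scale parameter description, v = guidance_scale * v_pos + (1 - guidance_scale) * v_neg. The function name blend_velocities and the sample values are hypothetical, and the pipeline itself operates on tensors rather than lists.

```python
def blend_velocities(v_pos, v_neg, guidance_scale):
    """Blend conditional (v_pos) and unconditional (v_neg) velocities.

    Computes v = g * v_pos + (1 - g) * v_neg element-wise.
    """
    return [
        guidance_scale * p + (1.0 - guidance_scale) * n
        for p, n in zip(v_pos, v_neg)
    ]


v_pos = [1.0, 2.0]   # toy conditional prediction
v_neg = [0.5, 1.0]   # toy unconditional prediction

only_cond = blend_velocities(v_pos, v_neg, 1.0)  # g = 1 keeps v_pos as-is
guided = blend_velocities(v_pos, v_neg, 7.0)     # g > 1 extrapolates away from v_neg
```

With a scale of 1.0 the unconditional branch has no effect; larger scales push the result further from the unconditional prediction.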
__call__
diffusers.Ideogram4Pipeline.__call__
Source: https://github.com/huggingface/diffusers/blob/vr_13863/src/diffusers/pipelines/ideogram4/pipeline_ideogram4.py#L407
__call__(
    prompt: str | list[str] | None = None,
    height: int = 2048,
    width: int = 2048,
    num_inference_steps: int = 48,
    guidance_scale: float | None = None,
    guidance_schedule: list[float] | torch.Tensor | None = (7.0,) * 45 + (3.0,) * 3,
    mu: float = 0.0,
    std: float = 1.5,
    max_sequence_length: int = 2048,
    num_images_per_prompt: int = 1,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    output_type: str = 'pil',
    return_dict: bool = True,
    callback_on_step_end: Callable[[Ideogram4Pipeline, int, int, dict[str, Any]], dict[str, Any]] | None = None,
    callback_on_step_end_tensor_inputs: list = ['latents'],
)
Parameters:
- prompt (str or list[str]) -- Prompt(s) to guide image generation.
- height (int, optional, defaults to 2048) -- Output image height in pixels; must be a multiple of vae_scale_factor * patch_size.
- width (int, optional, defaults to 2048) -- Output image width in pixels; must be a multiple of vae_scale_factor * patch_size.
- num_inference_steps (int, optional, defaults to 48) -- Number of flow-matching steps. The default is the recommended setting for best quality.
- guidance_scale (float, optional) -- Constant classifier-free guidance scale applied at every step. The conditional and unconditional velocity predictions are blended as v = guidance_scale * v_pos + (1 - guidance_scale) * v_neg. Mutually exclusive with guidance_schedule (setting both raises an error). Defaults to None.
- guidance_schedule (list[float] or torch.Tensor, optional) -- Per-step guidance scale schedule; must have length num_inference_steps. The first entry corresponds to the first step (largest noise level). Mutually exclusive with guidance_scale; exactly one must be set. Defaults to the recommended schedule (7.0 for the main steps, dropping to 3.0 for the final 3 "polish" steps). To use a constant scale instead, pass guidance_scale and guidance_schedule=None.
- mu (float, optional, defaults to 0.0) -- Base mean of the logit-normal flow-matching schedule. The schedule mean is shifted by half the log of the resolution ratio relative to 512x512.
- std (float, optional, defaults to 1.5) -- Standard deviation of the logit-normal flow-matching schedule.
- max_sequence_length (int, optional, defaults to 2048) -- Maximum number of text tokens per prompt.
- num_images_per_prompt (int, optional, defaults to 1) -- Number of images to generate per prompt.
- generator (torch.Generator or list[torch.Generator], optional) -- Generator(s) used to make sampling deterministic.
- latents (torch.Tensor, optional) -- Pre-generated noise of shape (batch_size, num_image_tokens, latent_dim).
- output_type (str, optional, defaults to "pil") -- One of "pil", "np", "pt", or "latent".
- return_dict (bool, optional, defaults to True) -- Whether to return an Ideogram4PipelineOutput.
- callback_on_step_end (Callable, optional) -- Callback invoked at the end of every denoising step.
- callback_on_step_end_tensor_inputs (list[str], optional) -- Names of tensors to expose to the callback via callback_kwargs.
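The mu and std descriptions suggest how a resolution-aware logit-normal schedule could be built. The sketch below is an assumption, not the pipeline's implementation: it reads "resolution ratio" as the pixel-count ratio relative to 512x512, and it places the noise levels at evenly spaced quantiles of the logit-normal distribution. The names shifted_mu and logit_normal_sigmas are hypothetical.

```python
import math
from statistics import NormalDist


def shifted_mu(height, width, mu=0.0, base=512):
    """Shift mu by half the log of the pixel-count ratio to base x base.

    Assumption: "resolution ratio" means (height * width) / (base * base).
    """
    return mu + 0.5 * math.log((height * width) / (base * base))


def logit_normal_sigmas(num_steps, height, width, mu=0.0, std=1.5):
    """Noise levels in (0, 1), largest first, from logit-normal quantiles.

    Assumed layout, not taken from the pipeline source.
    """
    dist = NormalDist(mu=shifted_mu(height, width, mu), sigma=std)
    # Midpoint quantiles stay away from 0 and 1, where inv_cdf diverges.
    quantiles = [(i + 0.5) / num_steps for i in range(num_steps)]
    logits = [dist.inv_cdf(q) for q in reversed(quantiles)]
    # The sigmoid maps each normal sample to a noise level in (0, 1).
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


sigmas = logit_normal_sigmas(48, 2048, 2048)
```

Under these assumptions a 512x512 image leaves mu unchanged, while larger images raise the mean and so place more of the steps at high noise levels.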
Run text-to-image generation.
Examples:
>>> import torch
>>> from diffusers import Ideogram4Pipeline
>>> pipe = Ideogram4Pipeline.from_pretrained("ideogram-ai/ideogram-v4", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A photo of a cat holding a sign that says hello world"
>>> # The defaults are the recommended settings for best quality.
>>> image = pipe(prompt, height=2048, width=2048, generator=torch.Generator("cuda").manual_seed(0)).images[0]
>>> image.save("ideogram4.png")
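The callback_on_step_end signature takes the pipeline, two integers, and a dict of tensors, and returns a dict. Below is a minimal sketch of a callback that records progress. It assumes, following the usual diffusers convention rather than anything confirmed on this page, that the two integers are the step index and the timestep. The names record_step and seen_steps are hypothetical.

```python
seen_steps = []


def record_step(pipe, step, timestep, callback_kwargs):
    """Record each step index and hand the tensors back unchanged.

    callback_kwargs holds the tensors named in
    callback_on_step_end_tensor_inputs (by default only "latents").
    """
    seen_steps.append(step)
    return callback_kwargs


# Intended use with a loaded pipeline (not executed in this sketch):
# image = pipe(prompt, callback_on_step_end=record_step).images[0]
```

Returning the dict unchanged leaves the denoising loop unaffected; a callback that wants to alter the run would return modified tensors instead.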
Parameters:
scheduler (FlowMatchEulerDiscreteScheduler) : Flow-matching scheduler. The pipeline overrides the default sigma schedule with a resolution-aware logit-normal schedule.
vae (AutoencoderKLFlux2) : Variational auto-encoder used to decode latents back into images.
text_encoder (PreTrainedModel) : Multimodal text encoder. The pipeline consumes hidden states from a fixed set of intermediate decoder layers (see QWEN3_VL_ACTIVATION_LAYERS).
tokenizer (AutoTokenizer) : Tokenizer paired with text_encoder.
transformer (Ideogram4Transformer2DModel) : Conditional flow-matching transformer.
unconditional_transformer (Ideogram4Transformer2DModel) : Unconditional (asymmetric-CFG) flow-matching transformer.
Returns:
Ideogram4PipelineOutput or tuple.
encode_prompt
Prepare the conditioning for the packed text+image sequence (one entry per prompt).
Returns a flat tuple (prompt_embeds, position_ids, segment_ids, indicator). The unconditional branch carries
no text, so the pipeline builds its (zeroed) inputs directly rather than encoding a negative prompt.
Ideogram4PipelineOutput
diffusers.pipelines.ideogram4.Ideogram4PipelineOutput
Output class for the Ideogram 4 pipeline.
Parameters:
images (list[PIL.Image.Image] or np.ndarray) : List of denoised PIL images of length batch_size, or numpy array of shape (batch_size, height, width, num_channels).