# Flux2
Flux.2 is the latest series of image generation models from Black Forest Labs and the successor to the [Flux.1](./flux) series. It is an entirely new model with a new architecture, pre-trained from scratch!
Original model checkpoints for Flux.2 can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux2).
> [!TIP]
> Flux2 can be quite expensive to run on consumer hardware. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux.2 can benefit from quantization for memory efficiency, with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more.
>
> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.
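One such optimization is model CPU offloading. The sketch below assumes `Flux2Pipeline` supports `enable_model_cpu_offload()`, a standard diffusers pipeline method; exact support may vary by release:

```py
>>> import torch
>>> from diffusers import Flux2Pipeline

>>> pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
>>> # Keep only the sub-model currently in use on the GPU; slower, but much lower peak memory.
>>> pipe.enable_model_cpu_offload()
>>> image = pipe("A cat holding a sign that says hello world").images[0]
>>> image.save("flux2_offloaded.png")
```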
## Caption upsampling
Flux.2 can potentially generate better outputs with better prompts. We can "upsample"
an input prompt by setting the `caption_upsample_temperature` argument in the pipeline call arguments.
The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a5d316b1b42f6b59a8c9194b77c8256be848432/src/flux2/text_encoder.py#L140) recommends setting this value to 0.15.
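For example (a sketch following the pipeline's text-to-image example; the exact call arguments may vary slightly across releases):

```py
>>> import torch
>>> from diffusers import Flux2Pipeline

>>> pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # 0.15 is the temperature recommended by the official implementation.
>>> image = pipe(
...     "A cat holding a sign that says hello world",
...     caption_upsample_temperature=0.15,
... ).images[0]
>>> image.save("flux2_upsampled.png")
```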
## Flux2Pipeline[[diffusers.Flux2Pipeline]]
#### diffusers.Flux2Pipeline[[diffusers.Flux2Pipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/flux2/pipeline_flux2.py#L251)
The Flux2 pipeline for text-to-image generation.
Reference: [https://bfl.ai/blog/flux-2](https://bfl.ai/blog/flux-2)
#### \_\_call\_\_[[diffusers.Flux2Pipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/flux2/pipeline_flux2.py#L743)
`( image = None, prompt = None, height = None, width = None, num_inference_steps = 50, sigmas = None, guidance_scale = 4.0, num_images_per_prompt = 1, generator = None, latents = None, prompt_embeds = None, output_type = 'pil', return_dict = True, attention_kwargs = None, callback_on_step_end = None, callback_on_step_end_tensor_inputs = ['latents'], max_sequence_length = 512, text_encoder_out_layers = (10, 20, 30), caption_upsample_temperature = None )`
- **image** (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`) --
`Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
numpy arrays and PyTorch tensors, the expected value range is between `[0, 1]`. If it's a tensor or a list
of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image
latents as `image`, but if passing latents directly they are not encoded again.
- **prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
instead.
- **guidance_scale** (`float`, *optional*, defaults to 4.0) --
Embedded guidance scale is enabled by setting `guidance_scale` > 1. A higher `guidance_scale` encourages
the model to generate images more closely aligned with `prompt`, at the expense of lower image quality.
Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
the [paper](https://huggingface.co/papers/2210.03142) to learn more.
- **height** (`int`, *optional*, defaults to 1024) --
The height in pixels of the generated image. This is set to 1024 by default for the best results.
- **width** (`int`, *optional*, defaults to 1024) --
The width in pixels of the generated image. This is set to 1024 by default for the best results.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **sigmas** (`List[float]`, *optional*) --
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
will be used.
- **num_images_per_prompt** (`int`, *optional*, defaults to 1) --
The number of images to generate per prompt.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
- **latents** (`torch.Tensor`, *optional*) --
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
- **prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/) (`PIL.Image.Image`) or `np.array`.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
Whether or not to return a `~pipelines.flux2.Flux2PipelineOutput` instead of a plain tuple.
- **attention_kwargs** (`dict`, *optional*) --
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
- **callback_on_step_end** (`Callable`, *optional*) --
A function that is called at the end of each denoising step during inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
- **callback_on_step_end_tensor_inputs** (`List`, *optional*) --
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
- **max_sequence_length** (`int`, *optional*, defaults to 512) -- Maximum sequence length to use with the `prompt`.
- **text_encoder_out_layers** (`Tuple[int]`, *optional*, defaults to `(10, 20, 30)`) --
Layer indices to use in the `text_encoder` to derive the final prompt embeddings.
- **caption_upsample_temperature** (`float`, *optional*) --
When specified, we will try to perform caption upsampling for potentially improved outputs. We
recommend setting it to 0.15 if caption upsampling is to be performed.

**Returns:**
`~pipelines.flux2.Flux2PipelineOutput` or `tuple`:
`~pipelines.flux2.Flux2PipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a
tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
```py
>>> import torch
>>> from diffusers import Flux2Pipeline
>>> pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> image = pipe(prompt, num_inference_steps=50, guidance_scale=2.5).images[0]
>>> image.save("flux.png")
```
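The `generator` argument is what makes runs reproducible: the same seed produces the same initial latent noise, and hence the same image for a fixed prompt, step count, and guidance scale. A minimal pure-PyTorch sketch of that property (independent of the pipeline itself; the helper name is illustrative):

```python
import torch

def seeded_noise(seed: int, shape: tuple) -> torch.Tensor:
    """Draw initial latent noise from a seeded generator."""
    gen = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(shape, generator=gen)

# The same seed always yields identical noise...
a = seeded_noise(42, (1, 16, 64, 64))
b = seeded_noise(42, (1, 16, 64, 64))
assert torch.equal(a, b)

# ...so passing `generator=torch.Generator("cpu").manual_seed(42)` to the
# pipeline call reproduces the same image across runs.
```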
**Parameters:**
transformer ([Flux2Transformer2DModel](/docs/diffusers/pr_11739/en/api/models/flux2_transformer#diffusers.Flux2Transformer2DModel)) : Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
scheduler ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_11739/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) : A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae (`AutoencoderKLFlux2`) : Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder (`Mistral3ForConditionalGeneration`) : [Mistral3ForConditionalGeneration](https://huggingface.co/docs/transformers/en/model_doc/mistral3#transformers.Mistral3ForConditionalGeneration)
tokenizer (`AutoProcessor`) : Tokenizer of class [PixtralProcessor](https://huggingface.co/docs/transformers/en/model_doc/pixtral#transformers.PixtralProcessor).
**Returns:**
`~pipelines.flux2.Flux2PipelineOutput` or `tuple`:
`~pipelines.flux2.Flux2PipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a
tuple, the first element is a list with the generated images.
