# Flux2
Flux.2 is the latest series of image generation models from Black Forest Labs and the successor to the [Flux.1](./flux) series. It is an entirely new model with a new architecture, pre-trained from scratch!
Original model checkpoints for Flux.2 can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux2).
> [!TIP]
> Flux2 can be quite expensive to run on consumer hardware. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux.2 can benefit from quantization for memory efficiency, with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more.
>
> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.
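One such optimization is model CPU offloading. The sketch below assumes `Flux2Pipeline` supports `enable_model_cpu_offload()`, a standard diffusers pipeline method; exact support may vary by release:

```py
>>> import torch
>>> from diffusers import Flux2Pipeline

>>> pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
>>> # Keep only the sub-model currently in use on the GPU; slower, but much lower peak memory.
>>> pipe.enable_model_cpu_offload()
>>> image = pipe("A cat holding a sign that says hello world").images[0]
>>> image.save("flux2_offloaded.png")
```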
## Caption upsampling
Flux.2 can potentially generate better outputs with better prompts. We can "upsample"
an input prompt by setting the `caption_upsample_temperature` argument in the pipeline call arguments.
The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a5d316b1b42f6b59a8c9194b77c8256be848432/src/flux2/text_encoder.py#L140) recommends setting this value to 0.15.
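For example (a sketch following the pipeline's text-to-image example; the exact call arguments may vary slightly across releases):

```py
>>> import torch
>>> from diffusers import Flux2Pipeline

>>> pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # 0.15 is the temperature recommended by the official implementation.
>>> image = pipe(
...     "A cat holding a sign that says hello world",
...     caption_upsample_temperature=0.15,
... ).images[0]
>>> image.save("flux2_upsampled.png")
```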
## Flux2Pipeline[[diffusers.Flux2Pipeline]]
#### diffusers.Flux2Pipeline[[diffusers.Flux2Pipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/flux2/pipeline_flux2.py#L251)
The Flux2 pipeline for text-to-image generation.
Reference: [https://bfl.ai/blog/flux-2](https://bfl.ai/blog/flux-2)
#### \_\_call\_\_[[diffusers.Flux2Pipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/flux2/pipeline_flux2.py#L743)
`( image = None, prompt = None, height = None, width = None, num_inference_steps = 50, sigmas = None, guidance_scale = 4.0, num_images_per_prompt = 1, generator = None, latents = None, prompt_embeds = None, output_type = 'pil', return_dict = True, attention_kwargs = None, callback_on_step_end = None, callback_on_step_end_tensor_inputs = ['latents'], max_sequence_length = 512, text_encoder_out_layers = (10, 20, 30), caption_upsample_temperature = None )`
- **image** (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`) --
`Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
numpy arrays and PyTorch tensors, the expected value range is between `[0, 1]`. If it's a tensor or a list
of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image
latents as `image`, but if passing latents directly they are not encoded again.
- **prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
instead.
- **guidance_scale** (`float`, *optional*, defaults to 4.0) --
Embedded guidance scale is enabled by setting `guidance_scale` > 1. A higher `guidance_scale` encourages
the model to generate images more closely aligned with `prompt`, at the expense of lower image quality.
Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
the [paper](https://huggingface.co/papers/2210.03142) to learn more.
- **height** (`int`, *optional*, defaults to 1024) --
The height in pixels of the generated image. This is set to 1024 by default for the best results.
- **width** (`int`, *optional*, defaults to 1024) --
The width in pixels of the generated image. This is set to 1024 by default for the best results.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **sigmas** (`List[float]`, *optional*) --
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
will be used.
- **num_images_per_prompt** (`int`, *optional*, defaults to 1) --
The number of images to generate per prompt.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
- **latents** (`torch.Tensor`, *optional*) --
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
- **prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/) (`PIL.Image.Image`) or `np.array`.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
Whether or not to return a `~pipelines.flux2.Flux2PipelineOutput` instead of a plain tuple.
- **attention_kwargs** (`dict`, *optional*) --
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
- **callback_on_step_end** (`Callable`, *optional*) --
A function that is called at the end of each denoising step during inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
- **callback_on_step_end_tensor_inputs** (`List`, *optional*) --
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
- **max_sequence_length** (`int`, *optional*, defaults to 512) -- Maximum sequence length to use with the `prompt`.
- **text_encoder_out_layers** (`Tuple[int]`, *optional*, defaults to `(10, 20, 30)`) --
Layer indices to use in the `text_encoder` to derive the final prompt embeddings.
- **caption_upsample_temperature** (`float`, *optional*) --
When specified, we will try to perform caption upsampling for potentially improved outputs. We
recommend setting it to 0.15 if caption upsampling is to be performed.

**Returns:**
`~pipelines.flux2.Flux2PipelineOutput` or `tuple`:
`~pipelines.flux2.Flux2PipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a
tuple, the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
```py
>>> import torch
>>> from diffusers import Flux2Pipeline
>>> pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> image = pipe(prompt, num_inference_steps=50, guidance_scale=2.5).images[0]
>>> image.save("flux.png")
```
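The `generator` argument is what makes runs reproducible: the same seed produces the same initial latent noise, and hence the same image for a fixed prompt, step count, and guidance scale. A minimal pure-PyTorch sketch of that property (independent of the pipeline itself; the helper name is illustrative):

```python
import torch

def seeded_noise(seed: int, shape: tuple) -> torch.Tensor:
    """Draw initial latent noise from a seeded generator."""
    gen = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(shape, generator=gen)

# The same seed always yields identical noise...
a = seeded_noise(42, (1, 16, 64, 64))
b = seeded_noise(42, (1, 16, 64, 64))
assert torch.equal(a, b)

# ...so passing `generator=torch.Generator("cpu").manual_seed(42)` to the
# pipeline call reproduces the same image across runs.
```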
**Parameters:**
transformer ([Flux2Transformer2DModel](/docs/diffusers/pr_11739/en/api/models/flux2_transformer#diffusers.Flux2Transformer2DModel)) : Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
scheduler ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_11739/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) : A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae (`AutoencoderKLFlux2`) : Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder (`Mistral3ForConditionalGeneration`) : [Mistral3ForConditionalGeneration](https://huggingface.co/docs/transformers/en/model_doc/mistral3#transformers.Mistral3ForConditionalGeneration)
tokenizer (`AutoProcessor`) : Tokenizer of class [PixtralProcessor](https://huggingface.co/docs/transformers/en/model_doc/pixtral#transformers.PixtralProcessor).
**Returns:**
`~pipelines.flux2.Flux2PipelineOutput` or `tuple`:
`~pipelines.flux2.Flux2PipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a
tuple, the first element is a list with the generated images.
