| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. --> | |
| <div style="float: right;"> | |
| <div class="flex flex-wrap space-x-1"> | |
| <a href="https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference" target="_blank" rel="noopener"> | |
| <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/> | |
| </a> | |
| <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white%22"> | |
| </div> | |
| </div> | |
| # LTX-Video | |
| [LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast, real-time generation of high-resolution videos from text and images. Its main feature is the Video-VAE, which has a higher pixel-to-latent compression ratio (1:192) that enables more efficient video data processing and faster generation. To prevent finer details from being lost during generation, the Video-VAE decoder performs the latent-to-pixel conversion *and* the last denoising step. | |
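As a rough check of the 1:192 figure, the sketch below assumes the Video-VAE maps each 32×32×8 block of RGB pixels (height × width × frames) to one 128-channel latent vector. These factors are not stated on this page; treat them as assumptions and verify them against the configuration of the checkpoint you use. The helper names are hypothetical.

```python
from typing import Tuple

# Assumed compression factors for the LTX-Video VAE (not stated in the text above).
SPATIAL_FACTOR = 32    # pixels per latent position along height and width
TEMPORAL_FACTOR = 8    # frames per latent frame
PIXEL_CHANNELS = 3     # RGB
LATENT_CHANNELS = 128


def nominal_compression_ratio() -> float:
    """Number of pixel values represented by a single latent value."""
    pixel_values_per_block = SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR * PIXEL_CHANNELS
    return pixel_values_per_block / LATENT_CHANNELS


def latent_shape(num_frames: int, height: int, width: int) -> Tuple[int, int, int]:
    """Latent (frames, height, width) for frame counts of the form 8k + 1."""
    latent_frames = (num_frames - 1) // TEMPORAL_FACTOR + 1
    return latent_frames, height // SPATIAL_FACTOR, width // SPATIAL_FACTOR


print(nominal_compression_ratio())   # 192.0
print(latent_shape(161, 512, 768))   # (21, 16, 24)
```

Under these assumptions, a 161-frame 768x512 clip is denoised as a 21x16x24 latent grid, which is why the transformer has comparatively few positions to process.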
| You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization. | |
| > [!TIP] | |
| > Click on the LTX-Video models in the right sidebar for more examples of other video generation tasks. | |
| The example below demonstrates how to generate a video optimized for memory or inference speed. | |
| <hfoptions id="usage"> | |
| <hfoption id="memory"> | |
| Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. | |
| The LTX-Video model below requires ~10GB of VRAM. | |
| ```py | |
| import torch | |
| from diffusers import LTXPipeline, AutoModel | |
| from diffusers.hooks import apply_group_offloading | |
| from diffusers.utils import export_to_video | |
| # fp8 layerwise weight-casting | |
| transformer = AutoModel.from_pretrained( | |
| "Lightricks/LTX-Video", | |
| subfolder="transformer", | |
| torch_dtype=torch.bfloat16 | |
| ) | |
| transformer.enable_layerwise_casting( | |
| storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16 | |
| ) | |
| pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16) | |
| # group-offloading | |
| onload_device = torch.device("cuda") | |
| offload_device = torch.device("cpu") | |
| pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True) | |
| apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2) | |
| apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level") | |
| prompt = """ | |
| A woman with long brown hair and light skin smiles at another woman with long blonde hair. | |
| The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. | |
| The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and | |
| natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage | |
| """ | |
| negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| video = pipeline( | |
| prompt=prompt, | |
| negative_prompt=negative_prompt, | |
| width=768, | |
| height=512, | |
| num_frames=161, | |
| decode_timestep=0.03, | |
| decode_noise_scale=0.025, | |
| num_inference_steps=50, | |
| ).frames[0] | |
| export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </hfoption> | |
| <hfoption id="inference speed"> | |
| [Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. | |
| ```py | |
| import torch | |
| from diffusers import LTXPipeline | |
| from diffusers.utils import export_to_video | |
| pipeline = LTXPipeline.from_pretrained( | |
| "Lightricks/LTX-Video", torch_dtype=torch.bfloat16 | |
| ) | |
| # torch.compile | |
| pipeline.transformer.to(memory_format=torch.channels_last) | |
| pipeline.transformer = torch.compile( | |
| pipeline.transformer, mode="max-autotune", fullgraph=True | |
| ) | |
| prompt = """ | |
| A woman with long brown hair and light skin smiles at another woman with long blonde hair. | |
| The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. | |
| The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and | |
| natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage | |
| """ | |
| negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| video = pipeline( | |
| prompt=prompt, | |
| negative_prompt=negative_prompt, | |
| width=768, | |
| height=512, | |
| num_frames=161, | |
| decode_timestep=0.03, | |
| decode_noise_scale=0.025, | |
| num_inference_steps=50, | |
| ).frames[0] | |
| export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </hfoption> | |
| </hfoptions> | |
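To see why storing weights in fp8 helps in the memory example, the back-of-the-envelope sketch below estimates weight storage only. It assumes roughly 2 billion transformer parameters for the base checkpoint (suggested by the `ltx-video-2b` GGUF filename referenced on this page) and ignores activations, the text encoder, and the VAE, so it does not reproduce the ~10GB figure. Layerwise casting may also keep some layers in higher precision, so real savings can be smaller. The function name is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}


def weight_storage_gb(num_params: float, storage_dtype: str) -> float:
    """Approximate weight storage in GB (10**9 bytes); weights only."""
    return num_params * BYTES_PER_PARAM[storage_dtype] / 1e9


# The 2B size is an assumption; the 13B size is the LTX-Video 0.9.7 transformer.
for name, params in [("2B transformer (assumed size)", 2e9), ("13B transformer", 13e9)]:
    bf16 = weight_storage_gb(params, "bfloat16")
    fp8 = weight_storage_gb(params, "float8_e4m3fn")
    print(f"{name}: bfloat16 ~{bf16:.0f} GB, float8 storage ~{fp8:.0f} GB")
```

Halving the bytes per weight is what makes fp8 storage attractive; group offloading then reduces how much of that storage has to sit on the GPU at once.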
| ## Notes | |
| - Refer to the following recommended settings for generation from the [LTX-Video](https://github.com/Lightricks/LTX-Video) repository. | |
| - The recommended dtype for the transformer, VAE, and text encoder is `torch.bfloat16`. The VAE and text encoder can also be `torch.float32` or `torch.float16`. | |
| - For guidance-distilled variants of LTX-Video, set `guidance_scale` to `1.0`. The `guidance_scale` for any other model should be set higher, like `5.0`, for good generation quality. | |
| - For timestep-aware VAE variants (LTX-Video 0.9.1 and above), set `decode_timestep` to `0.05` and `image_cond_noise_scale` to `0.025`. | |
| - For variants that support interpolation between multiple conditioning images and videos (LTX-Video 0.9.5 and above), use similar images and videos for the best results. Divergence from the conditioning inputs may lead to abrupt transitions in the generated video. | |
| - LTX-Video 0.9.7 includes a spatial latent upscaler and a 13B parameter transformer. During inference, a low resolution video is quickly generated first and then upscaled and refined. | |
| <details> | |
| <summary>Show example code</summary> | |
| ```py | |
| import torch | |
| from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline | |
| from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition | |
| from diffusers.utils import export_to_video, load_video | |
| pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16) | |
| pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16) | |
| pipeline.to("cuda") | |
| pipe_upsample.to("cuda") | |
| pipeline.vae.enable_tiling() | |
| def round_to_nearest_resolution_acceptable_by_vae(height, width): | |
|     # height and width are spatial dimensions, so round down to a multiple of the spatial ratio | |
|     height = height - (height % pipeline.vae_spatial_compression_ratio) | |
|     width = width - (width % pipeline.vae_spatial_compression_ratio) | |
|     return height, width | |
| video = load_video( | |
| "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4" | |
| )[:21] # only use the first 21 frames as conditioning | |
| condition1 = LTXVideoCondition(video=video, frame_index=0) | |
| prompt = """ | |
| The video depicts a winding mountain road covered in snow, with a single vehicle | |
| traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. | |
| The landscape is characterized by rugged terrain and a river visible in the distance. | |
| The scene captures the solitude and beauty of a winter drive through a mountainous region. | |
| """ | |
| negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| expected_height, expected_width = 768, 1152 | |
| downscale_factor = 2 / 3 | |
| num_frames = 161 | |
| # 1. Generate video at smaller resolution | |
| # Text-only conditioning is also supported without the need to pass `conditions` | |
| downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor) | |
| downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width) | |
| latents = pipeline( | |
| conditions=[condition1], | |
| prompt=prompt, | |
| negative_prompt=negative_prompt, | |
| width=downscaled_width, | |
| height=downscaled_height, | |
| num_frames=num_frames, | |
| num_inference_steps=30, | |
| decode_timestep=0.05, | |
| decode_noise_scale=0.025, | |
| image_cond_noise_scale=0.0, | |
| guidance_scale=5.0, | |
| guidance_rescale=0.7, | |
| generator=torch.Generator().manual_seed(0), | |
| output_type="latent", | |
| ).frames | |
| # 2. Upscale generated video using latent upsampler with fewer inference steps | |
| # The available latent upsampler upscales the height/width by 2x | |
| upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2 | |
| upscaled_latents = pipe_upsample( | |
| latents=latents, | |
| output_type="latent" | |
| ).frames | |
| # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended) | |
| video = pipeline( | |
| conditions=[condition1], | |
| prompt=prompt, | |
| negative_prompt=negative_prompt, | |
| width=upscaled_width, | |
| height=upscaled_height, | |
| num_frames=num_frames, | |
| denoise_strength=0.4, # Effectively, 4 inference steps out of 10 | |
| num_inference_steps=10, | |
| latents=upscaled_latents, | |
| decode_timestep=0.05, | |
| decode_noise_scale=0.025, | |
| image_cond_noise_scale=0.0, | |
| guidance_scale=5.0, | |
| guidance_rescale=0.7, | |
| generator=torch.Generator().manual_seed(0), | |
| output_type="pil", | |
| ).frames[0] | |
| # 4. Downscale the video to the expected resolution | |
| video = [frame.resize((expected_width, expected_height)) for frame in video] | |
| export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </details> | |
| - The LTX-Video 0.9.7 distilled model is guidance- and timestep-distilled to speed up generation. It requires `guidance_scale` to be set to `1.0`, and `num_inference_steps` should be set between `4` and `10` for good generation quality. You should also use the following custom timesteps for the best results. | |
| - Base model inference to prepare for upscaling: `[1000, 993, 987, 981, 975, 909, 725, 0.03]`. | |
| - Upscaling: `[1000, 909, 725, 421, 0]`. | |
| <details> | |
| <summary>Show example code</summary> | |
| ```py | |
| import torch | |
| from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline | |
| from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition | |
| from diffusers.utils import export_to_video, load_video | |
| pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16) | |
| pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16) | |
| pipeline.to("cuda") | |
| pipe_upsample.to("cuda") | |
| pipeline.vae.enable_tiling() | |
| def round_to_nearest_resolution_acceptable_by_vae(height, width): | |
|     # height and width are spatial dimensions, so round down to a multiple of the spatial ratio | |
|     height = height - (height % pipeline.vae_spatial_compression_ratio) | |
|     width = width - (width % pipeline.vae_spatial_compression_ratio) | |
|     return height, width | |
| prompt = """ | |
| artistic anatomical 3d render, utlra quality, human half full male body with transparent | |
| skin revealing structure instead of organs, muscular, intricate creative patterns, | |
| monochromatic with backlighting, lightning mesh, scientific concept art, blending biology | |
| with botany, surreal and ethereal quality, unreal engine 5, ray tracing, ultra realistic, | |
| 16K UHD, rich details. camera zooms out in a rotating fashion | |
| """ | |
| negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| expected_height, expected_width = 768, 1152 | |
| downscale_factor = 2 / 3 | |
| num_frames = 161 | |
| # 1. Generate video at smaller resolution | |
| downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor) | |
| downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width) | |
| latents = pipeline( | |
| prompt=prompt, | |
| negative_prompt=negative_prompt, | |
| width=downscaled_width, | |
| height=downscaled_height, | |
| num_frames=num_frames, | |
| timesteps=[1000, 993, 987, 981, 975, 909, 725, 0.03], | |
| decode_timestep=0.05, | |
| decode_noise_scale=0.025, | |
| image_cond_noise_scale=0.0, | |
| guidance_scale=1.0, | |
| guidance_rescale=0.7, | |
| generator=torch.Generator().manual_seed(0), | |
| output_type="latent", | |
| ).frames | |
| # 2. Upscale generated video using latent upsampler with fewer inference steps | |
| # The available latent upsampler upscales the height/width by 2x | |
| upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2 | |
| upscaled_latents = pipe_upsample( | |
| latents=latents, | |
| adain_factor=1.0, | |
| output_type="latent" | |
| ).frames | |
| # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended) | |
| video = pipeline( | |
| prompt=prompt, | |
| negative_prompt=negative_prompt, | |
| width=upscaled_width, | |
| height=upscaled_height, | |
| num_frames=num_frames, | |
| denoise_strength=0.999, # Effectively, 4 inference steps out of 5 | |
| timesteps=[1000, 909, 725, 421, 0], | |
| latents=upscaled_latents, | |
| decode_timestep=0.05, | |
| decode_noise_scale=0.025, | |
| image_cond_noise_scale=0.0, | |
| guidance_scale=1.0, | |
| guidance_rescale=0.7, | |
| generator=torch.Generator().manual_seed(0), | |
| output_type="pil", | |
| ).frames[0] | |
| # 4. Downscale the video to the expected resolution | |
| video = [frame.resize((expected_width, expected_height)) for frame in video] | |
| export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </details> | |
| - LTX-Video supports LoRAs with [load_lora_weights()](/docs/diffusers/pr_12229/en/api/loaders/lora#diffusers.loaders.LTXVideoLoraLoaderMixin.load_lora_weights). | |
| <details> | |
| <summary>Show example code</summary> | |
| ```py | |
| import torch | |
| from diffusers import LTXConditionPipeline | |
| from diffusers.utils import export_to_video, load_image | |
| pipeline = LTXConditionPipeline.from_pretrained( | |
| "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16 | |
| ) | |
| pipeline.load_lora_weights("Lightricks/LTX-Video-Cakeify-LoRA", adapter_name="cakeify") | |
| pipeline.set_adapters("cakeify") | |
| # use "CAKEIFY" to trigger the LoRA | |
| prompt = "CAKEIFY a person using a knife to cut a cake shaped like a Pikachu plushie" | |
| image = load_image("https://huggingface.co/Lightricks/LTX-Video-Cakeify-LoRA/resolve/main/assets/images/pikachu.png") | |
| video = pipeline( | |
| prompt=prompt, | |
| image=image, | |
| width=576, | |
| height=576, | |
| num_frames=161, | |
| decode_timestep=0.03, | |
| decode_noise_scale=0.025, | |
| num_inference_steps=50, | |
| ).frames[0] | |
| export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </details> | |
| - LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [loaders.FromOriginalModelMixin.from_single_file()](/docs/diffusers/pr_12229/en/api/loaders/single_file#diffusers.loaders.FromOriginalModelMixin.from_single_file) or [loaders.FromSingleFileMixin.from_single_file()](/docs/diffusers/pr_12229/en/api/loaders/single_file#diffusers.loaders.FromSingleFileMixin.from_single_file). | |
| <details> | |
| <summary>Show example code</summary> | |
| ```py | |
| import torch | |
| from diffusers.utils import export_to_video | |
| from diffusers import LTXPipeline, AutoModel, GGUFQuantizationConfig | |
| transformer = AutoModel.from_single_file( | |
| "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf", | |
| quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), | |
| torch_dtype=torch.bfloat16 | |
| ) | |
| pipeline = LTXPipeline.from_pretrained( | |
| "Lightricks/LTX-Video", | |
| transformer=transformer, | |
| torch_dtype=torch.bfloat16 | |
| ) | |
| ``` | |
| </details> | |
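The resolution and step arithmetic in the two-stage LTX-Video 0.9.7 examples above can be traced without loading any model. The sketch below is plain Python. It assumes a spatial compression ratio of 32 and approximates the effect of `denoise_strength` as `int(num_inference_steps * denoise_strength)`, which agrees with the comments in those examples but is not guaranteed to be how the pipeline computes it. All function names are hypothetical.

```python
SPATIAL_RATIO = 32  # assumed VAE spatial compression ratio


def round_down_to_multiple(value: int, multiple: int = SPATIAL_RATIO) -> int:
    """Largest multiple of `multiple` that is <= value."""
    return value - (value % multiple)


def two_stage_resolutions(expected_height, expected_width, downscale_factor=2 / 3, upscale=2):
    """Return ((low_h, low_w), (high_h, high_w)) for the generate-then-upscale flow."""
    low_h = round_down_to_multiple(int(expected_height * downscale_factor))
    low_w = round_down_to_multiple(int(expected_width * downscale_factor))
    return (low_h, low_w), (low_h * upscale, low_w * upscale)


def effective_steps(num_inference_steps: int, denoise_strength: float) -> int:
    """Assumed number of denoising steps actually run during refinement."""
    return int(num_inference_steps * denoise_strength)


low, high = two_stage_resolutions(768, 1152)
print(low, high)                  # (512, 768) (1024, 1536)
print(effective_steps(10, 0.4))   # 4
print(effective_steps(5, 0.999))  # 4
```

The refined video comes out at 1024x1536, larger than the requested 768x1152, which is why the examples finish by resizing each frame back to the expected resolution.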
| ## LTXPipeline[[diffusers.LTXPipeline]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.LTXPipeline</name><anchor>diffusers.LTXPipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx.py#L170</source><parameters>[{"name": "scheduler", "val": ": FlowMatchEulerDiscreteScheduler"}, {"name": "vae", "val": ": AutoencoderKLLTXVideo"}, {"name": "text_encoder", "val": ": T5EncoderModel"}, {"name": "tokenizer", "val": ": T5TokenizerFast"}, {"name": "transformer", "val": ": LTXVideoTransformer3DModel"}]</parameters><paramsdesc>- **transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/pr_12229/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) -- | |
| Conditional Transformer architecture to denoise the encoded video latents. | |
| - **scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_12229/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) -- | |
| A scheduler to be used in combination with `transformer` to denoise the encoded image latents. | |
| - **vae** ([AutoencoderKLLTXVideo](/docs/diffusers/pr_12229/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) -- | |
| Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. | |
| - **text_encoder** (`T5EncoderModel`) -- | |
| [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically | |
| the [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant. | |
| - **tokenizer** (`T5TokenizerFast`) -- | |
| Tokenizer of class | |
| [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Pipeline for text-to-video generation. | |
| Reference: https://github.com/Lightricks/LTX-Video | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__call__</name><anchor>diffusers.LTXPipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx.py#L535</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]] = None"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "height", "val": ": int = 512"}, {"name": "width", "val": ": int = 704"}, {"name": "num_frames", "val": ": int = 161"}, {"name": "frame_rate", "val": ": int = 25"}, {"name": "num_inference_steps", "val": ": int = 50"}, {"name": "timesteps", "val": ": typing.List[int] = None"}, {"name": "guidance_scale", "val": ": float = 3"}, {"name": "guidance_rescale", "val": ": float = 0.0"}, {"name": "num_videos_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "decode_timestep", "val": ": typing.Union[float, typing.List[float]] = 0.0"}, {"name": "decode_noise_scale", "val": ": typing.Union[float, typing.List[float], NoneType] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "attention_kwargs", "val": ": typing.Optional[typing.Dict[str, typing.Any]] = None"}, {"name": "callback_on_step_end", "val": ": typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None"}, {"name": "callback_on_step_end_tensor_inputs", "val": ": typing.List[str] = ['latents']"}, {"name": 
"max_sequence_length", "val": ": int = 128"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` | |
| instead. | |
| - **height** (`int`, defaults to `512`) -- | |
| The height in pixels of the generated video. | |
| - **width** (`int`, defaults to `704`) -- | |
| The width in pixels of the generated video. | |
| - **num_frames** (`int`, defaults to `161`) -- | |
| The number of video frames to generate. | |
| - **num_inference_steps** (`int`, *optional*, defaults to 50) -- | |
| The number of denoising steps. More denoising steps usually lead to a higher quality image at the | |
| expense of slower inference. | |
| - **timesteps** (`List[int]`, *optional*) -- | |
| Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument | |
| in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is | |
| passed will be used. Must be in descending order. | |
| - **guidance_scale** (`float`, defaults to `3`) -- | |
| Guidance scale as defined in [Classifier-Free Diffusion | |
| Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2 | |
| of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting | |
| `guidance_scale > 1`. A higher guidance scale encourages the model to generate videos that are closely | |
| linked to the text `prompt`, usually at the expense of lower visual quality. | |
| - **guidance_rescale** (`float`, *optional*, defaults to 0.0) -- | |
| Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are | |
| Flawed](https://arxiv.org/pdf/2305.08891.pdf). `guidance_rescale` is defined as `φ` in equation 16 of | |
| that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR. | |
| - **num_videos_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| The number of videos to generate per prompt. | |
| - **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) -- | |
| One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) | |
| to make generation deterministic. | |
| - **latents** (`torch.Tensor`, *optional*) -- | |
| Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image | |
| generation. Can be used to tweak the same generation with different prompts. If not provided, a latents | |
| tensor will be generated by sampling using the supplied random `generator`. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Pre-generated attention mask for text embeddings. | |
| - **negative_prompt_embeds** (`torch.FloatTensor`, *optional*) -- | |
| Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not | |
| provided, negative_prompt_embeds will be generated from `negative_prompt` input argument. | |
| - **negative_prompt_attention_mask** (`torch.FloatTensor`, *optional*) -- | |
| Pre-generated attention mask for negative text embeddings. | |
| - **decode_timestep** (`float`, defaults to `0.0`) -- | |
| The timestep at which generated video is decoded. | |
| - **decode_noise_scale** (`float`, defaults to `None`) -- | |
| The interpolation factor between random noise and denoised latents at the decode timestep. | |
| - **output_type** (`str`, *optional*, defaults to `"pil"`) -- | |
| The output format of the generated video. Choose between | |
| [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple. | |
| - **attention_kwargs** (`dict`, *optional*) -- | |
| A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under | |
| `self.processor` in | |
| [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). | |
| - **callback_on_step_end** (`Callable`, *optional*) -- | |
| A function that is called at the end of each denoising step during inference. The function is called | |
| with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, | |
| callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by | |
| `callback_on_step_end_tensor_inputs`. | |
| - **callback_on_step_end_tensor_inputs** (`List`, *optional*) -- | |
| The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list | |
| will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the | |
| `._callback_tensor_inputs` attribute of your pipeline class. | |
| - **max_sequence_length** (`int`, defaults to `128`) -- | |
| Maximum sequence length to use with the `prompt`.</paramsdesc><paramgroups>0</paramgroups><rettype>`~pipelines.ltx.LTXPipelineOutput` or `tuple`</rettype><retdesc>If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is | |
| returned where the first element is a list with the generated images.</retdesc></docstring> | |
| Function invoked when calling the pipeline for generation. | |
| <ExampleCodeBlock anchor="diffusers.LTXPipeline.__call__.example"> | |
| Examples: | |
| ```py | |
| >>> import torch | |
| >>> from diffusers import LTXPipeline | |
| >>> from diffusers.utils import export_to_video | |
| >>> pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16) | |
| >>> pipe.to("cuda") | |
| >>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage" | |
| >>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| >>> video = pipe( | |
| ... prompt=prompt, | |
| ... negative_prompt=negative_prompt, | |
| ... width=704, | |
| ... height=480, | |
| ... num_frames=161, | |
| ... num_inference_steps=50, | |
| ... ).frames[0] | |
| >>> export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </ExampleCodeBlock> | |
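For intuition about `guidance_scale` and `guidance_rescale`, the sketch below applies the classifier-free guidance combination and the standard-deviation rescaling from the referenced papers to small Python lists. It illustrates the formulas only and is not the pipeline's implementation: real predictions are tensors, and the function names here are hypothetical.

```python
import statistics


def classifier_free_guidance(uncond, text, guidance_scale):
    """Element-wise uncond + w * (text - uncond)."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


def rescale_guidance(guided, text, guidance_rescale):
    """Scale `guided` to the std of `text`, then blend with the original by phi."""
    factor = statistics.pstdev(text) / statistics.pstdev(guided)
    return [guidance_rescale * (g * factor) + (1 - guidance_rescale) * g for g in guided]


uncond = [0.0, 1.0, 2.0, 3.0]  # toy prediction without the prompt
text = [1.0, 2.0, 4.0, 5.0]    # toy prediction with the prompt

# With guidance_scale = 1 the text-conditioned prediction is returned unchanged.
print(classifier_free_guidance(uncond, text, 1.0))  # [1.0, 2.0, 4.0, 5.0]

guided = classifier_free_guidance(uncond, text, 3.0)
print(guided)  # [3.0, 4.0, 8.0, 9.0]

# With guidance_rescale = 0 nothing changes; with 1 the spread matches `text`.
rescaled = rescale_guidance(guided, text, 1.0)
```

Larger `guidance_scale` values widen the spread of the combined prediction, and `guidance_rescale` pulls that spread back toward the text-conditioned one.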
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_prompt</name><anchor>diffusers.LTXPipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx.py#L283</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]]"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "do_classifier_free_guidance", "val": ": bool = True"}, {"name": "num_videos_per_prompt", "val": ": int = 1"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "max_sequence_length", "val": ": int = 128"}, {"name": "device", "val": ": typing.Optional[torch.device] = None"}, {"name": "dtype", "val": ": typing.Optional[torch.dtype] = None"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) -- | |
| prompt to be encoded | |
| - **negative_prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts not to guide the image generation. If not defined, one has to pass | |
| `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is | |
| less than `1`). | |
| - **do_classifier_free_guidance** (`bool`, *optional*, defaults to `True`) -- | |
| Whether to use classifier free guidance or not. | |
| - **num_videos_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| Number of videos that should be generated per prompt. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt | |
| weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input | |
| argument. | |
| - **device** (`torch.device`, *optional*) -- | |
| torch device to place the resulting embeddings on | |
| - **dtype** (`torch.dtype`, *optional*) -- | |
| torch dtype</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Encodes the prompt into text encoder hidden states. | |
| </div></div> | |
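A `callback_on_step_end` function, as described in the `__call__` parameters above, receives the pipeline, the step index, the timestep, and a dict of the requested tensors. The sketch below shows the shape of such a function and exercises it with stand-in values instead of a real pipeline. Returning the `callback_kwargs` dict reflects how diffusers pipelines commonly consume the callback's result; that is an assumption here rather than something this page states. The names `log_step` and `step_log` are hypothetical.

```python
from typing import Any, Dict, List

step_log: List[str] = []


def log_step(pipeline: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record which tensors were passed at this step and hand them back unchanged."""
    step_log.append(f"step={step} timestep={timestep} tensors={sorted(callback_kwargs)}")
    return callback_kwargs


# Stand-in call with dummy values. In real use the function would be passed as
# pipe(..., callback_on_step_end=log_step, callback_on_step_end_tensor_inputs=["latents"]).
result = log_step(None, 0, 1000, {"latents": [0.0, 0.1]})
print(step_log[0])  # step=0 timestep=1000 tensors=['latents']
```

Only tensors named in `callback_on_step_end_tensor_inputs` appear in `callback_kwargs`, so the log doubles as a quick check of what the pipeline exposes.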
| ## LTXImageToVideoPipeline[[diffusers.LTXImageToVideoPipeline]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.LTXImageToVideoPipeline</name><anchor>diffusers.LTXImageToVideoPipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L189</source><parameters>[{"name": "scheduler", "val": ": FlowMatchEulerDiscreteScheduler"}, {"name": "vae", "val": ": AutoencoderKLLTXVideo"}, {"name": "text_encoder", "val": ": T5EncoderModel"}, {"name": "tokenizer", "val": ": T5TokenizerFast"}, {"name": "transformer", "val": ": LTXVideoTransformer3DModel"}]</parameters><paramsdesc>- **transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/pr_12229/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) -- | |
| Conditional Transformer architecture to denoise the encoded video latents. | |
| - **scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_12229/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) -- | |
| A scheduler to be used in combination with `transformer` to denoise the encoded image latents. | |
| - **vae** ([AutoencoderKLLTXVideo](/docs/diffusers/pr_12229/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) -- | |
| Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. | |
| - **text_encoder** (`T5EncoderModel`) -- | |
| [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically | |
| the [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant. | |
| - **tokenizer** (`T5TokenizerFast`) -- | |
| Tokenizer of class | |
| [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Pipeline for image-to-video generation. | |
| Reference: https://github.com/Lightricks/LTX-Video | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__call__</name><anchor>diffusers.LTXImageToVideoPipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L596</source><parameters>[{"name": "image", "val": ": typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None"}, {"name": "prompt", "val": ": typing.Union[str, typing.List[str]] = None"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "height", "val": ": int = 512"}, {"name": "width", "val": ": int = 704"}, {"name": "num_frames", "val": ": int = 161"}, {"name": "frame_rate", "val": ": int = 25"}, {"name": "num_inference_steps", "val": ": int = 50"}, {"name": "timesteps", "val": ": typing.List[int] = None"}, {"name": "guidance_scale", "val": ": float = 3"}, {"name": "guidance_rescale", "val": ": float = 0.0"}, {"name": "num_videos_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "decode_timestep", "val": ": typing.Union[float, typing.List[float]] = 0.0"}, {"name": "decode_noise_scale", "val": ": typing.Union[float, typing.List[float], NoneType] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "attention_kwargs", "val": ": typing.Optional[typing.Dict[str, typing.Any]] = None"}, {"name": 
"callback_on_step_end", "val": ": typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None"}, {"name": "callback_on_step_end_tensor_inputs", "val": ": typing.List[str] = ['latents']"}, {"name": "max_sequence_length", "val": ": int = 128"}]</parameters><paramsdesc>- **image** (`PipelineImageInput`) -- | |
| The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`. | |
| - **prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` | |
| instead. | |
| - **height** (`int`, defaults to `512`) -- | |
| The height in pixels of the generated video. This is set to 512 by default for the best results. | |
| - **width** (`int`, defaults to `704`) -- | |
| The width in pixels of the generated video. This is set to 704 by default for the best results. | |
| - **num_frames** (`int`, defaults to `161`) -- | |
| The number of video frames to generate. | |
| - **num_inference_steps** (`int`, *optional*, defaults to 50) -- | |
| The number of denoising steps. More denoising steps usually lead to a higher quality image at the | |
| expense of slower inference. | |
| - **timesteps** (`List[int]`, *optional*) -- | |
| Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument | |
| in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is | |
| passed will be used. Must be in descending order. | |
| - **guidance_scale** (`float`, defaults to `3`) -- | |
| Guidance scale as defined in [Classifier-Free Diffusion | |
| Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2 | |
| of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting | |
| `guidance_scale > 1`. A higher guidance scale encourages the model to generate videos that are closely | |
| linked to the text `prompt`, usually at the expense of lower visual quality. | |
| - **guidance_rescale** (`float`, *optional*, defaults to 0.0) -- | |
| Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are | |
| Flawed](https://arxiv.org/pdf/2305.08891.pdf). `guidance_rescale` is defined as `φ` in equation 16 of | |
| that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR. | |
| - **num_videos_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| The number of videos to generate per prompt. | |
| - **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) -- | |
| One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) | |
| to make generation deterministic. | |
| - **latents** (`torch.Tensor`, *optional*) -- | |
| Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image | |
| generation. Can be used to tweak the same generation with different prompts. If not provided, a latents | |
| tensor will be generated by sampling using the supplied random `generator`. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Pre-generated attention mask for text embeddings. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. If not provided, `negative_prompt_embeds` will be generated | |
| from the `negative_prompt` input argument. | |
| - **negative_prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Pre-generated attention mask for negative text embeddings. | |
| - **decode_timestep** (`float`, defaults to `0.0`) -- | |
| The timestep at which generated video is decoded. | |
| - **decode_noise_scale** (`float`, defaults to `None`) -- | |
| The interpolation factor between random noise and denoised latents at the decode timestep. | |
| - **output_type** (`str`, *optional*, defaults to `"pil"`) -- | |
| The output format of the generated video. Choose between | |
| [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple. | |
| - **attention_kwargs** (`dict`, *optional*) -- | |
| A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under | |
| `self.processor` in | |
| [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). | |
| - **callback_on_step_end** (`Callable`, *optional*) -- | |
| A function that is called at the end of each denoising step during inference. The function is called | |
| with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, | |
| callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by | |
| `callback_on_step_end_tensor_inputs`. | |
| - **callback_on_step_end_tensor_inputs** (`List`, *optional*) -- | |
| The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list | |
| will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the | |
| `._callback_tensor_inputs` attribute of your pipeline class. | |
| - **max_sequence_length** (`int`, defaults to `128`) -- | |
| Maximum sequence length to use with the `prompt`.</paramsdesc><paramgroups>0</paramgroups><rettype>`~pipelines.ltx.LTXPipelineOutput` or `tuple`</rettype><retdesc>If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is | |
| returned where the first element is a list with the generated images.</retdesc></docstring> | |
| Function invoked when calling the pipeline for generation. | |
| <ExampleCodeBlock anchor="diffusers.LTXImageToVideoPipeline.__call__.example"> | |
| Examples: | |
| ```py | |
| >>> import torch | |
| >>> from diffusers import LTXImageToVideoPipeline | |
| >>> from diffusers.utils import export_to_video, load_image | |
| >>> pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16) | |
| >>> pipe.to("cuda") | |
| >>> image = load_image( | |
| ... "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png" | |
| ... ) | |
| >>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene." | |
| >>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| >>> video = pipe( | |
| ... image=image, | |
| ... prompt=prompt, | |
| ... negative_prompt=negative_prompt, | |
| ... width=704, | |
| ... height=480, | |
| ... num_frames=161, | |
| ... num_inference_steps=50, | |
| ... ).frames[0] | |
| >>> export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </ExampleCodeBlock> | |
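The `guidance_scale` and `guidance_rescale` parameters described above can be illustrated with a small, library-free sketch. It is not the pipeline's implementation: the helper names `classifier_free_guidance` and `rescale_guidance` are hypothetical, plain Python lists stand in for tensors, and the standard deviation is taken over the whole list rather than per sample.

```python
import statistics


def classifier_free_guidance(uncond, text, guidance_scale):
    """Move the unconditional prediction toward the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


def rescale_guidance(guided, text, guidance_rescale):
    """Blend the guided prediction with a copy rescaled to the std of the text prediction.

    Assumes `guided` is not constant (its standard deviation must be non-zero).
    """
    std_text = statistics.pstdev(text)
    std_guided = statistics.pstdev(guided)
    rescaled = [g * (std_text / std_guided) for g in guided]
    return [
        guidance_rescale * r + (1.0 - guidance_rescale) * g
        for r, g in zip(rescaled, guided)
    ]


# Toy "noise predictions" standing in for the transformer outputs.
uncond = [0.0, 0.5, -0.5, 1.0]
text = [0.2, 0.9, -0.7, 1.4]

guided = classifier_free_guidance(uncond, text, guidance_scale=3.0)
final = rescale_guidance(guided, text, guidance_rescale=0.7)
```

With `guidance_scale=1.0` the guided prediction reduces to the text-conditioned one, and with `guidance_rescale=0.0` the rescaling step leaves the guided prediction unchanged, which matches the defaults documented above.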
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_prompt</name><anchor>diffusers.LTXImageToVideoPipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L306</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]]"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "do_classifier_free_guidance", "val": ": bool = True"}, {"name": "num_videos_per_prompt", "val": ": int = 1"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "max_sequence_length", "val": ": int = 128"}, {"name": "device", "val": ": typing.Optional[torch.device] = None"}, {"name": "dtype", "val": ": typing.Optional[torch.dtype] = None"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) -- | |
| prompt to be encoded | |
| - **negative_prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts not to guide the image generation. If not defined, one has to pass | |
| `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is | |
| less than `1`). | |
| - **do_classifier_free_guidance** (`bool`, *optional*, defaults to `True`) -- | |
| Whether to use classifier free guidance or not. | |
| - **num_videos_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| Number of videos that should be generated per prompt. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt | |
| weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input | |
| argument. | |
| - **device** (`torch.device`, *optional*) -- | |
| The torch device on which to place the resulting embeddings. | |
| - **dtype** (`torch.dtype`, *optional*) -- | |
| The torch dtype of the resulting embeddings.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Encodes the prompt into text encoder hidden states. | |
| </div></div> | |
| ## LTXConditionPipeline[[diffusers.LTXConditionPipeline]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.LTXConditionPipeline</name><anchor>diffusers.LTXConditionPipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_condition.py#L252</source><parameters>[{"name": "scheduler", "val": ": FlowMatchEulerDiscreteScheduler"}, {"name": "vae", "val": ": AutoencoderKLLTXVideo"}, {"name": "text_encoder", "val": ": T5EncoderModel"}, {"name": "tokenizer", "val": ": T5TokenizerFast"}, {"name": "transformer", "val": ": LTXVideoTransformer3DModel"}]</parameters><paramsdesc>- **transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/pr_12229/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) -- | |
| Conditional Transformer architecture to denoise the encoded video latents. | |
| - **scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_12229/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) -- | |
| A scheduler to be used in combination with `transformer` to denoise the encoded image latents. | |
| - **vae** ([AutoencoderKLLTXVideo](/docs/diffusers/pr_12229/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) -- | |
| Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. | |
| - **text_encoder** (`T5EncoderModel`) -- | |
| [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically | |
| the [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant. | |
| - **tokenizer** (`T5TokenizerFast`) -- | |
| Tokenizer of class | |
| [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Pipeline for text/image/video-to-video generation. | |
| Reference: https://github.com/Lightricks/LTX-Video | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__call__</name><anchor>diffusers.LTXConditionPipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_condition.py#L848</source><parameters>[{"name": "conditions", "val": ": typing.Union[diffusers.pipelines.ltx.pipeline_ltx_condition.LTXVideoCondition, typing.List[diffusers.pipelines.ltx.pipeline_ltx_condition.LTXVideoCondition]] = None"}, {"name": "image", "val": ": typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None"}, {"name": "video", "val": ": typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None"}, {"name": "frame_index", "val": ": typing.Union[int, typing.List[int]] = 0"}, {"name": "strength", "val": ": typing.Union[float, typing.List[float]] = 1.0"}, {"name": "denoise_strength", "val": ": float = 1.0"}, {"name": "prompt", "val": ": typing.Union[str, typing.List[str]] = None"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "height", "val": ": int = 512"}, {"name": "width", "val": ": int = 704"}, {"name": "num_frames", "val": ": int = 161"}, {"name": "frame_rate", "val": ": int = 25"}, {"name": "num_inference_steps", "val": ": int = 50"}, {"name": "timesteps", "val": ": typing.List[int] = None"}, {"name": "guidance_scale", "val": ": float = 3"}, {"name": "guidance_rescale", "val": ": float = 0.0"}, {"name": "image_cond_noise_scale", "val": ": float = 0.15"}, {"name": "num_videos_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": 
"latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "decode_timestep", "val": ": typing.Union[float, typing.List[float]] = 0.0"}, {"name": "decode_noise_scale", "val": ": typing.Union[float, typing.List[float], NoneType] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "attention_kwargs", "val": ": typing.Optional[typing.Dict[str, typing.Any]] = None"}, {"name": "callback_on_step_end", "val": ": typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None"}, {"name": "callback_on_step_end_tensor_inputs", "val": ": typing.List[str] = ['latents']"}, {"name": "max_sequence_length", "val": ": int = 256"}]</parameters><paramsdesc>- **conditions** (`List[LTXVideoCondition], *optional*`) -- | |
| The list of frame-conditioning items for the video generation. If not provided, conditions will be | |
| created using `image`, `video`, `frame_index` and `strength`. | |
| - **image** (`PipelineImageInput` or `List[PipelineImageInput]`, *optional*) -- | |
| The image or images to condition the video generation. If not provided, one has to pass `video` or | |
| `conditions`. | |
| - **video** (`List[PipelineImageInput]`, *optional*) -- | |
| The video to condition the video generation. If not provided, one has to pass `image` or `conditions`. | |
| - **frame_index** (`int` or `List[int]`, *optional*) -- | |
| The frame index or frame indices at which the image or video will conditionally affect the video | |
| generation. If not provided, one has to pass `conditions`. | |
| - **strength** (`float` or `List[float]`, *optional*) -- | |
| The strength or strengths of the conditioning effect. If not provided, one has to pass `conditions`. | |
| - **denoise_strength** (`float`, defaults to `1.0`) -- | |
| The strength of the noise added to the latents for editing. Higher strength leads to more noise added | |
| to the latents, therefore leading to more differences between original video and generated video. This | |
| is useful for video-to-video editing. | |
| - **prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` | |
| instead. | |
| - **height** (`int`, defaults to `512`) -- | |
| The height in pixels of the generated video. This is set to 512 by default for the best results. | |
| - **width** (`int`, defaults to `704`) -- | |
| The width in pixels of the generated video. This is set to 704 by default for the best results. | |
| - **num_frames** (`int`, defaults to `161`) -- | |
| The number of video frames to generate. | |
| - **num_inference_steps** (`int`, *optional*, defaults to 50) -- | |
| The number of denoising steps. More denoising steps usually lead to a higher quality image at the | |
| expense of slower inference. | |
| - **timesteps** (`List[int]`, *optional*) -- | |
| Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument | |
| in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is | |
| passed will be used. Must be in descending order. | |
| - **guidance_scale** (`float`, defaults to `3`) -- | |
| Guidance scale as defined in [Classifier-Free Diffusion | |
| Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2 | |
| of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting | |
| `guidance_scale > 1`. A higher guidance scale encourages the model to generate videos that are closely | |
| linked to the text `prompt`, usually at the expense of lower visual quality. | |
| - **guidance_rescale** (`float`, *optional*, defaults to 0.0) -- | |
| Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are | |
| Flawed](https://arxiv.org/pdf/2305.08891.pdf). `guidance_rescale` is defined as `φ` in equation 16 of | |
| that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR. | |
| - **num_videos_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| The number of videos to generate per prompt. | |
| - **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) -- | |
| One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) | |
| to make generation deterministic. | |
| - **latents** (`torch.Tensor`, *optional*) -- | |
| Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image | |
| generation. Can be used to tweak the same generation with different prompts. If not provided, a latents | |
| tensor will be generated by sampling using the supplied random `generator`. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Pre-generated attention mask for text embeddings. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. If not provided, `negative_prompt_embeds` will be generated | |
| from the `negative_prompt` input argument. | |
| - **negative_prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Pre-generated attention mask for negative text embeddings. | |
| - **decode_timestep** (`float`, defaults to `0.0`) -- | |
| The timestep at which generated video is decoded. | |
| - **decode_noise_scale** (`float`, defaults to `None`) -- | |
| The interpolation factor between random noise and denoised latents at the decode timestep. | |
| - **output_type** (`str`, *optional*, defaults to `"pil"`) -- | |
| The output format of the generated video. Choose between | |
| [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple. | |
| - **attention_kwargs** (`dict`, *optional*) -- | |
| A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under | |
| `self.processor` in | |
| [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). | |
| - **callback_on_step_end** (`Callable`, *optional*) -- | |
| A function that is called at the end of each denoising step during inference. The function is called | |
| with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, | |
| callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by | |
| `callback_on_step_end_tensor_inputs`. | |
| - **callback_on_step_end_tensor_inputs** (`List`, *optional*) -- | |
| The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list | |
| will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the | |
| `._callback_tensor_inputs` attribute of your pipeline class. | |
| - **max_sequence_length** (`int`, defaults to `256`) -- | |
| Maximum sequence length to use with the `prompt`.</paramsdesc><paramgroups>0</paramgroups><rettype>`~pipelines.ltx.LTXPipelineOutput` or `tuple`</rettype><retdesc>If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is | |
| returned where the first element is a list with the generated images.</retdesc></docstring> | |
| Function invoked when calling the pipeline for generation. | |
| <ExampleCodeBlock anchor="diffusers.LTXConditionPipeline.__call__.example"> | |
| Examples: | |
| ```py | |
| >>> import torch | |
| >>> from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXConditionPipeline, LTXVideoCondition | |
| >>> from diffusers.utils import export_to_video, load_video, load_image | |
| >>> pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16) | |
| >>> pipe.to("cuda") | |
| >>> # Load input image and video | |
| >>> video = load_video( | |
| ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4" | |
| ... ) | |
| >>> image = load_image( | |
| ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg" | |
| ... ) | |
| >>> # Create conditioning objects | |
| >>> condition1 = LTXVideoCondition( | |
| ... image=image, | |
| ... frame_index=0, | |
| ... ) | |
| >>> condition2 = LTXVideoCondition( | |
| ... video=video, | |
| ... frame_index=80, | |
| ... ) | |
| >>> prompt = "The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered clouds, suggesting a bright, sunny day. And then the camera switch to a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region." | |
| >>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" | |
| >>> # Generate video | |
| >>> generator = torch.Generator("cuda").manual_seed(0) | |
| >>> # Text-only conditioning is also supported without the need to pass `conditions` | |
| >>> video = pipe( | |
| ... conditions=[condition1, condition2], | |
| ... prompt=prompt, | |
| ... negative_prompt=negative_prompt, | |
| ... width=768, | |
| ... height=512, | |
| ... num_frames=161, | |
| ... num_inference_steps=40, | |
| ... generator=generator, | |
| ... ).frames[0] | |
| >>> export_to_video(video, "output.mp4", fps=24) | |
| ``` | |
| </ExampleCodeBlock> | |
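The `callback_on_step_end` parameter described above takes a function with the signature `(pipe, step, timestep, callback_kwargs)`. The sketch below shows one possible shape for such a callback. The factory name `make_step_logger` is hypothetical, and the short loop only imitates how a pipeline might invoke the callback so the snippet runs without loading a model.

```python
def make_step_logger(log):
    """Build a callback that records each step and the tensor names it received."""

    def on_step_end(pipe, step, timestep, callback_kwargs):
        # `callback_kwargs` holds the tensors requested through
        # `callback_on_step_end_tensor_inputs` (by default only "latents").
        log.append((step, timestep, sorted(callback_kwargs)))
        # Return the dict so that any (possibly modified) tensors can be read back.
        return callback_kwargs

    return on_step_end


log = []
callback = make_step_logger(log)

# Stand-in for the denoising loop; a real pipeline passes itself as `pipe`
# and real tensors in `callback_kwargs`.
for step, timestep in enumerate([1000, 500]):
    callback(None, step, timestep, {"latents": [0.0]})
```

In real use the callback would be passed as `pipe(..., callback_on_step_end=callback, callback_on_step_end_tensor_inputs=["latents"])`.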
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>add_noise_to_image_conditioning_latents</name><anchor>diffusers.LTXConditionPipeline.add_noise_to_image_conditioning_latents</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_condition.py#L646</source><parameters>[{"name": "t", "val": ": float"}, {"name": "init_latents", "val": ": Tensor"}, {"name": "latents", "val": ": Tensor"}, {"name": "noise_scale", "val": ": float"}, {"name": "conditioning_mask", "val": ": Tensor"}, {"name": "generator", "val": ""}, {"name": "eps", "val": " = 1e-06"}]</parameters></docstring> | |
| Add timestep-dependent noise to the hard-conditioning latents. This helps with motion continuity, especially | |
| when conditioned on a single frame. | |
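A rough, library-free sketch of the idea follows. It is not the pipeline's code: plain lists stand in for tensors, the helper name `add_noise_to_conditioning` is hypothetical, and scaling the noise by `t ** 2` is an assumption made here to show "timestep-dependent" noise. Only positions whose mask value is (almost) exactly 1, the hard-conditioned ones, are re-noised.

```python
import random


def add_noise_to_conditioning(t, init_latents, latents, noise_scale, conditioning_mask, rng, eps=1e-6):
    """Re-noise hard-conditioned positions; leave all other positions untouched."""
    out = []
    for init, current, mask in zip(init_latents, latents, conditioning_mask):
        if mask > 1.0 - eps:
            # Assumed schedule: noise shrinks quadratically as t goes to 0.
            noise = rng.gauss(0.0, 1.0)
            out.append(init + noise_scale * noise * t**2)
        else:
            out.append(current)
    return out


init_latents = [1.0, 2.0, 3.0]          # clean conditioning latents
latents = [0.1, 0.2, 0.3]               # current latents in the denoising loop
conditioning_mask = [1.0, 0.5, 0.0]     # only the first position is hard-conditioned

rng = random.Random(0)
noised = add_noise_to_conditioning(0.8, init_latents, latents, 0.15, conditioning_mask, rng)
```

At `t = 0` the hard-conditioned positions return to their clean values, so the added noise only perturbs the conditioning during the earlier, noisier steps.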
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_prompt</name><anchor>diffusers.LTXConditionPipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_condition.py#L369</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]]"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "do_classifier_free_guidance", "val": ": bool = True"}, {"name": "num_videos_per_prompt", "val": ": int = 1"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "max_sequence_length", "val": ": int = 256"}, {"name": "device", "val": ": typing.Optional[torch.device] = None"}, {"name": "dtype", "val": ": typing.Optional[torch.dtype] = None"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) -- | |
| prompt to be encoded | |
| - **negative_prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts not to guide the image generation. If not defined, one has to pass | |
| `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is | |
| less than `1`). | |
| - **do_classifier_free_guidance** (`bool`, *optional*, defaults to `True`) -- | |
| Whether to use classifier free guidance or not. | |
| - **num_videos_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| Number of videos that should be generated per prompt. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt | |
| weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input | |
| argument. | |
| - **device** (`torch.device`, *optional*) -- | |
| The torch device to place the resulting embeddings on. | |
| - **dtype** (`torch.dtype`, *optional*) -- | |
| The torch dtype.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Encodes the prompt into text encoder hidden states. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>trim_conditioning_sequence</name><anchor>diffusers.LTXConditionPipeline.trim_conditioning_sequence</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_condition.py#L629</source><parameters>[{"name": "start_frame", "val": ": int"}, {"name": "sequence_num_frames", "val": ": int"}, {"name": "target_num_frames", "val": ": int"}]</parameters><paramsdesc>- **start_frame** (int) -- The target frame number of the first frame in the sequence. | |
| - **sequence_num_frames** (int) -- The number of frames in the sequence. | |
| - **target_num_frames** (int) -- The target number of frames in the generated video.</paramsdesc><paramgroups>0</paramgroups><rettype>int</rettype><retdesc>updated sequence length</retdesc></docstring> | |
| Trim a conditioning sequence to the allowed number of frames. | |
| </div></div> | |
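
The trimming described for `trim_conditioning_sequence` can be sketched in plain Python. This is an illustrative sketch, not the pipeline's code. It assumes a temporal compression ratio of 8 (the `temporal_compression_ratio` argument is hypothetical) and assumes the sequence is rounded down to a multiple of that ratio plus one frame. The library's actual rule may differ.

```python
def trim_conditioning_sequence(
    start_frame: int,
    sequence_num_frames: int,
    target_num_frames: int,
    temporal_compression_ratio: int = 8,  # assumption: hypothetical default, not from the docs
) -> int:
    """Return the number of conditioning frames that fit in the target video."""
    # Keep only the frames that fit between start_frame and the end of the video.
    num_frames = min(sequence_num_frames, target_num_frames - start_frame)
    # Assumed rule: round down to k * ratio + 1 frames.
    return (num_frames - 1) // temporal_compression_ratio * temporal_compression_ratio + 1


# A 21-frame sequence starting at frame 0 of a 161-frame video.
print(trim_conditioning_sequence(0, 21, 161))    # 17
# The same sequence starting at frame 150 only has 11 frames of room left.
print(trim_conditioning_sequence(150, 21, 161))  # 9
```

The sketch does not guard against a `start_frame` at or beyond `target_num_frames`. A caller would need to validate that separately.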
| ## LTXLatentUpsamplePipeline[[diffusers.LTXLatentUpsamplePipeline]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.LTXLatentUpsamplePipeline</name><anchor>diffusers.LTXLatentUpsamplePipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L46</source><parameters>[{"name": "vae", "val": ": AutoencoderKLLTXVideo"}, {"name": "latent_upsampler", "val": ": LTXLatentUpsamplerModel"}]</parameters></docstring> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__call__</name><anchor>diffusers.LTXLatentUpsamplePipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L208</source><parameters>[{"name": "video", "val": ": typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None"}, {"name": "height", "val": ": int = 512"}, {"name": "width", "val": ": int = 704"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "decode_timestep", "val": ": typing.Union[float, typing.List[float]] = 0.0"}, {"name": "decode_noise_scale", "val": ": typing.Union[float, typing.List[float], NoneType] = None"}, {"name": "adain_factor", "val": ": float = 0.0"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>adain_filter_latent</name><anchor>diffusers.LTXLatentUpsamplePipeline.adain_filter_latent</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L96</source><parameters>[{"name": "latents", "val": ": Tensor"}, {"name": "reference_latents", "val": ": Tensor"}, {"name": "factor", "val": ": float = 1.0"}]</parameters><paramsdesc>- **latents** (`torch.Tensor`) -- | |
| Input latents to normalize. | |
| - **reference_latents** (`torch.Tensor`) -- | |
| The reference latents providing style statistics. | |
| - **factor** (`float`) -- | |
| Blending factor between original and transformed latent. Range: -10.0 to 10.0, Default: 1.0</paramsdesc><paramgroups>0</paramgroups><rettype>torch.Tensor</rettype><retdesc>The transformed latent tensor</retdesc></docstring> | |
| Applies Adaptive Instance Normalization (AdaIN) to a latent tensor based on statistics from a reference latent | |
| tensor. | |
| </div> | |
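
The AdaIN operation described above can be illustrated on a single channel with plain Python lists. This is a minimal sketch rather than the pipeline's implementation. The helper name `adain_filter_channel` is hypothetical, it works on a flat list instead of a `torch.Tensor`, and it assumes the sample standard deviation and a linear blend controlled by `factor`.

```python
from statistics import mean, stdev


def adain_filter_channel(latents, reference_latents, factor=1.0):
    """Match the mean and std of `latents` to `reference_latents` for one channel."""
    # Statistics of the input channel and of the reference channel.
    # stdev needs at least two values, and a constant input gives a zero std.
    src_mean, src_std = mean(latents), stdev(latents)
    ref_mean, ref_std = mean(reference_latents), stdev(reference_latents)
    # Normalize the input, then rescale and shift it to the reference statistics.
    transformed = [(x - src_mean) / src_std * ref_std + ref_mean for x in latents]
    # Linear blend: factor=0 keeps the original values, factor=1 uses the transformed ones.
    return [x + factor * (t - x) for x, t in zip(latents, transformed)]


full = adain_filter_channel([0.0, 2.0, 4.0], [10.0, 20.0, 30.0], factor=1.0)
half = adain_filter_channel([0.0, 2.0, 4.0], [10.0, 20.0, 30.0], factor=0.5)
print(full)  # approximately [10.0, 20.0, 30.0]
print(half)  # approximately [5.0, 11.0, 17.0]
```

In a real latent tensor, the same statistics would be computed separately for each batch item and channel.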
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>disable_vae_slicing</name><anchor>diffusers.LTXLatentUpsamplePipeline.disable_vae_slicing</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L159</source><parameters>[]</parameters></docstring> | |
| Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to | |
| computing decoding in one step. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>disable_vae_tiling</name><anchor>diffusers.LTXLatentUpsamplePipeline.disable_vae_tiling</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L186</source><parameters>[]</parameters></docstring> | |
| Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to | |
| computing decoding in one step. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>enable_vae_slicing</name><anchor>diffusers.LTXLatentUpsamplePipeline.enable_vae_slicing</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L146</source><parameters>[]</parameters></docstring> | |
| Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to | |
| compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>enable_vae_tiling</name><anchor>diffusers.LTXLatentUpsamplePipeline.enable_vae_tiling</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_ltx_latent_upsample.py#L172</source><parameters>[]</parameters></docstring> | |
| Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to | |
| compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow | |
| processing larger images. | |
| </div></div> | |
| ## LTXPipelineOutput[[diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput</name><anchor>diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/pipelines/ltx/pipeline_output.py#L9</source><parameters>[{"name": "frames", "val": ": Tensor"}]</parameters><paramsdesc>- **frames** (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]) -- | |
| List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing | |
| denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape | |
| `(batch_size, num_frames, channels, height, width)`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Output class for LTX pipelines. | |
| </div> | |