| # AutoencoderKLCogVideoX | |
| The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI. | |
| The model can be loaded with the following code snippet. | |
| ```python | |
| import torch | |
| from diffusers import AutoencoderKLCogVideoX | |
| vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda") | |
| ``` | |
| ## AutoencoderKLCogVideoX[[diffusers.AutoencoderKLCogVideoX]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.AutoencoderKLCogVideoX</name><anchor>diffusers.AutoencoderKLCogVideoX</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L958</source><parameters>[{"name": "in_channels", "val": ": int = 3"}, {"name": "out_channels", "val": ": int = 3"}, {"name": "down_block_types", "val": ": typing.Tuple[str] = ('CogVideoXDownBlock3D', 'CogVideoXDownBlock3D', 'CogVideoXDownBlock3D', 'CogVideoXDownBlock3D')"}, {"name": "up_block_types", "val": ": typing.Tuple[str] = ('CogVideoXUpBlock3D', 'CogVideoXUpBlock3D', 'CogVideoXUpBlock3D', 'CogVideoXUpBlock3D')"}, {"name": "block_out_channels", "val": ": typing.Tuple[int] = (128, 256, 256, 512)"}, {"name": "latent_channels", "val": ": int = 16"}, {"name": "layers_per_block", "val": ": int = 3"}, {"name": "act_fn", "val": ": str = 'silu'"}, {"name": "norm_eps", "val": ": float = 1e-06"}, {"name": "norm_num_groups", "val": ": int = 32"}, {"name": "temporal_compression_ratio", "val": ": float = 4"}, {"name": "sample_height", "val": ": int = 480"}, {"name": "sample_width", "val": ": int = 720"}, {"name": "scaling_factor", "val": ": float = 1.15258426"}, {"name": "shift_factor", "val": ": typing.Optional[float] = None"}, {"name": "latents_mean", "val": ": typing.Optional[typing.Tuple[float]] = None"}, {"name": "latents_std", "val": ": typing.Optional[typing.Tuple[float]] = None"}, {"name": "force_upcast", "val": ": float = True"}, {"name": "use_quant_conv", "val": ": bool = False"}, {"name": "use_post_quant_conv", "val": ": bool = False"}, {"name": "invert_scale_latents", "val": ": bool = False"}]</parameters><paramsdesc>- **in_channels** (int, *optional*, defaults to 3) -- Number of channels in the input image. | |
| - **out_channels** (int, *optional*, defaults to 3) -- Number of channels in the output. | |
| - **down_block_types** (`Tuple[str]`, *optional*, defaults to `("CogVideoXDownBlock3D", "CogVideoXDownBlock3D", "CogVideoXDownBlock3D", "CogVideoXDownBlock3D")`) -- | |
| Tuple of downsample block types. | |
| - **up_block_types** (`Tuple[str]`, *optional*, defaults to `("CogVideoXUpBlock3D", "CogVideoXUpBlock3D", "CogVideoXUpBlock3D", "CogVideoXUpBlock3D")`) -- | |
| Tuple of upsample block types. | |
| - **block_out_channels** (`Tuple[int]`, *optional*, defaults to `(128, 256, 256, 512)`) -- | |
| Tuple of block output channels. | |
| - **act_fn** (`str`, *optional*, defaults to `"silu"`) -- The activation function to use. | |
| - **sample_height** (`int`, *optional*, defaults to `480`) -- Sample input height. | |
| - **sample_width** (`int`, *optional*, defaults to `720`) -- Sample input width. | |
| - **scaling_factor** (`float`, *optional*, defaults to `1.15258426`) -- | |
| The component-wise standard deviation of the trained latent space computed using the first batch of the | |
| training set. This is used to scale the latent space to have unit variance when training the diffusion | |
| model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the | |
| diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z = 1 | |
| / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution Image | |
| Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) paper. | |
| - **force_upcast** (`bool`, *optional*, defaults to `True`) -- | |
| If enabled, forces the VAE to run in float32 for high image resolution pipelines, such as SD-XL. The VAE | |
| can be fine-tuned / trained to a lower range without losing too much precision, in which case `force_upcast` | |
| can be set to `False` - see: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix</paramsdesc><paramgroups>0</paramgroups></docstring> | |
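The two formulas given for `scaling_factor` are a multiply before the diffusion model and a divide before decoding. The following is a minimal sketch on plain Python floats rather than tensors; the helper names `scale_latents` and `unscale_latents` are hypothetical and are not part of diffusers.

```python
SCALING_FACTOR = 1.15258426  # default value listed in the class signature


def scale_latents(latents, scaling_factor=SCALING_FACTOR):
    """z = z * scaling_factor, applied before the diffusion model."""
    return [z * scaling_factor for z in latents]


def unscale_latents(latents, scaling_factor=SCALING_FACTOR):
    """z = 1 / scaling_factor * z, applied before decoding."""
    return [z / scaling_factor for z in latents]


raw = [0.5, -1.25, 2.0]
scaled = scale_latents(raw)
restored = unscale_latents(scaled)
```

Up to floating-point rounding, `restored` matches `raw`, which is the round trip the parameter description implies.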
| A VAE model with KL loss for encoding images into latents and decoding latent representations into images. Used in | |
| [CogVideoX](https://github.com/THUDM/CogVideo). | |
| This model inherits from [ModelMixin](/docs/diffusers/pr_12229/en/api/models/overview#diffusers.ModelMixin). Check the superclass documentation for its generic methods implemented | |
| for all models (such as downloading or saving). | |
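To get a feel for how much the encoder compresses a video, the sketch below estimates the latent shape from the defaults in the signature (`latent_channels=16`, `temporal_compression_ratio=4`). Two things are assumptions not stated on this page: a spatial compression factor of 8, and the first frame being encoded on its own so that `(num_frames - 1) // 4 + 1` latent frames result. `latent_shape` is a hypothetical helper, not a diffusers function.

```python
def latent_shape(
    num_frames,
    height,
    width,
    latent_channels=16,            # default from the class signature
    temporal_compression_ratio=4,  # default from the class signature
    spatial_compression_ratio=8,   # ASSUMPTION: not stated on this page
):
    """Estimate (channels, frames, height, width) of the latent for one video."""
    # ASSUMPTION: the first frame is kept separately, the rest are grouped.
    latent_frames = (num_frames - 1) // temporal_compression_ratio + 1
    return (
        latent_channels,
        latent_frames,
        height // spatial_compression_ratio,
        width // spatial_compression_ratio,
    )


shape = latent_shape(num_frames=49, height=480, width=720)
```

Under these assumptions a 49-frame clip at the default 480x720 sample size gives a latent with 13 frames at 60x90.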
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>wrapper</name><anchor>diffusers.AutoencoderKLCogVideoX.decode</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/utils/accelerate_utils.py#L43</source><parameters>[{"name": "*args", "val": ""}, {"name": "**kwargs", "val": ""}]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>wrapper</name><anchor>diffusers.AutoencoderKLCogVideoX.encode</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/utils/accelerate_utils.py#L43</source><parameters>[{"name": "*args", "val": ""}, {"name": "**kwargs", "val": ""}]</parameters></docstring> | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>disable_slicing</name><anchor>diffusers.AutoencoderKLCogVideoX.disable_slicing</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1141</source><parameters>[]</parameters></docstring> | |
| Disable sliced VAE decoding. If `enable_slicing` was previously enabled, this method will go back to computing | |
| decoding in one step. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>disable_tiling</name><anchor>diffusers.AutoencoderKLCogVideoX.disable_tiling</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1127</source><parameters>[]</parameters></docstring> | |
| Disable tiled VAE decoding. If `enable_tiling` was previously enabled, this method will go back to computing | |
| decoding in one step. | |
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>enable_slicing</name><anchor>diffusers.AutoencoderKLCogVideoX.enable_slicing</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1134</source><parameters>[]</parameters></docstring> | |
| Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to | |
| compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. | |
| </div> | |
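The idea behind slicing can be sketched without any tensor library: process one slice at a time and stitch the results, so peak memory depends on the slice rather than the whole batch. The sketch assumes slices are taken along the batch dimension, one sample each; `decode_in_slices` and `toy_decode` are hypothetical stand-ins, not the library's implementation.

```python
def decode_in_slices(decode_fn, batch):
    """Apply decode_fn to one sample at a time and concatenate the outputs."""
    if len(batch) <= 1:
        return decode_fn(batch)
    outputs = []
    for sample in batch:
        # Each call only ever sees a batch of size one.
        outputs.extend(decode_fn([sample]))
    return outputs


def toy_decode(latents):
    # Hypothetical stand-in for the decoder: doubles every value.
    return [[2 * v for v in sample] for sample in latents]


batch = [[1, 2], [3, 4], [5, 6]]
sliced = decode_in_slices(toy_decode, batch)
full = toy_decode(batch)
```

For a decoder that treats samples independently, the sliced and one-step results are identical; only the memory profile differs.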
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>enable_tiling</name><anchor>diffusers.AutoencoderKLCogVideoX.enable_tiling</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1091</source><parameters>[{"name": "tile_sample_min_height", "val": ": typing.Optional[int] = None"}, {"name": "tile_sample_min_width", "val": ": typing.Optional[int] = None"}, {"name": "tile_overlap_factor_height", "val": ": typing.Optional[float] = None"}, {"name": "tile_overlap_factor_width", "val": ": typing.Optional[float] = None"}]</parameters><paramsdesc>- **tile_sample_min_height** (`int`, *optional*) -- | |
| The minimum height required for a sample to be separated into tiles across the height dimension. | |
| - **tile_sample_min_width** (`int`, *optional*) -- | |
| The minimum width required for a sample to be separated into tiles across the width dimension. | |
| - **tile_overlap_factor_height** (`float`, *optional*) -- | |
| The minimum amount of overlap between two consecutive vertical tiles. This is to ensure that there are | |
| no tiling artifacts produced across the height dimension. Must be between 0 and 1. Setting a higher | |
| value might cause more tiles to be processed, slowing down the decoding process. | |
| - **tile_overlap_factor_width** (`float`, *optional*) -- | |
| The minimum amount of overlap between two consecutive horizontal tiles. This is to ensure that there | |
| are no tiling artifacts produced across the width dimension. Must be between 0 and 1. Setting a higher | |
| value might cause more tiles to be processed, slowing down the decoding process.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to | |
| compute decoding and encoding in several steps. This is useful for saving a large amount of memory and for | |
| processing larger images. | |
| </div> | |
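The trade-off between overlap and tile count can be illustrated along a single axis. This sketch assumes tiles start every `tile_size * (1 - overlap_factor)` pixels; `tile_starts` is a hypothetical helper and the numbers are illustrative, not the library's defaults.

```python
def tile_starts(length, tile_size, overlap_factor):
    """Start offsets of tiles along one axis (hypothetical helper)."""
    if not 0 <= overlap_factor < 1:
        raise ValueError("overlap_factor must be in [0, 1)")
    # ASSUMPTION: consecutive tiles are shifted by the non-overlapping part.
    stride = int(tile_size * (1 - overlap_factor))
    return list(range(0, length, stride))


low_overlap = tile_starts(length=480, tile_size=240, overlap_factor=0.25)
high_overlap = tile_starts(length=480, tile_size=240, overlap_factor=0.5)
```

With the larger overlap factor the same axis is covered by more tiles, which is why higher values can slow processing down.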
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>tiled_decode</name><anchor>diffusers.AutoencoderKLCogVideoX.tiled_decode</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1345</source><parameters>[{"name": "z", "val": ": Tensor"}, {"name": "return_dict", "val": ": bool = True"}]</parameters><paramsdesc>- **z** (`torch.Tensor`) -- Input batch of latent vectors. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a `~models.vae.DecoderOutput` instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups><rettype>`~models.vae.DecoderOutput` or `tuple`</rettype><retdesc>If return_dict is True, a `~models.vae.DecoderOutput` is returned, otherwise a plain `tuple` is | |
| returned.</retdesc></docstring> | |
| Decode a batch of images using a tiled decoder. | |
| </div> | |
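The `return_dict` switch follows a common pattern: return a small output object by default, or a plain tuple on request. The sketch below uses a hypothetical `ToyDecoderOutput` dataclass and a placeholder decode step; it illustrates the calling convention only and does not reproduce `tiled_decode` itself.

```python
from dataclasses import dataclass


@dataclass
class ToyDecoderOutput:
    """Hypothetical stand-in for `DecoderOutput`; holds the decoded sample."""

    sample: list


def toy_tiled_decode(z, return_dict=True):
    decoded = [2 * v for v in z]  # placeholder for the real decoding work
    if not return_dict:
        return (decoded,)  # plain one-element tuple
    return ToyDecoderOutput(sample=decoded)


as_object = toy_tiled_decode([1, 2, 3])
as_tuple = toy_tiled_decode([1, 2, 3], return_dict=False)
```

Callers read `as_object.sample` in the first case and `as_tuple[0]` in the second.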
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>tiled_encode</name><anchor>diffusers.AutoencoderKLCogVideoX.tiled_encode</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1271</source><parameters>[{"name": "x", "val": ": Tensor"}]</parameters><paramsdesc>- **x** (`torch.Tensor`) -- Input batch of videos.</paramsdesc><paramgroups>0</paramgroups><rettype>`torch.Tensor`</rettype><retdesc>The latent representation of the encoded videos.</retdesc></docstring> | |
| Encode a batch of images using a tiled encoder. | |
| When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several | |
| steps. This is useful to keep memory use constant regardless of image size. The end result of tiled encoding is | |
| different from non-tiled encoding because each tile is encoded independently. To avoid tiling artifacts, the | |
| tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the | |
| output, but they should be much less noticeable. | |
| </div></div> | |
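Blending overlapping tiles can be pictured as a cross-fade over the shared region. The sketch below works on one-dimensional lists and uses a linear ramp; the linear weighting and the helper name `blend_1d` are assumptions made for illustration, not a statement of how the library computes its weights.

```python
def blend_1d(prev_tile, next_tile, blend_extent):
    """Cross-fade the tail of prev_tile into the head of next_tile."""
    blend_extent = min(len(prev_tile), len(next_tile), blend_extent)
    blended = list(next_tile)
    for i in range(blend_extent):
        weight = i / blend_extent  # 0 at the start of the seam, rising toward 1
        blended[i] = prev_tile[-blend_extent + i] * (1 - weight) + next_tile[i] * weight
    return blended


prev_tile = [0.0, 0.0, 0.0, 0.0]
next_tile = [1.0, 1.0, 1.0, 1.0]
result = blend_1d(prev_tile, next_tile, blend_extent=4)
```

Instead of a hard jump from 0 to 1 at the tile boundary, the blended tile ramps up gradually, which is what makes seams less noticeable.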
| ## AutoencoderKLOutput[[diffusers.models.modeling_outputs.AutoencoderKLOutput]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.models.modeling_outputs.AutoencoderKLOutput</name><anchor>diffusers.models.modeling_outputs.AutoencoderKLOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/modeling_outputs.py#L7</source><parameters>[{"name": "latent_dist", "val": ": DiagonalGaussianDistribution"}]</parameters><paramsdesc>- **latent_dist** (`DiagonalGaussianDistribution`) -- | |
| Encoded outputs of `Encoder` represented as the mean and logvar of `DiagonalGaussianDistribution`. | |
| `DiagonalGaussianDistribution` allows for sampling latents from the distribution.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Output of AutoencoderKL encoding method. | |
| </div> | |
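Sampling from a diagonal Gaussian described by a mean and a log-variance is usually written as `z = mean + std * eps` with `std = exp(0.5 * logvar)`. The sketch below shows that on plain Python lists; `ToyDiagonalGaussian` is a hypothetical stand-in and not the diffusers `DiagonalGaussianDistribution` class.

```python
import math
import random


class ToyDiagonalGaussian:
    """Hypothetical stand-in for a diagonal Gaussian over latent values."""

    def __init__(self, mean, logvar):
        self.mean = mean
        self.logvar = logvar
        self.std = [math.exp(0.5 * lv) for lv in logvar]

    def sample(self, rng=random):
        # z = mean + std * eps, with eps drawn from a standard normal.
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]

    def mode(self):
        # The most likely value of a Gaussian is its mean.
        return list(self.mean)


dist = ToyDiagonalGaussian(mean=[0.0, 1.0], logvar=[0.0, math.log(4.0)])
z = dist.sample(random.Random(0))
```

Each call to `sample` gives a different latent, while `mode` gives a deterministic one.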
| ## DecoderOutput[[diffusers.models.autoencoders.vae.DecoderOutput]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.models.autoencoders.vae.DecoderOutput</name><anchor>diffusers.models.autoencoders.vae.DecoderOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12229/src/diffusers/models/autoencoders/vae.py#L47</source><parameters>[{"name": "sample", "val": ": Tensor"}, {"name": "commit_loss", "val": ": typing.Optional[torch.FloatTensor] = None"}]</parameters><paramsdesc>- **sample** (`torch.Tensor` of shape `(batch_size, num_channels, height, width)`) -- | |
| The decoded output sample from the last layer of the model.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Output of decoding method. | |
| </div> | |