# CosmosTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.
The model can be loaded with the following code snippet.
```python
import torch
from diffusers import CosmosTransformer3DModel

transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16)
```
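Once loaded, the module behaves like any other diffusers model. As a minimal follow-up sketch (assuming a CUDA device is available), it can be moved to the accelerator and its registered configuration inspected; constructor arguments are recorded on the `config` object by the diffusers `ConfigMixin`:

```python
transformer = transformer.to("cuda")  # assumes a CUDA device is available

# Constructor arguments are registered on the model's config
print(transformer.config.num_layers)
print(transformer.config.attention_head_dim)
```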
## CosmosTransformer3DModel[[diffusers.CosmosTransformer3DModel]]
#### diffusers.CosmosTransformer3DModel[[diffusers.CosmosTransformer3DModel]]
[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_cosmos.py#L554)
A Transformer model for video-like data used in [Cosmos](https://github.com/NVIDIA/Cosmos).
**Parameters:**
- `in_channels` (`int`, defaults to `16`): The number of channels in the input.
- `out_channels` (`int`, defaults to `16`): The number of channels in the output.
- `num_attention_heads` (`int`, defaults to `32`): The number of heads to use for multi-head attention.
- `attention_head_dim` (`int`, defaults to `128`): The number of channels in each attention head.
- `num_layers` (`int`, defaults to `28`): The number of transformer blocks to use.
- `mlp_ratio` (`float`, defaults to `4.0`): The ratio of the hidden layer size to the input size in the feedforward network.
- `text_embed_dim` (`int`, defaults to `4096`): Input dimension of the text embeddings from the text encoder.
- `adaln_lora_dim` (`int`, defaults to `256`): The hidden dimension of the Adaptive LayerNorm LoRA layer.
- `max_size` (`tuple[int, int, int]`, defaults to `(128, 240, 240)`): The maximum size of the input latent tensors in the temporal, height, and width dimensions.
- `patch_size` (`tuple[int, int, int]`, defaults to `(1, 2, 2)`): The patch size to use for patchifying the input latent tensors in the temporal, height, and width dimensions (see the arithmetic sketch after this list).
- `rope_scale` (`tuple[float, float, float]`, defaults to `(2.0, 1.0, 1.0)`): The scaling factor to use for RoPE in the temporal, height, and width dimensions.
- `concat_padding_mask` (`bool`, defaults to `True`): Whether to concatenate the padding mask to the input latent tensors.
- `extra_pos_embed_type` (`str`, *optional*, defaults to `learnable`): The type of extra positional embeddings to use. Can be one of `None` or `learnable`.
- `controlnet_block_every_n` (`int`, *optional*): Interval between transformer blocks that should receive control residuals (for example, `7` to inject after every seventh block). Required for Cosmos Transfer2.5.
- `img_context_dim_in` (`int`, *optional*): The dimension of the input image context feature vector, i.e. the `D` in `[B, N, D]`.
- `img_context_num_tokens` (`int`): The number of tokens in the image context feature vector, i.e. the `N` in `[B, N, D]`. Ignored if `img_context_dim_in` is not provided.
- `img_context_dim_out` (`int`): The output dimension of the image context projection layer. Ignored if `img_context_dim_in` is not provided.
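As a worked example of how `patch_size` relates the latent resolution to the transformer's sequence length, the sketch below is plain arithmetic over a hypothetical latent shape (the shape is an assumption chosen for illustration, not read from the checkpoint):

```python
# Illustrative arithmetic: number of patch tokens for a latent of shape
# (batch, channels, frames, height, width), using the documented defaults.
patch_size = (1, 2, 2)               # (temporal, height, width), per the config above
latent_shape = (1, 16, 16, 88, 160)  # hypothetical latent; channels match in_channels=16

_, _, t, h, w = latent_shape
pt, ph, pw = patch_size
num_tokens = (t // pt) * (h // ph) * (w // pw)
print(num_tokens)  # 16 * 44 * 80 = 56320 tokens; (16, 88, 160) is within max_size (128, 240, 240)
```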
## Transformer2DModelOutput[[diffusers.models.modeling_outputs.Transformer2DModelOutput]]
#### diffusers.models.modeling_outputs.Transformer2DModelOutput[[diffusers.models.modeling_outputs.Transformer2DModelOutput]]
[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/modeling_outputs.py#L21)
The output of [Transformer2DModel](/docs/diffusers/main/en/api/models/transformer2d#diffusers.Transformer2DModel).
**Parameters:**
- `sample` (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or `(batch_size, num_vector_embeds - 1, num_latent_pixels)` if [Transformer2DModel](/docs/diffusers/main/en/api/models/transformer2d#diffusers.Transformer2DModel) is discrete): The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability distributions for the unnoised latent pixels.
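A hedged sketch of consuming this output: the argument names below follow the common diffusers forward convention (`hidden_states`, `timestep`, `encoder_hidden_states`, `return_dict`) and are schematic rather than the verified signature, and `latents`, `timestep`, and `text_embeddings` are assumed to be appropriately shaped tensors:

```python
# Schematic forward pass; argument names follow the usual diffusers
# convention and the input tensors are assumed to exist with valid shapes.
output = transformer(
    hidden_states=latents,                  # latent video tensor
    timestep=timestep,                      # diffusion timestep
    encoder_hidden_states=text_embeddings,  # text encoder features (dim 4096)
)
denoised = output.sample  # same layout as `hidden_states`

# With return_dict=False, a plain tuple is returned instead of the output class
(denoised,) = transformer(
    hidden_states=latents,
    timestep=timestep,
    encoder_hidden_states=text_embeddings,
    return_dict=False,
)
```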