Setup Instructions for Cosmos-Predict2.5-2B-Diffusers

Missing Directories

You need to create the text_encoder directory:

mkdir -p text_encoder

Then create the following config file inside it:

text_encoder/config.json

{
  "_class_name": "Reason1TextEncoder",
  "_diffusers_version": "0.34.0",
  "_name_or_path": "Qwen/Qwen2.5-VL-7B-Instruct",
  "tokenizer_type": "Qwen/Qwen2.5-VL-7B-Instruct",
  "arch_config": {
    "architectures": ["Qwen2_5_VLForConditionalGeneration"],
    "model_type": "qwen2_5_vl",
    "vocab_size": 152064,
    "hidden_size": 3584,
    "num_hidden_layers": 28,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "intermediate_size": 18944,
    "text_len": 512,
    "hidden_state_skip_layer": 0,
    "bos_token_id": 151643,
    "pad_token_id": 151643,
    "eos_token_id": 151645,
    "image_token_id": 151655,
    "video_token_id": 151656,
    "vision_token_id": 151654,
    "vision_start_token_id": 151652,
    "vision_end_token_id": 151653,
    "vision_config": null,
    "rope_theta": 1000000.0,
    "rope_scaling": {
      "type": "mrope",
      "mrope_section": [16, 24, 24]
    },
    "max_position_embeddings": 128000,
    "max_window_layers": 28,
    "embedding_concat_strategy": "mean_pooling",
    "n_layers_per_group": 5,
    "num_embedding_padding_tokens": 512,
    "attention_dropout": 0.0,
    "hidden_act": "silu",
    "initializer_range": 0.02,
    "rms_norm_eps": 1e-6,
    "use_sliding_window": false,
    "sliding_window": 32768,
    "tie_word_embeddings": false,
    "use_cache": false,
    "output_hidden_states": true,
    "torch_dtype": "bfloat16",
    "_attn_implementation": "flash_attention_2"
  }
}
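Before loading the model, it can help to sanity-check the values in this file. The sketch below is a hypothetical helper and not part of the Cosmos or Diffusers code. It assumes the usual Qwen2.5-VL conventions: the head dimension is hidden_size / num_attention_heads, the query heads divide evenly across the key/value heads, and the mrope_section entries sum to half the head dimension.

```python
import json
from pathlib import Path


def check_text_encoder_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: the checks encode assumed Qwen2.5-VL conventions,
    not rules enforced by Reason1TextEncoder itself.
    """
    arch = config.get("arch_config") or {}
    hidden = arch.get("hidden_size")
    heads = arch.get("num_attention_heads")
    kv_heads = arch.get("num_key_value_heads")

    if not all(isinstance(v, int) and v > 0 for v in (hidden, heads, kv_heads)):
        return ["hidden_size, num_attention_heads and num_key_value_heads "
                "must be positive integers"]

    problems = []
    if hidden % heads != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    if heads % kv_heads != 0:
        problems.append("num_attention_heads is not divisible by num_key_value_heads")

    # Assumption: multimodal RoPE splits half of each head's dimension
    # into the sections listed in mrope_section.
    head_dim = hidden // heads
    sections = (arch.get("rope_scaling") or {}).get("mrope_section") or []
    if sum(sections) * 2 != head_dim:
        problems.append(
            f"mrope_section sums to {sum(sections)}, expected {head_dim // 2}"
        )
    return problems


if __name__ == "__main__":
    path = Path("text_encoder/config.json")
    if path.exists():
        found = check_text_encoder_config(json.loads(path.read_text()))
        print("config looks consistent" if not found else "\n".join(found))
    else:
        print(f"{path} not found; create it first")
```

With the values shown above, the head dimension is 3584 / 28 = 128 and the mrope_section entries sum to 16 + 24 + 24 = 64, so the helper reports no problems.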

Complete Directory Structure

After setup, your model folder should look like this:

models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
├── model_index.json
├── README.md
├── SETUP_INSTRUCTIONS.md (this file)
├── text_encoder/
│   └── config.json
├── tokenizer/
│   └── config.json
├── transformer/
│   ├── config.json
│   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt
├── vae/
│   ├── config.json
│   └── tokenizer.pth
└── scheduler/
    └── scheduler_config.json
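To confirm the folder matches this layout, a short script can list whatever is missing. This is a minimal sketch: the missing_files function and the EXPECTED_FILES list are illustrative names, with the paths copied from the tree above.

```python
from pathlib import Path

# Relative paths taken from the directory tree in this document.
EXPECTED_FILES = [
    "model_index.json",
    "README.md",
    "SETUP_INSTRUCTIONS.md",
    "text_encoder/config.json",
    "tokenizer/config.json",
    "transformer/config.json",
    "transformer/81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt",
    "vae/config.json",
    "vae/tokenizer.pth",
    "scheduler/scheduler_config.json",
]


def missing_files(model_dir: str) -> list[str]:
    """Return the expected relative paths that are not regular files under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains the model folder.
    missing = missing_files("models--nvidia--Cosmos-Predict2.5-2B-Diffusers")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```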

Notes

  • The text_encoder and tokenizer configs reference Qwen/Qwen2.5-VL-7B-Instruct
  • The actual Qwen model weights (~7B parameters) must be downloaded separately from Hugging Face
  • The Reason1TextEncoder will automatically load from the Qwen checkpoint path specified in the config
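One way to fetch the Qwen weights ahead of time is snapshot_download from the huggingface_hub package. The sketch below assumes huggingface_hub is installed and that downloading into the default Hugging Face cache is enough for Reason1TextEncoder to find the checkpoint. This document does not confirm that, so treat download_qwen_weights as an illustrative helper.

```python
from typing import Optional

# Repository ID referenced by the text_encoder and tokenizer configs.
REPO_ID = "Qwen/Qwen2.5-VL-7B-Instruct"


def download_qwen_weights(local_dir: Optional[str] = None) -> str:
    """Download the Qwen checkpoint and return the local snapshot path.

    If local_dir is None, files go to the default Hugging Face cache.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    # Large download: the checkpoint holds roughly 7B parameters.
    print(download_qwen_weights())
```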