mbalaNV committed on Jun 1

Commit

8889131

0 Parent(s):

Super-squash branch 'main' using huggingface_hub

Co-authored-by: zekunhao <zekunhao@users.noreply.huggingface.co>
Co-authored-by: CYChenv <CYChenv@users.noreply.huggingface.co>
Co-authored-by: tangyue0820 <tangyue0820@users.noreply.huggingface.co>
Co-authored-by: harrim-nv <harrim-nv@users.noreply.huggingface.co>
Co-authored-by: liang1225 <liang1225@users.noreply.huggingface.co>
Co-authored-by: mli0603 <mli0603@users.noreply.huggingface.co>
Co-authored-by: mbalaNV <mbalaNV@users.noreply.huggingface.co>

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +39 -0
BIAS.md +11 -0
EXPLAINABILITY.md +16 -0
PRIVACY.md +6 -0
README.md +505 -0
SAFETY.md +11 -0
assets/benchmark-image2video-leaderboard-all-models.png +3 -0
assets/benchmark-image2video-leaderboard.png +3 -0
assets/example_first_frame.png +3 -0
assets/example_original_prompt.txt +1 -0
assets/example_output.mp4 +3 -0
assets/example_output_diffusers.mp4 +3 -0
assets/example_prompt.json +1 -0
chat_template.json +3 -0
checkpoint.json +17 -0
config.json +244 -0
generation_config.json +14 -0
merges.txt +0 -0
model.safetensors.index.json +0 -0
model_index.json +24 -0
preprocessor_config.json +21 -0
scheduler/scheduler_config.json +33 -0
scripts/gen_video.py +64 -0
scripts/upsample_prompt.py +168 -0
text_tokenizer/added_tokens.json +28 -0
text_tokenizer/chat_template.jinja +120 -0
text_tokenizer/merges.txt +0 -0
text_tokenizer/special_tokens_map.json +31 -0
text_tokenizer/tokenizer.json +3 -0
text_tokenizer/tokenizer_config.json +239 -0
text_tokenizer/vocab.json +0 -0
tokenizer.json +0 -0
tokenizer_config.json +239 -0
transformer/config.json +54 -0
transformer/diffusion_pytorch_model-00001-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00002-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00003-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00004-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00005-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00006-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00007-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00008-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00009-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00010-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00011-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00012-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00013-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00014-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00015-of-00027.safetensors +3 -0
transformer/diffusion_pytorch_model-00016-of-00027.safetensors +3 -0

.gitattributes ADDED

	@@ -0,0 +1,39 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+text_tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+*.mp4 filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+*.gif filter=lfs diff=lfs merge=lfs -text
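
As an illustration (not part of the commit), the effect of these patterns can be approximated in a few lines of Python. This is a simplified sketch: it uses `fnmatch` rather than Git's own matcher, covers only a subset of the patterns above, and the helper name `is_lfs_tracked` is hypothetical. Patterns without a slash are compared against the file name; patterns with a slash are compared against the full repo-relative path.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = [
    "*.safetensors",
    "*.mp4",
    "*.png",
    "*.gif",
    "text_tokenizer/tokenizer.json",
]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: would `path` be routed through Git LFS?

    Simplification (assumption): a pattern containing "/" is matched
    against the whole repo-relative path, any other pattern against the
    base name only. Real gitattributes matching has more rules.
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


if __name__ == "__main__":
    for p in [
        "transformer/diffusion_pytorch_model-00001-of-00027.safetensors",
        "text_tokenizer/tokenizer.json",
        "tokenizer.json",
        "config.json",
    ]:
        print(p, is_lfs_tracked(p))
```

A Git LFS pointer file is three lines long (`version`, `oid`, `size`), which is consistent with the `+3 -0` entries shown for the `.safetensors`, `.png`, and `.mp4` files in the list of changed files.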

BIAS.md ADDED

	@@ -0,0 +1,11 @@

+## Bias
+| Field | Response |
+| :---- | :---- |
+| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None. |
+| Measures taken to mitigate against unwanted bias | Training, evaluation, and testing data are curated before release to filter restricted content, including content relating to protected classes. Model behavior is evaluated across Physical AI domains — robotics, autonomous vehicles, human-centric scenes, common scenes, industry, miscellaneous, and physics-oriented benchmarks — with attention to coverage across diverse demographic and contextual characteristics that affect protected-class outcomes. |
+| Which characteristic (feature) show(s) the greatest difference in performance?: | Greatest performance differences are observed in tasks requiring long-horizon temporal consistency, fine-grained physical interactions, and embodiment-specific action generation. Performance is generally stronger on common visual reasoning and world-generation tasks than on complex multi-agent, robotics-control, or tightly synchronized multimodal generation scenarios. |
+| Which feature(s) have the worst performance overall? | Performance is generally weakest in tasks requiring long-horizon temporal consistency, precise physical interactions, embodiment-specific action control, and strict audio-visual synchronization. |
+| If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | Bias-specific methods applied during data processing include person-presence screening, demographic-taxonomy classification (age, gender, ethnicity), embedding-based diversity analysis, and dataset balancing across sources. Internal analysis surfaced: non-person scenes are more prevalent than person-centric content; demographic-taxonomy outputs on person-present samples are most frequently "uncertain" across age, gender, and ethnicity dimensions; and source-type variation, with people-centric image and video datasets showing higher demographic signal than document-, object-, robotics-, or scene-focused datasets. *(Quantitative details in the row below.)* Downstream deployments should add bias audits, fairness evaluation, red-teaming, demographically balanced fine-tuning, or counterfactual augmentation as mitigations. |
+| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | Dataset analytics pipelines, metadata distribution analysis, heuristic quality checks, embedding-based clustering, model-assisted filtering systems, and benchmark evaluation suites are used to assess statistical imbalances and identify patterns that may introduce bias into model behavior. |
+| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | These datasets, such as OpenImages-derived detection-to-NLP datasets, visual grounding and VQA datasets, document/image understanding datasets, video/action understanding datasets, and NVIDIA-created or curated visual datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein). For instance, automated person-presence screening did not identify a person in approximately 58% of visual samples analyzed across approximately 400 datasets, while person-present signals were identified in approximately 42% of analyzed samples. In the subset where person-present signals were identified, these datasets contain uneven representation splits across the measured visual taxonomies: age outputs were most frequently uncertain, followed by child and adult; gender outputs were most frequently uncertain, followed by male and female; and ethnicity outputs were most frequently uncertain, followed by Hispanic and White as the most frequent identified categories. Dataset-level results vary by source type, with people-centric image and video datasets containing higher person-present and demographic-taxonomy signals than document-, object-, robotics-, or scene-focused datasets. To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, task-specific fairness evaluation, and red-teaming, along with fine-tuning with demographically balanced datasets and counterfactual data augmentation to align with the desired model behavior. This evaluation used a baseline of 200 samples across all datasets, with larger subsets of up to 3,000 samples utilized for certain in-depth analyses, identified as optimal thresholds for maximizing embedder accuracy. |

EXPLAINABILITY.md ADDED

	@@ -0,0 +1,16 @@

+## Explainability
+| Field | Response |
+| :---- | :---- |
+| Intended Application & Domain | World reasoning and generation for Physical AI. |
+| Model Type | Mixture-of-Transformers architecture with two towers. One is an autoregressive model for Physical AI reasoning; the other is a diffusion model for Physical AI generation. |
+| Intended Users | Physical AI developers, researchers, and practitioners building or evaluating autonomous vehicle, robotics, and world-generation workflows. |
+| Output | Images, videos, audio, and action commands. |
+| Tools used to evaluate datasets to identify synthetic data and ensure data authenticity. | Dataset provenance analysis, metadata validation, watermark and artifact detection, embedding-based clustering, heuristic quality checks, and model-assisted data validation pipelines are used to identify synthetic content patterns, assess dataset authenticity, and improve data quality during dataset curation. |
+| Describe how the model works | Cosmos3 is an Omni world foundation model that generates texts, images, videos, audio, and action commands from combinations of text, images, videos, and action trajectory inputs. Input tokens from multiple modalities are packed into a shared sequence and processed by our mixture-of-transformer backbone with modality-specific output heads. |
+| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | None. |
+| Technical Limitations | The model may not follow text, image, video, audio, or action trajectory inputs accurately in challenging cases, especially where the input contains complex scene composition, unusual camera motion, multiple interacting agents, low lighting, high motion blur, or fine-grained physical interactions. Generated outputs may contain temporal inconsistency, object morphing, inaccurate 3D structure, or implausible physical dynamics. Generated audio may not accurately render intelligible speech, or maintain strict temporal and semantic alignment with the visual context. |
+| Verified to have met prescribed NVIDIA quality standards | Yes. |
+| Performance Metrics | Video generation is measured using PAIBench-G, RBench, PhysicsIQ, and the Artificial Analysis Image2Video benchmark. Image generation uses UniGenBench and the Artificial Analysis Text2Image benchmark. For transfer evaluation, we use PAIBench-C and AVBench-C. Audio generation uses internal benchmarks. Action prediction uses metrics such as action MSE, Absolute Translation Error, Relative Translation Error, Relative Rotation Error, PSNR, and robotic task completion success rate. |
+| Potential Known Risks | This model can generate synthetic media and may produce content that is offensive, unsafe, misleading, indecent, or unsuitable for a target deployment. Users should implement robust safety guardrails — including content filtering, abuse monitoring, and access controls — to reduce the risk of harmful outputs. Users are responsible for ensuring that their use of the model complies with all applicable laws and regulations, and for regularly reviewing and updating their guardrails as risks evolve. |
+| Licensing | [OpenMDW1.1](https://openmdw.ai/)  |

PRIVACY.md ADDED

	@@ -0,0 +1,6 @@

+## Privacy
+| Privacy Information |
+|---|
+| The model was trained on large-scale publicly available data that may contain images, audio-video, and text relating to people. NVIDIA collected and used this data in compliance with applicable data protection and privacy laws. This model was not designed to derive insights or otherwise learn from any personal data contained in the datasets. |
+| NVIDIA uses a combination of filters, data minimization techniques, and other guardrails to help prevent personal data from being recited by our models. We employ automated tools and data processing techniques during pre-training or training to identify and filter certain categories of personal data. For example, for text-bearing source and document components, our automated tools identified potential personal data such as person names, locations, and possible business or public-facing contact information such as email addresses and phone numbers.  We reviewed and removed any verified instances of personal data through a combination of automated filtering and human-in-the-loop validation. |
+| Please review NVIDIA's [Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) for more information. |

README.md ADDED

	@@ -0,0 +1,505 @@

+---
+license: other
+license_name: openmdw1.1-license
+license_link: >-
+  https://openmdw.ai/license/1-1/
+library_name: cosmos
+tags:
+  - nvidia
+  - cosmos
+  - cosmos3
+  - vllm-omni
+  - diffusers
+  - image-to-video
+  - video-generation
+---
+# **Cosmos 3: Omnimodal World Models for Physical AI**
+**[Model Collection](https://huggingface.co/collections/nvidia/cosmos3)** | **[Code](https://github.com/nvidia/cosmos)** | **[White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf)** | **[Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)**
+[NVIDIA Cosmos™](https://github.com/nvidia/cosmos) is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.
+# Model Overview: Cosmos3-Super-Image2Video
+## Description
+Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.
+This model is ready for commercial and non-commercial use.
+**Model Developer:** NVIDIA
+### Model Versions
+- Cosmos3-Nano:
+  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
+- Cosmos3-Super:
+  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
+- Cosmos3-Nano-Policy-DROID:
+  - Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.
+- Cosmos3-Super-Image2Video:
+  - Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.
+- Cosmos3-Super-Text2Image:
+  - Given text input, generate high-fidelity images that are consistent with the provided description.
+### License
+This model is released under the [OpenMDW1.1](https://openmdw.ai/license/1-1/) license.
+### Deployment Geography
+Global
+### Use Case
+Physical AI: Encompassing robotics, autonomous vehicles (AV), and smart space environments, including industrial and factory-scale applications.
+### Release Date
+Hugging Face 05/31/2026 via [https://huggingface.co/collections/nvidia/cosmos3](https://huggingface.co/collections/nvidia/cosmos3)
+GitHub 05/31/2026 via [https://github.com/nvidia/cosmos](https://github.com/nvidia/cosmos)
+## Model Architecture
+**Architecture Type:** Transformer
+**Network Architecture:** Mixture-of-Transformers (MoT)
+Cosmos3 is an Omni-modal foundation model built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. During inference, text is generated through standard next-token autoregressive decoding, while non-text modalities, such as images, video, audio, and actions, are synthesized through iterative denoising. This unified architecture enables Cosmos3 to model heterogeneous modalities within a single framework while preserving generation mechanisms best suited to each modality.
+**This model was developed based on:**  [Cosmos Framework](https://github.com/nvidia/cosmos-framework)
+**Number of trainable model parameters:**
+- Cosmos3-Nano: 16B
+- Cosmos3-Super: 64B
+- Cosmos3-Nano-Policy-DROID: 16B
+- Cosmos3-Super-Image2Video: 64B
+- Cosmos3-Super-Text2Image: 64B
+## Input/Output Specifications
+- **Generator Input**
+  - **Input Type(s)**: Text, Image, Video (with audio or without audio), Action Trajectory
+  - **Input Format(s)**:
+    - Text: String
+    - Image: jpg, png, jpeg, webp
+    - Video (with or without audio): mp4
+    - Action: json (1D list)
+  - **Input Parameters**:
+    - Text: One-dimensional (1D)
+    - Image: Two-dimensional (2D)
+    - Video: Three-dimensional (3D)
+    - Audio: One-dimensional (1D)
+    - Action trajectory: One-dimensional (1D)
+  - **Other Properties Related to Input**:
+    - For video inputs, we accept various resolutions, including 720p, 480p, and 256p.
+    - When using input video with audio muxed into the video MP4 file, the audio should have 2 channels (stereo) and a 48 kHz sample rate.
+    - Image and video inputs are RGB color (8 bits per channel, sRGB color space); grayscale inputs are not supported.
+    - Action input is a per-frame sequence of robot/agent state or control values (e.g., joint positions, gripper state, camera pose). The full input is a 2D array shaped (T, D), where T is the number of frames and D is the embodiment-specific dimensionality listed below.
+    - Input action is only supported for compatible embodiments, including general camera motion (9D), autonomous vehicle (9D), egocentric motion (57D), single Franka Panda arm with RobotiQ gripper (10D), dual Franka Panda arm with RobotiQ gripper (20D), Agibot (29D), UR (10D), Google robot (10D), WidowX 250 (10D), UMI (9D).
+  - **Input Size and Length limits:**
+    - **Text:** 4096 tokens
+    - **Image:** 256p, 480p, and 720p resolution at one of these aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16)
+    - **Video:** 256p, 480p, and 720p resolution at one of these aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16). Max number of frames = 5.
+    - **Audio:** Max 0.5 second
+    - **Action:** 16 – 400 video frames
+- **Generator Output**
+  - **Output Type(s)**: Image, video, audio, action, text
+  - **Output Format(s)**:
+    - Image: JPG
+    - Video: MP4
+    - Audio: Advanced Audio Coding (AAC) stream (muxed within the MP4)
+    - Action: 1D list (.json)
+    - Text: string
+  - **Output Parameters**:
+    - Image: Two-dimensional (2D)
+    - Video: Three-dimensional (3D)
+    - Audio: One-dimensional (1D)
+    - Action: One-dimensional (1D)
+    - Text: One-dimensional (1D)
+  - **Other Properties Related to Output**:
+    - The generated video is an MP4 file, with the resolution, frame rate, and duration specified in the input. The generated audio is encoded in AAC format, muxed into the video MP4 file with 2 channels (stereo) and a 48 kHz sample rate.
+    - Video generation supports durations from 5 to 400 frames, with 189 frames as the default generation duration.
+    - The generated action is only supported for compatible embodiments, including general camera motion (9D), autonomous vehicle (9D), egocentric motion (57D), single Franka Panda arm with RobotiQ gripper (10D), dual Franka Panda arm with RobotiQ gripper (20D), Agibot (29D), UR (10D), Google robot (10D), WidowX 250 (10D), UMI (9D).
+    - Audio: 48 kHz stereo AAC stream muxed into video mp4
+    - Video: mp4 at the FPS specified in input
+    - Image: JPEG
+- **Reasoner Input**
+  - **Input Type(s)**: Text, Text+Image, Text+Video
+  - **Input Format(s)**:
+    - Text: String
+    - Image: jpg, png, jpeg, webp
+    - Video: mp4
+  - **Input Parameters**:
+    - Text: One-dimensional (1D)
+    - Image: Two-dimensional (2D)
+    - Video: Three-dimensional (3D)
+  - **Other Properties Related to Input**:
+    - Video inputs are recommended at a frame rate of 4 fps.
+    - Long-context inputs supported up to 256K tokens.
+  - **Input Size and Length limits:**
+    - **Text:** Up to 256K tokens (context window).
+    - **Image:** Standard input image formats; passed as file or URL.
+    - **Video:** mp4 at the recommended 4 fps.
+- **Reasoner Output**
+  - **Output Type(s)**: Text
+  - **Output Format(s)**:
+    - Text: string
+  - **Output Parameters**:
+    - Text: One-dimensional (1D)
+  - **Other Properties Related to Output**:
+    - A `max_tokens` value of 4096 or more is recommended for reasoning outputs; longer outputs may be requested.
+    - Reasoning outputs may include structured chain-of-thought, 2D/3D point localization, and bounding-box coordinates for vision-based tasks.
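
As an illustration (not part of the README), the action-input constraints listed above can be checked before a request is sent. This is a minimal sketch under stated assumptions: the per-embodiment dimensions and the 16 to 400 frame limit are taken from the specification above, while the dictionary keys and the `validate_action` helper are hypothetical names chosen for this example.

```python
# Per-frame action dimensionality D, taken from the embodiment list above.
# The keys are hypothetical labels; the model's real identifiers may differ.
ACTION_DIMS = {
    "camera_motion": 9,
    "autonomous_vehicle": 9,
    "egocentric_motion": 57,
    "franka_single": 10,
    "franka_dual": 20,
    "agibot": 29,
    "ur": 10,
    "google_robot": 10,
    "widowx_250": 10,
    "umi": 9,
}

# Frame-count limits for action inputs, as stated in the size limits above.
MIN_FRAMES, MAX_FRAMES = 16, 400


def validate_action(action, embodiment):
    """Check that `action` is a (T, D) nested list valid for `embodiment`.

    Returns the (T, D) shape on success and raises ValueError otherwise.
    """
    if embodiment not in ACTION_DIMS:
        raise ValueError(f"unsupported embodiment: {embodiment!r}")
    expected_dim = ACTION_DIMS[embodiment]
    num_frames = len(action)
    if not MIN_FRAMES <= num_frames <= MAX_FRAMES:
        raise ValueError(
            f"expected {MIN_FRAMES} to {MAX_FRAMES} frames, got {num_frames}"
        )
    for index, row in enumerate(action):
        if len(row) != expected_dim:
            raise ValueError(
                f"frame {index}: expected {expected_dim} values, got {len(row)}"
            )
    return (num_frames, expected_dim)


if __name__ == "__main__":
    # 16 frames of a 10-dimensional single-arm action.
    trajectory = [[0.0] * 10 for _ in range(16)]
    print(validate_action(trajectory, "franka_single"))
```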
+The video content visualizes the input text description as a short animated scene, capturing key elements within the specified time constraints.
+Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+## Software Integration
+**Runtime Engine(s):**
+- [PyTorch](https://github.com/nvidia/cosmos3)
+- [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
+- [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
+**Supported Hardware Microarchitecture Compatibility:**
+- NVIDIA Ampere
+- NVIDIA Blackwell
+- NVIDIA Hopper
+**Operating System(s):**
+- Linux (We have not tested on other operating systems.)
+**Note:** Only BF16 precision is tested. Other precisions like FP4, FP8, and FP16 are not officially supported.
+The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
+## Training, Testing, and Evaluation Datasets
+### Dataset Overview
+- **Total Size:** 1.3B data points
+- **Total Number of Datasets:** 393 dataset entries
+- **Dataset partition:** Training [100%], Testing [N/A — evaluation benchmarks used separately], Validation [N/A — evaluation benchmarks used separately]
+- **Time period for training data collection:** 2024–2026
+- **Time period for testing data collection:** N/A (standard public benchmarks)
+- **Time period for validation data collection:** N/A (standard public benchmarks)
+Raw data from internal and external sources is transformed into training-ready data through multiple stages of curation, filtering, and quality review. Data acquisition spans diverse multimodal sources — robotics, autonomous driving, industrial environments, indoor and outdoor scenes, varied lighting and weather conditions, camera viewpoints, object categories, and human activities — to broaden coverage across Physical AI operating environments. Automated filtering pipelines remove corrupted, duplicate, low-quality, and restricted content. Metadata analysis, heuristic rules, and model-assisted classifiers are applied during preprocessing to flag anomalous distributions and low-diversity subsets. Human review supplements automated filtering for selected datasets, benchmark construction, and targeted quality analysis. Datasets are balanced across modalities and task categories — visual reasoning, text-to-image, text-to-video, image-to-video, audio generation, video transfer, action-conditioned generation, and action command generation — to reduce overrepresentation of narrow domains. Synthetic and simulation-based augmentation supplements coverage of rare physical interactions and edge-case scenarios. Deduplication and provenance tracking are applied across the corpus. The resulting processed data is converted into model-ready tokenized or encoded representations through modality-specific preprocessors before training begins.
+Training datasets passed through multiple layers of automated and manual safeguards designed to reduce the presence of harmful or policy-violating content across categories including weapons and weapons-related instructional content, criminal planning, child sexual abuse material (CSAM), non-consensual intimate imagery (NCII), sexual content involving minors, harassment, hate speech, profanity, threats and incitement to violence, self-harm or suicide-related content, and graphic violence. Data sources are reviewed for licensing compatibility, provenance, and alignment with internal data governance and safety policies before admission into training corpora. Automated filtering pipelines combine multiple detection strategies: hash-matching against known CSAM and NCII reference databases; classifier-based moderation models trained for explicit sexual content, hate speech, violence, weapons imagery, and other restricted categories; keyword and regex-based screening for criminal-planning, threats, and self-harm phrases in text data; metadata and provenance heuristics for source-level risk signals; and embedding-based anomaly detection to surface samples that fall outside expected distributions. Human review and targeted audits supplement automated filtering for selected datasets, benchmark construction, and safety-sensitive evaluation. For multimodal Physical AI data (robotics, autonomous driving, industrial scenes), additional filtering targets invalid action trajectories, physically implausible interactions, and unsafe control sequences. Synthetic and simulation-generated data are evaluated through internal validation before inclusion. Benchmark evaluations and red-team testing are applied post-training to surface remaining safety gaps across world generation, reasoning, audio, and action tasks. 
+No large-scale data-filtering process can guarantee complete removal of all harmful content; residual risks may remain, particularly in rare edge cases or open-world deployment settings. Ongoing monitoring and dataset review continue post-release.
+**Data Modality and Training Data Size**
+| Modality | Reasoning Data Sample Count | Generation Data Sample Count |
+| -------- | ------------------- | -------------------- |
+| Text     | 22M                 | Not Applicable       |
+| Image    | 19M                 | 767M                 |
+| Video    | 1M                  | 348M                 |
+| Audio    | Not Applicable      | 139M                 |
+| Action   | Not Applicable      | 8M                   |
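
As an illustration (not part of the README), summing the sample counts in the table above gives a figure in line with the 1.3B data points reported in the dataset overview. Whether the reported total was actually derived this way is an assumption; the sketch only checks that the numbers are mutually consistent.

```python
# Sample counts in millions, copied from the table above.
reasoning = {"text": 22, "image": 19, "video": 1}
generation = {"image": 767, "video": 348, "audio": 139, "action": 8}

reasoning_total = sum(reasoning.values())      # 42 (million)
generation_total = sum(generation.values())    # 1262 (million)
total_millions = reasoning_total + generation_total  # 1304 (million)

if __name__ == "__main__":
    print(f"reasoning:  {reasoning_total}M")
    print(f"generation: {generation_total}M")
    print(f"total:      {total_millions / 1000:.1f}B")  # prints "total:      1.3B"
```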
+**Data Collection Method by dataset**
+- Hybrid: Automatic/Sensors, Synthetic, Automated
+**Labeling Method by dataset**
+- Hybrid: Human, Automated
+**Properties:** The training, testing, and evaluation datasets consist of diverse multimodal video, image, audio, action, synthetic, and sensor-conditioned data sourced from NVIDIA-owned data and publicly available, commercially permissive datasets. These datasets are curated to exclude known restricted content and to support building an Omni model that learns to generate and reason about dynamic physical environments across world reasoning and generation tasks.
+### Public Datasets
+| Dataset | Samples |
+|---|---|
+| OpenImage | 1.2M |
+| Coyo700M | 100M |
+| YouTube Video | 340M |
+| UMI | 4.5M |
+### Private Datasets
+| Dataset | Samples |
+|---|---|
+| Egocentric | 7M |
+| Nexar | 0.6M |
+| AgiBot | 0.2M |
+| HOI | 0.3M |
+### Synthetic Datasets
+| Dataset | Samples |
+|---|---|
+| synthetic images generated using HiDream-I1 | 15M |
+| synthetic images generated using Qwen-Image-2512 | 14M |
+| synthetic captions generated using Qwen3-VL | 1115M |
+## Evaluation Datasets
+**Data Collection Method by dataset**
+- Hybrid: Automatic/Sensors, Synthetic, Automated
+**Labeling Method by dataset**
+- Hybrid: Human, Automated
+**Properties:** The training, testing, and evaluation datasets consist of diverse multimodal video, image, audio, action, synthetic, and sensor-conditioned data sourced from NVIDIA-owned data and publicly available, commercially permissive datasets. These datasets are curated to exclude known restricted content and to support building an Omni model that learns to generate and reason about dynamic physical environments across world reasoning and generation tasks.
+## Benchmarks
+Please see our [technical paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf) for detailed evaluations of the base model.
+### Artificial Analysis Leaderboard
+#### Open-Source Models [2026/05/28]
+![Artificial Analysis Image-to-Video leaderboard (no audio) — open-source models](assets/benchmark-image2video-leaderboard.png)
+#### All Models [2026/05/28] (Including Closed-Source)
+![Artificial Analysis Image-to-Video leaderboard (no audio) — all models including closed-source](assets/benchmark-image2video-leaderboard-all-models.png)
+## Usage
+- See [Cosmos](https://github.com/nvidia/cosmos) for details.
+### Prompt upsampling
+For optimal quality, text prompts should be upsampled into a specific JSON structure. Description and code can be found [here](https://github.com/nvidia/cosmos-framework/blob/main/docs/prompt_upsampling.md).
+For example, for prompt upsampling using Opus-4.7:
+```bash
+git clone https://github.com/NVIDIA/cosmos-framework.git packages/cosmos-framework
+pip install -e packages/cosmos-framework
+export PROMPT_UPSAMPLER_ENDPOINT_URL="https://api.anthropic.com/v1/"
+export PROMPT_UPSAMPLER_MODEL_NAME="claude-opus-4-7"
+export PROMPT_UPSAMPLER_API_TOKEN="<your_token>"
+python -m cosmos_framework.inference.prompt_upsampling \
+    --input assets/example_original_prompt.txt \
+    --image-url assets/example_first_frame.png \
+    --output /tmp/upsampled_posttrain_i2v/ \
+    --mode posttrain_image2video \
+    --endpoint-url "${PROMPT_UPSAMPLER_ENDPOINT_URL}" \
+    --model "${PROMPT_UPSAMPLER_MODEL_NAME}" \
+    --api-token "${PROMPT_UPSAMPLER_API_TOKEN}" \
+    --resolution 480 \
+    --aspect-ratio "16,9" \
+    --duration 7s
+```
+The JSON-upsampled version of `assets/example_original_prompt.txt` is saved in `assets/example_prompt.json` for convenience, and is used for the video generation examples below.
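As a rough illustration of the structure described above, the sketch below unpacks a prompt file shaped like `assets/example_prompt.json`, in which the `prompt` value is itself a JSON-encoded string. The field names (`temporal_caption`, `duration`, `fps`, `resolution`, `aspect_ratio`) are taken from that example file. The helper name `parse_upsampled_prompt` and the shortened inline sample are hypothetical, and the real schema may carry additional fields.

```python
import json


def parse_upsampled_prompt(raw: str) -> dict:
    """Decode an upsampled prompt whose "prompt" value is a nested JSON string."""
    outer = json.loads(raw)
    inner = json.loads(outer["prompt"])  # second decode: JSON stored inside JSON
    return {
        "caption": inner["temporal_caption"],
        "duration": inner["duration"],
        "fps": float(inner["fps"]),
        "height": int(inner["resolution"]["H"]),
        "width": int(inner["resolution"]["W"]),
        "aspect_ratio": inner["aspect_ratio"],
        "negative_prompt": outer.get("negative_prompt", ""),
    }


# Hypothetical, shortened sample with the same shape as assets/example_prompt.json.
sample = json.dumps(
    {
        "prompt": json.dumps(
            {
                "temporal_caption": "A waffle cone lies on sunlit asphalt as the ice cream melts.",
                "duration": "7s",
                "fps": 24.0,
                "resolution": {"H": 480, "W": 832},
                "aspect_ratio": "16,9",
            }
        ),
        "negative_prompt": "Overall, the video is of poor quality.",
    }
)

parsed = parse_upsampled_prompt(sample)
print(parsed["width"], parsed["height"], parsed["fps"])
```

Note that the inner string is what gets passed as the `prompt` argument at generation time; decoding it, as done here, is only useful for inspection or validation.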
+### vLLM-Omni
+#### Container
+```bash
+docker pull vllm/vllm-omni:cosmos3
+```
+#### General Invocation
+You can use the release-tested `vllm-omni` package for deploying an OpenAI-compatible API inference endpoint.
+The recommended vLLM-Omni serving configuration for nvidia/Cosmos3-Super-Image2Video on 8xH200, 8xH100, or 8xA100 is:
+```bash
+vllm serve nvidia/Cosmos3-Super-Image2Video \
+  --omni \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --cfg-parallel-size 2 \
+  --ulysses-degree 4 \
+  --use-hsdp \
+  --hsdp-shard-size 8 \
+  --init-timeout 1800
+```
+With this configuration, video generation with 50 steps should take approximately 55 seconds on H200 GPUs. For 2xH200, one can simply use `--cfg-parallel-size 2 --use-hsdp --hsdp-shard-size 2`, and a video should take approximately 3 minutes to generate. Tensor parallelism is also supported by setting `--tensor-parallel-size`. Setting `--enable-layerwise-offload` can help reduce memory usage on GPUs with less available memory.
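The two layouts mentioned above can be sanity-checked with a little arithmetic. This sketch assumes that the number of GPUs needed is the product of the CFG-parallel, Ulysses, and tensor-parallel degrees, and that the HSDP shard size should divide the GPU count. Both are assumptions about how vLLM-Omni composes these flags rather than documented behaviour, and `required_gpus` and `check_layout` are hypothetical helpers.

```python
def required_gpus(cfg_parallel: int = 1, ulysses: int = 1, tensor_parallel: int = 1) -> int:
    """Assumed rule: parallel degrees multiply to give the world size."""
    return cfg_parallel * ulysses * tensor_parallel


def check_layout(num_gpus: int, hsdp_shard: int, **degrees: int) -> bool:
    """Return True if the degrees fill the node and the shard size divides it."""
    return required_gpus(**degrees) == num_gpus and num_gpus % hsdp_shard == 0


# 8-GPU recommendation: --cfg-parallel-size 2 --ulysses-degree 4 --hsdp-shard-size 8
eight_gpu_ok = check_layout(8, 8, cfg_parallel=2, ulysses=4)
# 2xH200 variant: --cfg-parallel-size 2 --hsdp-shard-size 2
two_gpu_ok = check_layout(2, 2, cfg_parallel=2)
print(eight_gpu_ok, two_gpu_ok)
```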
+#### Download example prompts and scripts
+The inference scripts (`scripts/`) and example inputs (`assets/`) live in this model repo. Download just those folders with the Hugging Face CLI:
+```bash
+pip install -U "huggingface_hub[cli]"
+hf download nvidia/Cosmos3-Super-Image2Video scripts/ assets/ \
+    --local-dir Cosmos3-Super-Image2Video
+cd Cosmos3-Super-Image2Video
+```
+Run all commands below from the downloaded repo root.
+#### Example: image to video generation
+Generate a video from a first-frame image and a JSON-format prompt by calling a vLLM-Omni endpoint:
+```bash
+python scripts/gen_video.py \
+    --endpoint <endpoint-url> \
+    --prompt-file assets/example_prompt.json \
+    --image-path assets/example_first_frame.png \
+    --output-path scripts/output.mp4
+```
+Or, as a minimal standalone script:
+```python
+import json
+import requests
+# 1. Read JSON-upsampled prompt (prompt + negative_prompt)
+with open("assets/example_prompt.json", encoding="utf-8") as prompt_file:
+    json_prompt = json.load(prompt_file)
+# 2. Build your API payload
+payload = {
+    "prompt": json_prompt["prompt"],
+    "negative_prompt": json_prompt["negative_prompt"],
+    "size": "832x480",
+    "num_frames": 189,
+    "fps": 24,
+    "num_inference_steps": 50,
+    "guidance_scale": 6.0,
+    "flow_shift": 5.0,
+    "extra_params": json.dumps(
+        {
+          "use_resolution_template": False,
+          "use_duration_template": False,
+          "guardrails": True,
+        }
+    ),
+}
+files = {"input_reference": ("input.png", open("assets/example_first_frame.png", "rb"), "image/png")}
+# 3. Send the POST request
+url = "http://localhost:8000/v1/videos/sync"
+print("Sending request to server...")
+response = requests.post(url, data=payload, files=files, headers={"Accept": "video/mp4"})
+response.raise_for_status()
+# 4. Save the returned MP4 bytes
+with open("/tmp/cosmos3_i2v.mp4", "wb") as video_file:
+    video_file.write(response.content)
+print("Saved video to /tmp/cosmos3_i2v.mp4")
+```
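The request parameters above imply a clip length and a latent size that can be worked out by hand. The sketch assumes the usual causal video-VAE convention (the first frame is encoded on its own, and each following group of 4 frames becomes one latent frame), together with the `temporal_compression_factor` of 4 and `spatial_compression_factor` of 16 listed in `config.json`. The helper names are hypothetical, and the actual pipeline may pad or patchify differently.

```python
TEMPORAL_FACTOR = 4   # temporal_compression_factor from config.json
SPATIAL_FACTOR = 16   # spatial_compression_factor from config.json


def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback length of the generated clip in seconds."""
    return num_frames / fps


def latent_shape(num_frames: int, height: int, width: int) -> tuple:
    """Latent (frames, height, width) under the assumed causal-VAE convention."""
    if (num_frames - 1) % TEMPORAL_FACTOR != 0:
        raise ValueError("num_frames should have the form 4k + 1")
    if height % SPATIAL_FACTOR or width % SPATIAL_FACTOR:
        raise ValueError("height and width should be multiples of 16")
    return (
        (num_frames - 1) // TEMPORAL_FACTOR + 1,
        height // SPATIAL_FACTOR,
        width // SPATIAL_FACTOR,
    )


seconds = clip_seconds(189, 24)        # 7.875
latents = latent_shape(189, 480, 832)  # (48, 30, 52)
print(seconds, latents)
```

Under these assumptions, 189 frames at 24 fps play for just under 8 seconds, which is slightly longer than the `"duration": "7s"` value recorded in the example prompt.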
+Example output generated from `assets/example_first_frame.png`:
+<video controls width="832" height="480" src="https://huggingface.co/nvidia/Cosmos3-Super-Image2Video/resolve/main/assets/example_output.mp4"></video>
+#### Inferencing with custom prompts
+Cosmos3-Super-Image2Video uses JSON-format prompts for optimal quality. The recommended way to produce them is [cosmos-framework](#prompt-upsampling). For convenience, we also provide a simple proof-of-concept script, which requires access to a VLM behind an OpenAI-compatible endpoint, such as `claude-opus-4.7` or `gpt-5.5`.
+```bash
+export PROMPT_UPSAMPLER_API_KEY="..."
+python scripts/upsample_prompt.py \
+    --model-name <model> \
+    --base-url <VLM-endpoint-url> \
+    --image-path assets/example_first_frame.png \
+    --user-prompt "The ice cream melts and gradually disappears. The camera moves around." \
+    --output-path scripts/upsampled.json
+```
+### Diffusers
+Cosmos3 is supported as an inference backend in Hugging Face Diffusers. Developers can incorporate Cosmos3's capabilities, such as image-to-video generation, into their pipelines through the `Cosmos3OmniPipeline` class, as the code example in this section demonstrates (see the HuggingFace Cosmos3 page for examples of other modalities).
+#### Installation
+To install diffusers with Cosmos3OmniPipeline:
+```bash
+uv venv --python 3.13 --seed --managed-python
+source .venv/bin/activate
+uv pip install \
+  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
+  accelerate \
+  av \
+  cosmos_guardrail \
+  huggingface_hub \
+  imageio \
+  imageio-ffmpeg \
+  torch \
+  torchvision \
+  transformers
+```
+#### Example: image to video generation with Diffusers
+The following example generates a video in approximately 170 seconds on a single GB200.
+```python
+import json
+import torch
+from diffusers import Cosmos3OmniPipeline, UniPCMultistepScheduler
+from diffusers.utils import export_to_video, load_image
+pipe = Cosmos3OmniPipeline.from_pretrained(
+    "nvidia/Cosmos3-Super-Image2Video",
+    torch_dtype=torch.bfloat16,
+    device_map="cuda",
+    enable_safety_checker=True,
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=5.0)
+image = load_image("assets/example_first_frame.png")
+# JSON-format prompt (see scripts/upsample_prompt.py to build your own).
+spec = json.load(open("assets/example_prompt.json"))
+prompt = spec["prompt"]
+negative_prompt = spec["negative_prompt"]
+result = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    image=image,
+    num_frames=189,
+    height=480,
+    width=832,
+    fps=24.0,
+    num_inference_steps=50,
+    guidance_scale=6.0,
+    add_resolution_template=False,
+    add_duration_template=False,
+)
+export_to_video(result.video, "output.mp4", fps=24, quality=7, macro_block_size=1)
+```
+Example output generated by Diffusers:
+<video controls width="832" height="480" src="https://huggingface.co/nvidia/Cosmos3-Super-Image2Video/resolve/main/assets/example_output_diffusers.mp4"></video>
+## Limitations
+Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
+Cosmos3 outputs should not be treated as physically accurate simulation, reliable ground-truth reasoning, or safety-certified decision making. Applications involving robotics control, autonomous systems, scientific simulation, or safety-critical planning require additional validation, external constraints, system-level safety analysis, and domain-specific guardrails before deployment.
+## Inference
+**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
+**Test Hardware:** GB200 and H100
+## Ethical Considerations
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.
+Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
+For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](EXPLAINABILITY.md), [Bias](BIAS.md), [Safety & Security](SAFETY.md), and [Privacy](PRIVACY.md) subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

SAFETY.md ADDED Viewed

	@@ -0,0 +1,11 @@

+## Safety & Security
+| Field | Response |
+| :---- | :---- |
+| Model Application(s) | World reasoning and generation for Physical AI. |
+| Describe the life critical impact: | This model is not a safety-certified component and must not be used as the sole basis for life-critical decisions or control without additional system-level validation, safety analysis, and safeguards. The model is not designed or tested by NVIDIA for use in any system or application where the use of or failure of such system or application developed with the model could result in injury, death, or catastrophic damage. NVIDIA is not liable to any party, in whole or in part, for any claims or damages arising from those uses. Any system or application developed with the model must include sufficient safety and redundancy features and comply with applicable legal and regulatory standards and requirements. |
+| Description of methods implemented in data acquisition or processing, if any, to address other types of potentially harmful data in the training, testing, and validation data:  | Training, evaluation, and validation datasets pass through multi-stage automated and manual filtering to reduce harmful, unsafe, restricted, or policy-violating content. Pipelines include source-licensing review, deduplication, metadata-based and classifier-based moderation, embedding-based anomaly detection, and human audits on selected datasets. For Physical AI data (robotics, autonomous driving, industrial scenes), filtering also targets invalid action trajectories, physically implausible interactions, and unsafe control sequences. Synthetic and simulation-generated data are evaluated through internal validation before inclusion. Benchmark and red-team testing surface remaining safety gaps across world generation, reasoning, audio, and action tasks. No data-filtering process can guarantee complete removal; developers are responsible for application-specific safeguards and validation before deployment. |
+| Description of any methods implemented in data acquisition or processing, if any, to address illegal or harmful content in the training data, including, but not limited to, child sexual abuse material (CSAM) and non-consensual intimate imagery (NCII)  | In addition to the general unsafe-content filtering described above, training data acquisition and preprocessing apply CSAM- and NCII-specific safeguards: hash-matching systems against known CSAM databases, classifier-based moderation models trained specifically for explicit content and NCII detection, and provenance and licensing review for sources containing human imagery. Identified content is removed at ingest, with human review and targeted audits supplementing automated filtering for selected datasets. Despite these safeguards, no large-scale data-filtering system can guarantee complete detection. Ongoing monitoring and dataset review continue post-release. |
+| Use Case Restrictions | Use is governed by the [OpenMDW1.1](https://openmdw.ai/) license. |
+| Model and dataset restrictions | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access restrictions are enforced during training, and dataset license constraints are adhered to. |
+| Responsible Data Handling | This AI model was developed based on our policies to ensure responsible data handling and risk mitigation. The datasets used for training have been scanned for harmful content and illegal content, consistent with our policies including scanning for Child Sexual Abuse Material (CSAM). Ongoing review and monitoring mechanisms are in place based on our policies and to maintain data integrity. |

assets/benchmark-image2video-leaderboard-all-models.png ADDED Viewed

Git LFS Details

SHA256: 185bfd9042b58c449b3328833193cb1a51782fbe49e492039eeb637e9f6c0fb6
Pointer size: 131 Bytes
Size of remote file: 522 kB

assets/benchmark-image2video-leaderboard.png ADDED Viewed

Git LFS Details

SHA256: 265cdc6ab57ceb15c60356fa0a328299aead5dfd43c32ae354a0aa6fa03bc72f
Pointer size: 131 Bytes
Size of remote file: 419 kB

assets/example_first_frame.png ADDED Viewed

Git LFS Details

SHA256: 677259954dfc05b6dd62fbc1d8c4064544e247f86d0df458214e6e8097341988
Pointer size: 132 Bytes
Size of remote file: 1.96 MB

assets/example_original_prompt.txt ADDED Viewed

	@@ -0,0 +1 @@


+ The ice cream melts and gradually disappears. The camera moves around.

assets/example_output.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9c7df4dc671a3237c33aed82c47e1840b403c43b11ecbb6987d34284f3bf6cbe
+size 6948901

assets/example_output_diffusers.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8c60550d54d9ce3cec15a00c4c484da930ee020510caa02323edeb3e6528173
+size 3987298

assets/example_prompt.json ADDED Viewed

	@@ -0,0 +1 @@

+ {"prompt": "{\"temporal_caption\": \"A fallen waffle cone lies on rough sunlit asphalt in an extreme low close-up, with a rounded scoop of vanilla and chocolate ice cream pressed against the road, a small melted puddle already spreading beneath it, and dry autumn leaves scattered around in warm late-afternoon light. The viewpoint stays near ground level and begins a slow, smooth arc around the cone from left to right, keeping the melting scoop dominant while the background street and leaves remain softly blurred. As the sun warms the ice cream, the glossy edges soften first, thin rivulets of vanilla and chocolate slide down the curved scoop, and the existing puddle widens into the cracks and pebbled texture of the asphalt. The waffle cone remains mostly rigid but grows slightly damp at the rim touching the ice cream, while the scoop loses its rounded shape, slumps lower, and exposes more of the cone’s open mouth. The moving viewpoint continues its gentle orbit, revealing the chocolate side thinning into streaks and the vanilla side collapsing into pale liquid that creeps outward under gravity. By the end, most of the ice cream has flattened into a shallow glossy stain that drains into small road fissures and spreads out of the immediate area, leaving the cone lying in place with only thin cream-colored and brown traces clinging to the asphalt in the warm light.\", \"duration\": \"7s\", \"fps\": 24.0, \"resolution\": {\"H\": 480, \"W\": 832}, \"aspect_ratio\": \"16,9\"}", "negative_prompt": "The video captures a series of frames showing macroblocking artifacts, chromatic aberration, high-frequency noise, and rolling shutter distortion. 
It includes static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, bit-depth compression artifacts, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, hard cut, visual noise, and flickering. It features moiré patterns, edge halos, and temporal aliasing. Furthermore, the content defies common sense, generating illogical scenarios, nonsensical entities, absurd character behaviors, and conceptual paradoxes that violate basic human reasoning and everyday reality. The video looks like a surreal or glitchy hallucination. Overall, the video is of poor quality."}

chat_template.json ADDED Viewed

	@@ -0,0 +1,3 @@

+{
+    "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0].role == 'system' %}\n        {%- if messages[0].content is string %}\n            {{- messages[0].content }}\n        {%- else %}\n            {%- for content in messages[0].content %}\n                {%- if 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '\\n\\n' }}\n    {%- endif %}\n    {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0].role == 'system' %}\n        {{- '<|im_start|>system\\n' }}\n        {%- if messages[0].content is string %}\n            {{- messages[0].content }}\n        {%- else %}\n            {%- for content in messages[0].content %}\n                {%- if 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set image_count = namespace(value=0) %}\n{%- set video_count = namespace(value=0) %}\n{%- for message in messages %}\n    {%- if message.role == \"user\" %}\n        {{- '<|im_start|>' + message.role + '\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content in message.content %}\n                {%- if content.type == 'image' or 'image' in content or 'image_url' in 
content %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n                    <|vision_start|><|image_pad|><|vision_end|>\n                {%- elif content.type == 'video' or 'video' in content %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n                    <|vision_start|><|video_pad|><|vision_end|>\n                {%- elif 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role + '\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content_item in message.content %}\n                {%- if 'text' in content_item %}\n                    {{- content_item.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and message.content) or (not loop.first) %}\n                    {{- '\\n' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- '<tool_call>\\n{\"name\": \"' }}\n                {{- tool_call.name }}\n                {{- '\", \"arguments\": ' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- '}\\n</tool_call>' }}\n            {%- endfor %}\n        {%- 
endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content in message.content %}\n                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n                    <|vision_start|><|image_pad|><|vision_end|>\n                {%- elif content.type == 'video' or 'video' in content %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n                    <|vision_start|><|video_pad|><|vision_end|>\n                {%- elif 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n"
+}

checkpoint.json ADDED Viewed

	@@ -0,0 +1,17 @@

+{
+  "checkpoint_cache_dir": null,
+  "checkpoint_hf": null,
+  "checkpoint_path": "/lustre/fsw/portfolios/cosmos/projects/cosmos_base_training/users/zhao/tmp/i4_checkpoint",
+  "checkpoint_type": "dcp",
+  "config_file": "/lustre/fsw/portfolios/cosmos/projects/cosmos_base_training/users/zhao/imaginaire4/cosmos3/configs/model/Cosmos3-Super-Image2Video.yaml",
+  "config_file_type": "yaml",
+  "credential_path": "credentials/gcp_checkpoint.secret",
+  "experiment": "",
+  "experiment_overrides": [
+    "model.config.diffusion_expert_config.load_weights_from_pretrained=False",
+    "model.config.vlm_config.pretrained_weights.enabled=False",
+    "checkpoint.load_from_object_store.enabled=False"
+  ],
+  "model_memory_bytes": null,
+  "use_ema_weights": true
+}

config.json ADDED Viewed

	@@ -0,0 +1,244 @@

+{
+  "allow_patterns_overrides": [
+    "*/*.safetensors"
+  ],
+  "architectures": [
+    "Cosmos3ForConditionalGeneration"
+  ],
+  "image_token_id": 151655,
+  "model": {
+    "_recursive_": false,
+    "_target_": "projects.cosmos3.vfm.models.omni_mot_model.OmniMoTModel",
+    "config": {
+      "_type": "projects.cosmos3.vfm.configs.base.defaults.model_config.OmniMoTModelConfig",
+      "action_gen": false,
+      "activation_checkpointing": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.activation_checkpointing.ActivationCheckpointingConfig",
+        "determinism_check": "default",
+        "mode": "full",
+        "preserve_rng_state": true,
+        "save_ops_regex": [
+          "fmha"
+        ]
+      },
+      "causal_training_strategy": "none",
+      "compile": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.compile.CompileConfig",
+        "compile_dynamic": true,
+        "compiled_region": "language",
+        "coordinate_descent_tuning": false,
+        "enabled": true,
+        "max_autotune_pointwise": false,
+        "use_cuda_graphs": false
+      },
+      "diffusion_expert_config": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.model_config.DiffusionExpertConfig",
+        "base_fps": 16,
+        "enable_fps_modulation": true,
+        "load_weights_from_pretrained": true,
+        "max_vae_latent_side_after_patchify": 20,
+        "patch_spatial": 2,
+        "position_embedding_type": "unified_3d_mrope",
+        "rope_h_extrapolation_ratio": 1.0,
+        "rope_t_extrapolation_ratio": 1.0,
+        "rope_w_extrapolation_ratio": 1.0,
+        "timestep_range": 1.0,
+        "unified_3d_mrope_reset_spatial_ids": true,
+        "unified_3d_mrope_temporal_modality_margin": 15000
+      },
+      "ema": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.ema.EMAConfig",
+        "enabled": false,
+        "iteration_shift": 0,
+        "rate": 0.1
+      },
+      "fixed_step_sampler_config": null,
+      "input_caption_key": "ai_caption",
+      "input_image_key": "images",
+      "input_video_key": "video",
+      "joint_attn_implementation": "two_way",
+      "latent_downsample_factor": 16,
+      "lbl": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.model_config.LBLConfig",
+        "coeff_gen": null,
+        "coeff_und": null,
+        "method": "local"
+      },
+      "log_enc_time_every_n": 100,
+      "lora_alpha": 32,
+      "lora_enabled": false,
+      "lora_rank": 16,
+      "lora_target_modules": "q_proj_moe_gen,k_proj_moe_gen,v_proj_moe_gen,o_proj_moe_gen",
+      "max_action_dim": 32,
+      "max_num_tokens_after_packing": 45056,
+      "natten_parameter_list": null,
+      "net": null,
+      "num_embodiment_domains": 32,
+      "parallelism": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.parallelism.ParallelismConfig",
+        "cfg_parallel_shard_degree": 1,
+        "context_parallel_shard_degree": 1,
+        "data_parallel_replicate_degree": 1,
+        "data_parallel_shard_degree": 16,
+        "enable_inference_mode": false,
+        "fsdp_master_dtype": "float32"
+      },
+      "precision": "bfloat16",
+      "rectified_flow_inference_config": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.model_config.RectifiedFlowInferenceConfig",
+        "num_train_timesteps": 1000,
+        "scheduler_type": "unipc",
+        "shift": 1,
+        "use_dynamic_shifting": false
+      },
+      "rectified_flow_training_config": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.model_config.RectifiedFlowTrainingConfig",
+        "action_loss_weight": 10.0,
+        "high_sigma_ratio": 0.05,
+        "high_sigma_timesteps_max": 1000,
+        "high_sigma_timesteps_min": 995,
+        "image_loss_scale": null,
+        "independent_action_schedule": false,
+        "independent_sound_schedule": false,
+        "loss_scale": 1.0,
+        "normalize_loss_by_active": false,
+        "shift": {
+          "256": 1,
+          "480": 3,
+          "704": 5,
+          "720": 5
+        },
+        "shift_action": null,
+        "shift_sound": null,
+        "sound_loss_scale": null,
+        "train_time_action_distribution": "logitnormal",
+        "train_time_image_distribution": "logitnormal",
+        "train_time_sound_distribution": "logitnormal",
+        "train_time_video_distribution": "ltx2",
+        "train_time_weight": "uniform",
+        "use_discrete_rf": false,
+        "use_dynamic_shift": false,
+        "use_high_sigma_strategy": false,
+        "use_high_sigma_strategy_action": false,
+        "use_high_sigma_strategy_sound": false
+      },
+      "resolution": "480",
+      "sound_dim": null,
+      "sound_gen": false,
+      "sound_latent_fps": 25,
+      "sound_tokenizer": null,
+      "state_ch": 48,
+      "state_t": 300,
+      "tokenizer": {
+        "_target_": "projects.cosmos3.vfm.tokenizers.wan2pt2_vae_4x16x16.Wan2pt2VAEInterface",
+        "bucket_name": "bucket",
+        "causal": true,
+        "chunk_duration": 93,
+        "encode_bucket_multiple": null,
+        "encode_chunk_frames": {
+          "256": 68,
+          "480": 24,
+          "720": 12
+        },
+        "encode_exact_durations": null,
+        "keep_decoder_cache": false,
+        "object_store_credential_path_pretrained": "credentials/gcp_training.secret",
+        "spatial_compression_factor": 16,
+        "temporal_compression_factor": 4,
+        "temporal_window": null,
+        "use_streaming_encode": false,
+        "vae_path": "pretrained/tokenizers/video/wan2pt2/Wan2.2_VAE.pth"
+      },
+      "video_temporal_causal": false,
+      "vision_gen": true,
+      "vlm_config": {
+        "_type": "projects.cosmos3.vfm.configs.base.defaults.vlm.VLMConfig",
+        "layer_module": null,
+        "model_instance": {
+          "_target_": "projects.cosmos3.vfm.models.mot.unified_mot.Qwen3VLTextForCausalLM",
+          "config": {
+            "_target_": "projects.cosmos3.vfm.configs.base.defaults.vlm.create_vlm_config",
+            "base_config": {
+              "_target_": "projects.cosmos3.vfm.models.mot.unified_mot.Qwen3VLMoTConfig.from_json_file",
+              "json_file": "projects/cosmos3/vfm/models/vlm/qwen3_vl/configs/Qwen3-VL-32B-Instruct.json"
+            },
+            "qk_norm_for_text": true
+          }
+        },
+        "model_name": "nvidia/Cosmos3-Super-Reasoner",
+        "pretrained_weights": {
+          "_type": "projects.cosmos3.vfm.configs.base.defaults.vlm.PretrainedWeightsConfig",
+          "backbone_path": "s3://bucket/cosmos3/pretrained/huggingface/Cosmos-Reason/Cosmos3-Super-Reasoner-b6df0d1/",
+          "checkpoint_format": null,
+          "credentials_path": "credentials/gcp_checkpoint.secret",
+          "enable_gcs_patch_in_boto3": true,
+          "enabled": true
+        },
+        "qk_norm": false,
+        "tie_word_embeddings": false,
+        "tokenizer": {
+          "_target_": "projects.cosmos3.vfm.configs.base.defaults.vlm.create_qwen2_tokenizer_with_download",
+          "config_variant": "gcp",
+          "pretrained_model_name": "Qwen/Qwen3-VL-32B-Instruct"
+        },
+        "use_system_prompt": false
+      }
+    }
+  },
+  "model_type": "cosmos3_omni",
+  "text_config": {
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "dtype": "bfloat16",
+    "eos_token_id": 151645,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 5120,
+    "initializer_range": 0.02,
+    "intermediate_size": 25600,
+    "max_position_embeddings": 262144,
+    "model_type": "qwen3_vl_text",
+    "num_attention_heads": 64,
+    "num_hidden_layers": 64,
+    "num_key_value_heads": 8,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "mrope_interleaved": true,
+      "mrope_section": [
+        24,
+        20,
+        20
+      ],
+      "rope_type": "default"
+    },
+    "rope_theta": 5000000,
+    "use_cache": true,
+    "vocab_size": 151936
+  },
+  "tie_word_embeddings": false,
+  "transformers_version": "4.57.0.dev0",
+  "video_token_id": 151656,
+  "vision_config": {
+    "deepstack_visual_indexes": [
+      8,
+      16,
+      24
+    ],
+    "depth": 27,
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1152,
+    "in_channels": 3,
+    "initializer_range": 0.02,
+    "intermediate_size": 4304,
+    "model_type": "qwen3_vl",
+    "num_heads": 16,
+    "num_position_embeddings": 2304,
+    "out_hidden_size": 5120,
+    "patch_size": 16,
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "vision_end_token_id": 151653,
+  "vision_start_token_id": 151652
+}
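
The `text_config` values above imply a grouped-query attention layout. The sketch below works through that arithmetic. It assumes the usual Qwen-style convention: query width is `num_attention_heads * head_dim`, and the KV cache holds one key and one value vector per KV head per layer in bfloat16. The helper name `attention_shapes` is made up for this note and is not part of the repository.

```python
# Illustrative arithmetic only; the numbers are copied from text_config above.
TEXT_CONFIG = {
    "hidden_size": 5120,
    "head_dim": 128,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "num_hidden_layers": 64,
    "mrope_section": [24, 20, 20],
}
BYTES_PER_BF16 = 2  # "dtype": "bfloat16"


def attention_shapes(cfg: dict) -> dict:
    """Derive projection widths and per-token KV-cache size (hypothetical helper)."""
    q_width = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_width = cfg["num_key_value_heads"] * cfg["head_dim"]
    # Factor 2 accounts for storing both keys and values in every layer.
    kv_bytes_per_token = 2 * cfg["num_hidden_layers"] * kv_width * BYTES_PER_BF16
    return {
        "q_width": q_width,
        "kv_width": kv_width,
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        "kv_bytes_per_token": kv_bytes_per_token,
        "mrope_total": sum(cfg["mrope_section"]),
    }


shapes = attention_shapes(TEXT_CONFIG)
print(shapes)
```

Under these assumptions the query projection is wider than `hidden_size` (8192 vs. 5120), and the three `mrope_section` entries add up to half of `head_dim`, which matches how multi-axis rotary embeddings split the rotary frequency pairs.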

generation_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+    "bos_token_id": 151643,
+    "pad_token_id": 151643,
+    "do_sample": true,
+    "eos_token_id": [
+        151645,
+        151643
+    ],
+    "top_p": 0.8,
+    "top_k": 20,
+    "temperature": 0.7,
+    "repetition_penalty": 1.0,
+    "transformers_version": "4.56.0"
+}
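
To make the sampling fields above concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and then top-p (nucleus) filtering. It is an approximation for illustration, not the `transformers` implementation: the function name `filter_probs` and the exact ordering and renormalization steps are assumptions of this note.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature < 1 sharpens the distribution before any filtering.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in order)
    k_probs = {i: probs[i] / k_mass for i in order}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break
    p_mass = sum(k_probs[i] for i in kept)
    return {i: k_probs[i] / p_mass for i in kept}


print(filter_probs([2.0, 1.0, 0.0, -1.0]))
```

With `repetition_penalty` at 1.0 no penalty is applied, so it is omitted from the sketch.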

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors.index.json ADDED

The diff for this file is too large to render. See raw diff

model_index.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "_class_name": "Cosmos3OmniDiffusersPipeline",
+  "_diffusers_version": "0.38.0",
+  "scheduler": [
+    "diffusers",
+    "UniPCMultistepScheduler"
+  ],
+  "text_tokenizer": [
+    "transformers",
+    "Qwen2TokenizerFast"
+  ],
+  "transformer": [
+    "diffusers",
+    "Cosmos3OmniTransformer"
+  ],
+  "vae": [
+    "diffusers",
+    "AutoencoderKLWan"
+  ],
+  "vision_encoder": [
+    "transformers",
+    "Qwen3VLVisionModel"
+  ]
+}
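
The `model_index.json` above pairs each pipeline component with a `[library, class]` entry, while underscore-prefixed keys carry metadata. A small sketch of reading that layout with only the standard library follows. The function name `list_components` is hypothetical, and the JSON is inlined here rather than read from disk.

```python
import json

# Copied from the model_index.json hunk above.
MODEL_INDEX = """
{
  "_class_name": "Cosmos3OmniDiffusersPipeline",
  "_diffusers_version": "0.38.0",
  "scheduler": ["diffusers", "UniPCMultistepScheduler"],
  "text_tokenizer": ["transformers", "Qwen2TokenizerFast"],
  "transformer": ["diffusers", "Cosmos3OmniTransformer"],
  "vae": ["diffusers", "AutoencoderKLWan"],
  "vision_encoder": ["transformers", "Qwen3VLVisionModel"]
}
"""


def list_components(index_text: str) -> dict:
    """Map component name -> (library, class), skipping metadata keys (hypothetical helper)."""
    index = json.loads(index_text)
    return {
        name: tuple(value)
        for name, value in index.items()
        if not name.startswith("_")
    }


components = list_components(MODEL_INDEX)
for name, (library, class_name) in components.items():
    print(f"{name}: {library}.{class_name}")
```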

preprocessor_config.json ADDED

	@@ -0,0 +1,21 @@

+{
+    "size": {
+        "longest_edge": 16777216,
+        "shortest_edge": 65536
+    },
+    "patch_size": 16,
+    "temporal_patch_size": 2,
+    "merge_size": 2,
+    "image_mean": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "image_std": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "processor_class": "Qwen3VLProcessor",
+    "image_processor_type": "Qwen2VLImageProcessorFast"
+}
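
As a rough illustration of what the preprocessor fields imply: a mean and standard deviation of 0.5 map pixel values from [0, 1] to [-1, 1], and a 16-pixel patch with a merge size of 2 means each merged visual token covers a 32x32 pixel cell. The sketch assumes pixel values are rescaled to [0, 1] before normalization and that the image dimensions are already multiples of 32. Neither is stated in the diff, and both helper names are invented for this note.

```python
PATCH_SIZE = 16   # "patch_size"
MERGE_SIZE = 2    # "merge_size"
IMAGE_MEAN = 0.5  # same value for all three channels
IMAGE_STD = 0.5


def normalize(pixel: float) -> float:
    """Normalize one channel value that is assumed to already lie in [0, 1]."""
    return (pixel - IMAGE_MEAN) / IMAGE_STD


def merged_token_count(width: int, height: int) -> int:
    """Count merged visual tokens for one frame (hypothetical helper)."""
    cell = PATCH_SIZE * MERGE_SIZE
    if width % cell or height % cell:
        raise ValueError(f"width and height must be multiples of {cell}")
    return (width // cell) * (height // cell)


print(normalize(0.0), normalize(1.0))
print(merged_token_count(832, 480))
```

The 832x480 input is only an illustrative size taken from the generation scripts in this commit; whether the first frame is fed to this preprocessor at that size is not shown here.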

scheduler/scheduler_config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "_class_name": "UniPCMultistepScheduler",
+  "_diffusers_version": "0.38.0",
+  "beta_end": 0.02,
+  "beta_schedule": "linear",
+  "beta_start": 0.0001,
+  "disable_corrector": [],
+  "dynamic_thresholding_ratio": 0.995,
+  "final_sigmas_type": "zero",
+  "flow_shift": 1.0,
+  "lower_order_final": true,
+  "num_train_timesteps": 1000,
+  "predict_x0": true,
+  "prediction_type": "flow_prediction",
+  "rescale_betas_zero_snr": false,
+  "sample_max_value": 1.0,
+  "shift_terminal": null,
+  "sigma_max": 200.0,
+  "sigma_min": 0.147,
+  "solver_order": 2,
+  "solver_p": null,
+  "solver_type": "bh2",
+  "steps_offset": 0,
+  "thresholding": false,
+  "time_shift_type": "exponential",
+  "timestep_spacing": "linspace",
+  "trained_betas": null,
+  "use_beta_sigmas": false,
+  "use_dynamic_shifting": false,
+  "use_exponential_sigmas": false,
+  "use_flow_sigmas": true,
+  "use_karras_sigmas": true
+}
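
The `flow_shift` field above defaults to 1.0. The sketch below shows the static shift commonly applied to flow-matching noise levels, `shift * sigma / (1 + (shift - 1) * sigma)`, to illustrate why 1.0 leaves the schedule unchanged and why larger values push sampling toward high noise. It is an assumption that this pipeline applies the shift in exactly this form, and `shift_sigma` is a name invented for this note.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Apply a static flow shift to a noise level in [0, 1] (assumed form)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Ten evenly spaced noise levels from 1.0 down to 0.1.
base = [i / 10 for i in range(10, 0, -1)]

for shift in (1.0, 5.0):
    shifted = [round(shift_sigma(s, shift), 3) for s in base]
    print(f"shift={shift}: {shifted}")
```

With a shift of 5.0, a mid-schedule level such as 0.5 moves to roughly 0.83, so more of the step budget is spent at higher noise.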

scripts/gen_video.py ADDED

	@@ -0,0 +1,64 @@

+"""Minimal image-to-video generation against a vLLM-Omni endpoint (sync mode).
+Run from the Cosmos3-Super-Image2Video repo root:
+    python scripts/gen_video.py \
+        --endpoint <endpoint-url> \
+        --prompt-file assets/example_prompt.json \
+        --image-path assets/example_first_frame.png \
+        --output-path scripts/output.mp4
+"""
+import argparse
+import json
+from pathlib import Path
+import requests
+# Fixed generation settings: 16:9 480p, 189 frames @ 24 fps.
+ASPECT_RATIO = "16,9"
+WIDTH = 832
+HEIGHT = 480
+NUM_FRAMES = 189
+FPS = 24
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Generate one I2V sample (sync mode).")
+    parser.add_argument("--endpoint", required=True, help="vLLM-Omni endpoint base URL.")
+    parser.add_argument("--prompt-file", type=Path, default=Path("assets/example_prompt.json"))
+    parser.add_argument("--image-path", type=Path, default=Path("assets/example_first_frame.png"))
+    parser.add_argument("--output-path", type=Path, default=Path("scripts/output.mp4"))
+    args = parser.parse_args()
+    spec = json.loads(args.prompt_file.read_text(encoding="utf-8"))
+    # Overwrite the metadata fields so they always match the fixed generation settings above.
+    prompt = json.loads(spec["prompt"])
+    prompt["duration"] = f"{int(NUM_FRAMES / FPS)}s"
+    prompt["fps"] = float(round(FPS))
+    prompt["resolution"] = {"H": HEIGHT, "W": WIDTH}
+    prompt["aspect_ratio"] = ASPECT_RATIO
+    data = {
+        "prompt": json.dumps(prompt, ensure_ascii=False),
+        "negative_prompt": spec.get("negative_prompt", ""),  # upsample_prompt.py may omit this key
+        "size": f"{WIDTH}x{HEIGHT}",
+        "num_frames": NUM_FRAMES,
+        "fps": FPS,
+        "num_inference_steps": 50,
+        "guidance_scale": 6.0,
+        "flow_shift": 5.0,
+        "extra_params": json.dumps({"use_resolution_template": False, "use_duration_template": False}),
+    }
+    files = {"input_reference": ("input.png", args.image_path.read_bytes(), "image/png")}
+    headers = {"Accept": "video/mp4", "User-Agent": "curl/8.5.0"}
+    response = requests.post(f"{args.endpoint.rstrip('/')}/v1/videos/sync", data=data, files=files, headers=headers, timeout=(10, 600))
+    response.raise_for_status()
+    args.output_path.parent.mkdir(parents=True, exist_ok=True)
+    args.output_path.write_bytes(response.content)
+    print(f"Saved video to {args.output_path} ({len(response.content) / (1024 * 1024):.1f} MB)")
+if __name__ == "__main__":
+    main()
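
The script above expects the prompt file to hold a JSON object whose `prompt` value is itself a JSON-encoded string, and it overwrites the metadata fields before sending. The following sketch reproduces just that step with an invented caption, so the nesting and the resulting values are easy to see. The function name `pin_metadata` and the sample text are hypothetical.

```python
import json

# Same fixed settings as in scripts/gen_video.py.
ASPECT_RATIO = "16,9"
WIDTH = 832
HEIGHT = 480
NUM_FRAMES = 189
FPS = 24


def pin_metadata(prompt: dict) -> dict:
    """Return a copy of the prompt with the metadata fields forced to the fixed settings."""
    pinned = dict(prompt)
    pinned["duration"] = f"{int(NUM_FRAMES / FPS)}s"  # 189 / 24 = 7.875, truncated
    pinned["fps"] = float(round(FPS))
    pinned["resolution"] = {"H": HEIGHT, "W": WIDTH}
    pinned["aspect_ratio"] = ASPECT_RATIO
    return pinned


# Hypothetical prompt file contents: note the doubly encoded "prompt" value.
example_file = json.dumps(
    {
        "prompt": json.dumps({"temporal_caption": "A dog trots across a lawn."}),
        "negative_prompt": "blurry faces, flickering",
    }
)

spec = json.loads(example_file)
pinned = pin_metadata(json.loads(spec["prompt"]))
print(json.dumps(pinned, ensure_ascii=False, indent=2))
```

Because the duration is truncated rather than rounded, 189 frames at 24 fps is reported as 7 seconds.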

scripts/upsample_prompt.py ADDED

	@@ -0,0 +1,168 @@

+"""Minimal image-to-video prompt upsampler.
+Run from the Cosmos3-Super-Image2Video repo root:
+    export PROMPT_UPSAMPLER_API_KEY="..."
+    python scripts/upsample_prompt.py \
+        --model-name <model> \
+        --base-url <VLM-endpoint-url> \
+        --image-path assets/example_first_frame.png \
+        --user-prompt "The dog flies into the outer space" \
+        --output-path scripts/upsampled.json
+"""
+import argparse
+import base64
+import json
+import mimetypes
+import os
+import re
+from pathlib import Path
+import requests
+# Fixed generation settings: 16:9 480p, 189 frames @ 24 fps.
+ASPECT_RATIO = "16,9"
+WIDTH = 832
+HEIGHT = 480
+NUM_FRAMES = 189
+FPS = 24
+MAX_TOKENS = 8192
+PROMPT_TEMPLATE = """You are an expert prompt engineer for an image-to-video generative model. You are given a STARTING FRAME image (the first frame of the video) and a USER INSTRUCTION describing the desired motion or changes to animate. Your task is to produce a dense, cinematic video description that the model will use to generate the full video, together with a customized negative prompt.
+Complete this task in two phases.
+---
+### PHASE 1: VIDEO DESCRIPTION
+Write a dense, narrative temporal_caption inside `<final_prompt>` XML tags, formatted as a JSON object using this exact template:
+{
+  "temporal_caption": "...",
+  "duration": "placeholder",
+  "fps": "placeholder",
+  "resolution": {
+    "H": "placeholder",
+    "W": "placeholder"
+  },
+  "aspect_ratio": "placeholder"
+}
+Rules for the temporal_caption:
+- The provided image is the exact starting frame - all described motion must be consistent with the starting frame.
+- Opening: Establish the scene — subjects, environment, lighting — describing what is directly visible in the starting frame accurately and faithfully, noting essential elements that the motion will directly involve, the subject's orientation (e.g., "facing away", "in three-quarter profile"), and any implied ongoing motion (e.g., a cyclist leaning into a curve, water already splashing) so the video continues smoothly. Phrase it naturally as a scene description (do not say "in the starting frame", "initially shown", or similar meta-references).
+- Motion: Describe the changes and actions in chronological order. Flow naturally from one action to the next. Advance time using natural conjunctions (e.g., "while," "as," "and").
+- Physical Accuracy: All motion must obey gravity and reflect realistic material behavior (e.g., cloth ripples, water splashes, rigid objects resist deformation).
+- Cause-and-effect: Always describe causes before their effects. Reflections, shadows, and secondary effects cannot appear on their own — the source object must first enter the frame or move into the relevant position before any reflection or shadow is described. E.g., a person must walk to the water's edge before their reflection appears on the surface; an object must strike the water before a splash erupts.
+- Object Permanence: Every subject must persist throughout or have a clear reason for entering or exiting. When a new subject not present in the starting frame is introduced (e.g., an opposing team, an arriving vehicle), briefly describe their appearance (e.g., uniform color, vehicle type and color) so the generator can render them consistently, and describe a logical way for them to come into the frame (e.g., entering from a specific side of the frame, walking in through a door, or emerging from behind an existing object) rather than having them appear out of nowhere.
+- Taboo Phrases: NEVER refer to the video medium itself. Avoid "the video shows...", "the scene...", "the clip...", "the frame...", "the camera shows...", "we see...".
+- Perspective: Describe human body sides from the subject's own perspective (e.g., "her right hand" = the subject's right hand) to avoid ambiguity. This applies whenever a body part enters or moves in the frame: always specify whether it is the left or right (e.g., "his right hand reaches in from the lower edge"), never a bare "a hand enters the frame".
+- Pronouns: Use singular pronouns ("he", "she", "him", "her", "it") or a singular noun phrase ("the person", "the rider", "the child") for single subjects. Never use "they"/"them"/"their" to refer to one person, as this can cause the model to render multiple subjects.
+- Spatial Phrasing: Use spatial relationships for motion (e.g., "enters from the left", "rises above the horizon") rather than camera-centric descriptions.
+- Camera: Include camera motion only if specified in the instruction; otherwise describe from a static viewpoint. Keep any described camera movement subtle and gradual — do not exaggerate altitude loss, tilt angle, or speed beyond what is minimally implied by the instruction. Do not use the word "transition" when describing camera motion.
+- Cinematography Terms: When the instruction references a lens, camera, or filming technique (e.g., "probe lens", "macro lens", "fisheye", "drone shot", "GoPro"), treat it as a cinematographic style describing how the footage is captured — never as a physical object visible in the scene. Mention the style (e.g., for a probe lens: extreme close shot; for a fisheye lens: extreme wide angle fisheye view) rather than mentioning the lens or camera apparatus itself.
+- Timelapse: If the instruction implies timelapse, explicitly use the word "timelapse" in the caption and avoid exaggerating its effects.
+- Cuts & Montages: Always describe a single continuous shot with no hard cuts unless the user instruction explicitly used words like "cut", "hard cut", "jump cut", "shot change", or "montage". When multiple shots are requested without specifying an exact number, describe at most 3 shots, and dedicate the majority of the description to the opening action before any cut. Never use phrases like "the first shot", "the opening shot", or number shots as "first", "second", etc. — simply describe the action directly.
+- Tone: Neutral, objective, descriptive. No opinions, value judgments, or inferred emotions unless physically observable.
+- Length & Format: Write exactly ONE coherent paragraph of 5-8 sentences. No bullet points or lists.
+USER INSTRUCTION:
+"{nl_description}"
+---
+### PHASE 2: NEGATIVE PROMPT
+Using your final video description from Phase 1, create a customized negative prompt.
+HOW IT WORKS:
+A negative prompt describes exactly what a bad video looks like. Use declarative statements (e.g., "blurry faces"). Never use negative instructions like "avoid" or "do not".
+---
+DEFAULT NEGATIVE PROMPT:
+The video captures a series of frames showing macroblocking artifacts, chromatic aberration, high-frequency noise, and rolling shutter distortion. It includes static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, bit-depth compression artifacts, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, hard cut, visual noise, and flickering. It features moiré patterns, edge halos, and temporal aliasing. Furthermore, the content defies common sense, generating illogical scenarios, nonsensical entities, absurd character behaviors, and conceptual paradoxes that violate basic human reasoning and everyday reality. The video looks like a surreal or glitchy hallucination. Overall, the video is of poor quality.
+---
+INSTRUCTIONS:
+Delete any words from the default negative prompt that contradict your intended video. Keep most of the original wording and structure intact, and do not add new items. Examples:
+* If you want scene cuts/montages -> REMOVE "jump cuts" and "hard cut".
+* If you want a motionless/static scene -> REMOVE "static with no motion".
+* If you want fantasy, sci-fi, or surrealism -> REMOVE "defies common sense", "illogical scenarios", "nonsensical entities", "surreal", and related logic-violation terms.
+* If the scene has flickering light -> REMOVE "flickering".
+* If it is a night-time timelapse -> REMOVE "motion blur".
+Output only the final negative prompt as a single paragraph, wrapped in <negative_prompt> tags. Do not output any explanation or preamble."""
+def image_to_data_url(path: Path) -> str:
+    """Encode a local image as a base64 data URL."""
+    mime = mimetypes.guess_type(path.name)[0] or "image/png"
+    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
+    return f"data:{mime};base64,{encoded}"
+def extract_tag(text: str, tag: str) -> str | None:
+    """Return the stripped inner text of the first <tag>...</tag> block, if present."""
+    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
+    return match.group(1).strip() if match else None
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Upsample an image-to-video prompt with a VLM.")
+    parser.add_argument("--image-path", type=Path, default=Path("assets/example_first_frame.png"))
+    parser.add_argument("--user-prompt", default="The dog flies into the outer space")
+    parser.add_argument("--output-path", type=Path, default=Path("scripts/upsampled.json"))
+    parser.add_argument("--model-name", required=True)
+    parser.add_argument("--base-url", required=True, metavar="<VLM-endpoint-url>")
+    return parser.parse_args()
+def invoke_vlm(image_path: Path, user_prompt: str, model_name: str, base_url: str) -> str:
+    """Call an OpenAI-compatible chat completions endpoint and return the assistant text."""
+    payload = {
+        "model": model_name,
+        "max_tokens": MAX_TOKENS,
+        "messages": [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
+                    {"type": "text", "text": PROMPT_TEMPLATE.replace("{nl_description}", user_prompt.strip())},
+                ],
+            }
+        ],
+    }
+    headers = {"Authorization": f"Bearer {os.environ['PROMPT_UPSAMPLER_API_KEY']}"}
+    response = requests.post(f"{base_url.rstrip('/')}/chat/completions", json=payload, headers=headers, timeout=(10, 600))
+    response.raise_for_status()
+    return response.json()["choices"][0]["message"]["content"]
+def main() -> None:
+    args = parse_args()
+    content = invoke_vlm(args.image_path, args.user_prompt, args.model_name, args.base_url)
+    final_prompt = extract_tag(content, "final_prompt")
+    if final_prompt is None:
+        raise RuntimeError(f"Response missing <final_prompt> block:\n{content}")
+    # Pin the output parameters post-hoc (the template leaves them as placeholders).
+    data = json.loads(final_prompt)
+    data["duration"] = f"{int(NUM_FRAMES / FPS)}s"
+    data["fps"] = float(round(FPS))
+    data["resolution"] = {"H": HEIGHT, "W": WIDTH}
+    data["aspect_ratio"] = ASPECT_RATIO
+    record: dict = {"prompt": json.dumps(data, ensure_ascii=False)}
+    negative = extract_tag(content, "negative_prompt")
+    if negative:
+        record["negative_prompt"] = negative
+    args.output_path.parent.mkdir(parents=True, exist_ok=True)
+    args.output_path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
+    print(f"PROMPT:\n{record['prompt']}")
+    print(f"\nNEGATIVE PROMPT:\n{record.get('negative_prompt', '')}")
+    print(f"\nWrote {args.output_path}")
+if __name__ == "__main__":
+    main()
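
To show how the tag extraction in the script above behaves without calling a VLM, the sketch below runs the same regular expression on a hand-written response. The response text is invented for illustration, and `Optional` is used instead of `str | None` only so the snippet also runs on Python versions before 3.10.

```python
import json
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped inner text of the first <tag>...</tag> block, if present."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical model output containing both blocks plus some preamble.
FAKE_RESPONSE = (
    "Here is the result.\n"
    "<final_prompt>\n"
    '{"temporal_caption": "A dog sits on grass.", "duration": "placeholder"}\n'
    "</final_prompt>\n"
    "<negative_prompt>blurry faces, flickering</negative_prompt>"
)

caption = json.loads(extract_tag(FAKE_RESPONSE, "final_prompt"))["temporal_caption"]
negative = extract_tag(FAKE_RESPONSE, "negative_prompt")
missing = extract_tag(FAKE_RESPONSE, "does_not_exist")
print(caption, negative, missing, sep="\n")
```

The non-greedy `(.*?)` together with `re.DOTALL` is what lets the match span multiple lines while still stopping at the first closing tag.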

text_tokenizer/added_tokens.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

text_tokenizer/chat_template.jinja ADDED

	@@ -0,0 +1,120 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' }}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- for message in messages %}
+    {%- if message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content_item in message.content %}
+                {%- if 'text' in content_item %}
+                    {{- content_item.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and message.content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
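
For the common case of a single user turn with an image and text, the template above should produce a short, predictable string. The sketch below hand-rolls that case in plain Python to make the layout visible. It is an approximation that assumes no tools, no system message, and `add_vision_id` unset. The function name `render_simple_chat` is invented, and the real template should be rendered with Jinja for anything beyond this case.

```python
IMAGE_SLOT = "<|vision_start|><|image_pad|><|vision_end|>"
VIDEO_SLOT = "<|vision_start|><|video_pad|><|vision_end|>"


def render_simple_chat(messages, add_generation_prompt=True) -> str:
    """Approximate the template for user/assistant turns only (hypothetical helper)."""
    out = []
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"role {role!r} is outside the scope of this sketch")
        out.append(f"<|im_start|>{role}\n")
        content = message["content"]
        if isinstance(content, str):
            out.append(content)
        else:
            for part in content:
                # Only user turns expand image and video parts; assistant turns keep text only.
                is_image = part.get("type") == "image" or "image" in part or "image_url" in part
                is_video = part.get("type") == "video" or "video" in part
                if role == "user" and is_image:
                    out.append(IMAGE_SLOT)
                elif role == "user" and is_video:
                    out.append(VIDEO_SLOT)
                elif "text" in part:
                    out.append(part["text"])
        out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


rendered = render_simple_chat(
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "first_frame.png"},
                {"type": "text", "text": "Describe the motion."},
            ],
        }
    ]
)
print(rendered)
```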

text_tokenizer/merges.txt ADDED

The diff for this file is too large to render. See raw diff

text_tokenizer/special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

text_tokenizer/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+size 11422654

text_tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,239 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

text_tokenizer/vocab.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,239 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0].role == 'system' %}\n        {%- if messages[0].content is string %}\n            {{- messages[0].content }}\n        {%- else %}\n            {%- for content in messages[0].content %}\n                {%- if 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '\\n\\n' }}\n    {%- endif %}\n    {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0].role == 'system' %}\n        {{- '<|im_start|>system\\n' }}\n        {%- if messages[0].content is string %}\n            {{- messages[0].content }}\n        {%- else %}\n            {%- for content in messages[0].content %}\n                {%- if 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set image_count = namespace(value=0) %}\n{%- set video_count = namespace(value=0) %}\n{%- for message in messages %}\n    {%- if message.role == \"user\" %}\n        {{- '<|im_start|>' + message.role + '\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content in message.content %}\n                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n                    <|vision_start|><|image_pad|><|vision_end|>\n                {%- elif content.type == 'video' or 'video' in content %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n                    <|vision_start|><|video_pad|><|vision_end|>\n                {%- elif 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role + '\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content_item in message.content %}\n                {%- if 'text' in content_item %}\n                    {{- content_item.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and message.content) or (not loop.first) %}\n                    {{- '\\n' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- '<tool_call>\\n{\"name\": \"' }}\n                {{- tool_call.name }}\n                {{- '\", \"arguments\": ' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- '}\\n</tool_call>' }}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content in message.content %}\n                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n                    <|vision_start|><|image_pad|><|vision_end|>\n                {%- elif content.type == 'video' or 'video' in content %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n                    <|vision_start|><|video_pad|><|vision_end|>\n                {%- elif 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
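
The tokenizer config above lists each added token twice: once by ID in `added_tokens_decoder` and once by name in `additional_special_tokens`, `eos_token`, or `pad_token`. A minimal sketch of a consistency check follows. It uses a trimmed excerpt of the values shown in the diff, and the helper names `token_ids` and `unflagged_special_tokens` are hypothetical, not part of any library.

```python
# Trimmed excerpt of tokenizer_config.json as shown in the diff above.
# Only a few entries are kept; a real check would load the full file.
config = {
    "added_tokens_decoder": {
        "151643": {"content": "<|endoftext|>", "special": True},
        "151644": {"content": "<|im_start|>", "special": True},
        "151645": {"content": "<|im_end|>", "special": True},
        "151657": {"content": "<tool_call>", "special": False},
        "151667": {"content": "<think>", "special": False},
    },
    "additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
    "eos_token": "<|im_end|>",
    "pad_token": "<|endoftext|>",
}


def token_ids(cfg):
    """Map token text to its integer ID (hypothetical helper)."""
    return {
        entry["content"]: int(token_id)
        for token_id, entry in cfg["added_tokens_decoder"].items()
    }


def unflagged_special_tokens(cfg):
    """Return tokens named as special that are not marked "special": true."""
    flagged = {
        entry["content"]
        for entry in cfg["added_tokens_decoder"].values()
        if entry["special"]
    }
    named = list(cfg["additional_special_tokens"])
    named += [cfg[key] for key in ("eos_token", "pad_token") if cfg.get(key)]
    return [token for token in named if token not in flagged]


ids = token_ids(config)
print("eos id:", ids[config["eos_token"]])
print("pad id:", ids[config["pad_token"]])
print("unflagged:", unflagged_special_tokens(config))
```

In this excerpt, `<tool_call>` and `<think>` carry `"special": false` and are not named in `additional_special_tokens`, so the check does not report them.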

transformer/config.json ADDED

	@@ -0,0 +1,54 @@

+{
+  "_class_name": "Cosmos3OmniTransformer",
+  "_diffusers_version": "0.38.0",
+  "action_dim": 32,
+  "action_gen": false,
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "base_fps": 16,
+  "dtype": "bfloat16",
+  "enable_fps_modulation": true,
+  "freeze_und": false,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 5120,
+  "initializer_range": 0.02,
+  "intermediate_size": 25600,
+  "joint_attn_implementation": "two_way",
+  "latent_channel": 48,
+  "latent_patch_size": 2,
+  "max_action_dim": 32,
+  "max_position_embeddings": 262144,
+  "model_type": "qwen3_vl_text",
+  "num_attention_heads": 64,
+  "num_embodiment_domains": 32,
+  "num_hidden_layers": 64,
+  "num_key_value_heads": 8,
+  "patch_latent_dim": 192,
+  "position_embedding_type": "unified_3d_mrope",
+  "qk_norm": false,
+  "qk_norm_for_diffusion": true,
+  "qk_norm_for_text": true,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": {
+    "mrope_interleaved": true,
+    "mrope_section": [
+      24,
+      20,
+      20
+    ],
+    "rope_type": "default"
+  },
+  "rope_theta": 5000000,
+  "sound_dim": null,
+  "sound_gen": false,
+  "sound_latent_fps": 25,
+  "temporal_compression_factor_sound": 1,
+  "timestep_scale": 0.001,
+  "unified_3d_mrope_reset_spatial_ids": true,
+  "unified_3d_mrope_temporal_modality_margin": 15000,
+  "use_cache": true,
+  "use_moe": true,
+  "video_temporal_causal": false,
+  "vocab_size": 151936
+}
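
Several fields in the transformer config above are related by simple arithmetic. The sketch below recomputes those relations from values copied out of the diff. Two of the relations are assumptions: the config does not say that `patch_latent_dim` is `latent_channel` times a square spatial patch, or that `mrope_section` partitions half of `head_dim`. The numbers shown are merely consistent with both. The function name `derived` is hypothetical.

```python
# Selected fields copied from transformer/config.json in the diff above.
cfg = {
    "hidden_size": 5120,
    "head_dim": 128,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "latent_channel": 48,
    "latent_patch_size": 2,
    "patch_latent_dim": 192,
    "rope_scaling": {"mrope_section": [24, 20, 20]},
    "vocab_size": 151936,
}


def derived(cfg):
    """Recompute quantities implied by the config (hypothetical helper)."""
    heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    return {
        # Total query width; it need not equal hidden_size.
        "q_width": heads * cfg["head_dim"],
        # Total key/value width under grouped-query attention.
        "kv_width": kv_heads * cfg["head_dim"],
        # Number of query heads sharing each key/value head.
        "queries_per_kv_head": heads // kv_heads,
        # Sum of the three rotary sections.
        "mrope_total": sum(cfg["rope_scaling"]["mrope_section"]),
        # Assumption: channels times a square spatial patch.
        "patch_dim_guess": cfg["latent_channel"] * cfg["latent_patch_size"] ** 2,
    }


d = derived(cfg)
for name, value in d.items():
    print(f"{name}: {value}")
print("patch dim matches config:", d["patch_dim_guess"] == cfg["patch_latent_dim"])
print("mrope sections fill head_dim // 2:", d["mrope_total"] == cfg["head_dim"] // 2)
```

With these values the query width (64 × 128 = 8192) is larger than `hidden_size` (5120), and each key/value head is shared by 8 query heads.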

transformer/diffusion_pytorch_model-00001-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:da7e1203782fdbcc24c3ebd965698bdb733c40de79a741a512c3e08106e0e404
+size 4932286736

transformer/diffusion_pytorch_model-00002-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:deef4952ef906fa0f4f28bf789bab114e224f6437f34ab25bbe6e1253224d750
+size 4802610968

transformer/diffusion_pytorch_model-00003-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:73e619cfc4a34499be985f19f367dcf6213d952b7a6f538f74be22c45da48f4c
+size 4949368056

transformer/diffusion_pytorch_model-00004-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4af5caa0da9e09b3db7454b71d5de50b553ac0f9a08bcf9193a08de61dcc948b
+size 4802610968

transformer/diffusion_pytorch_model-00005-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:443bb738c375588f50ef23010ac5d01c3a8cd8ad05dbd500463fdab3fd1f7b61
+size 4949368096

transformer/diffusion_pytorch_model-00006-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:86ec5a7fe509ad59512b59843972b55628298fe7c79422a29f9846758b4320f7
+size 4802611024

transformer/diffusion_pytorch_model-00007-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:277e2a2e055262b922bf20fc350331c7269cc653ab9aa98fc24bc79fd278aa8d
+size 4949368104

transformer/diffusion_pytorch_model-00008-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ad567bf9eae72d1001495df60fb9f4c1b53063517a3c7e9573efeef6b49d6182
+size 4802611024

transformer/diffusion_pytorch_model-00009-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d29572826e0f3dddf7300923157fea981784fccf77b9a3041810e5c61046c722
+size 4949368104

transformer/diffusion_pytorch_model-00010-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eb0dc522049b96cc5eb624dbf355d6eb78981795705c47ce0508823ecc9c4f1f
+size 4802611024

transformer/diffusion_pytorch_model-00011-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4eec73f0e89965bffcd8dad0da82ecfe6e89def456a31c03fde1164de44b8364
+size 4949368104

transformer/diffusion_pytorch_model-00012-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:453fbbf4ac59b0f7340b57cf3dcd73d49cae356ba2aa9bd7797449e5925cbcd9
+size 4802611024

transformer/diffusion_pytorch_model-00013-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a21745d30b07533fcc1e3bcf8283bbf0cb94141b42a359b3599d5242246e2f76
+size 4949368104

transformer/diffusion_pytorch_model-00014-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:595dfc0caa65e8c18f3e9abd700e6d2fff2774047c9dbb04e001eb0530459e7f
+size 4802611024

transformer/diffusion_pytorch_model-00015-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3da0db1ef8cba18ae3e0077ef4c5eedeb23545ede99882df5ddf2d83800581fb
+size 4949368104

transformer/diffusion_pytorch_model-00016-of-00027.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5774ddf0cb924d0d003c6013bf9a80f7d8fb8cb76bbb3e8d1ec04a373452e648
+size 4802611024