Model Overview

Description:

Cosmos-Embed1 is a joint video-text embedder tailored for physical AI. It can be used for text-to-video retrieval, inverse video search, semantic deduplication, zero-shot and k-nearest-neighbors (kNN) classification, and as a base model for video curation tasks. It has state-of-the-art (SOTA) performance on autonomous vehicle (AV) and robotics datasets, while maintaining competitive performance in general domains. A fine-tuned variant is also provided for video anomaly detection and classification. This model is ready for commercial use.

The Cosmos-Embed1 release includes the following embedders:

Variant	Resolution	Frames	Embedding Dim
Cosmos-Embed1-224p	224×224	8	256
Cosmos-Embed1-336p	336×336	8	768
Cosmos-Embed1-448p	448×448	8	768

Note: while each checkpoint was optimized at a specific fixed resolution (and default to these), they all support arbitrary non-square resolutions.

In addition, a fine-tuned variant is provided for anomaly detection applications:

Variant	Base Model	Resolution	Frames	Fine-tuning Dataset	Embedding Dim
Cosmos-Embed1-448p-anomaly-detection	Cosmos-Embed1-448p	448×448	8	Vad-Reasoning (training set)	768

The Cosmos-Embed1-448p-anomaly-detection model is fine-tuned from Cosmos-Embed1-448p using LoRA (Low-Rank Adaptation) on the training set of the Vad-Reasoning dataset, a video anomaly detection and reasoning dataset covering traffic, campus, urban, and other real-world scenarios. LoRA is applied to preserve the base model's generalizability while adapting it for anomaly detection, anomaly classification, and video retrieval in diverse video understanding applications.

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License. Additional Information: Apache 2.0 and MIT.

Deployment Geography:

Global

Use Case:

Physical AI developers and engineers for video-text embedding tasks including text-to-video retrieval, video-to-video search, zero-shot classification, and semantic deduplication across robotics, autonomous vehicles (AV), video anomaly detection, and diverse video understanding domains.

Release Date:

Hugging Face 06/15/2025 Model Link

References(s):

Li, Junnan, et al. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." International conference on machine learning. PMLR, 2023.

Huang, Chao, et al. "Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought." arXiv:2505.19877, 2025.

NVIDIA Cosmos-Embed1 Research Page: https://research.nvidia.com/labs/dir/cosmos-embed1/.

Model Architecture:

Architecture Type: Transformer

Network Architecture: QFormer-based video-text embedder with EVA-ViT-G visual backbone

** The architecture is based on QFormer, with modifications for processing video inputs. The video embedder processes frames individually with a ViT backbone. The per-frame ViT features are concatenated in the temporal dimension and augmented with temporal embeddings. These are then passed into the QFormer which summarizes via cross-attention a compact set of visual query tokens from the provided frames. The visual query tokens are then pooled into a single video embedding.

The text embedder processes tokenized text via the self-attention branch of the QFormer to produce a text embedding. The normalized text and video embeddings are aligned via a contrastive video-text loss, as well as auxiliary losses such as video-text matching and video captioning. For the 336p and 448p variants, additional summary and dense distillation losses are used.

** Number of model parameters: ~1.0×10^9

Input(s):

The model operates in one of two modes per forward pass — text encoding or video encoding — not both simultaneously. Each mode accepts a single input type and produces a corresponding embedding.

Input Type(s): Text or Video (one per forward pass)
Input Format(s):

Text mode: String (UTF-8)
Video mode: Tensor scaled from 0 to 1 of RGB frame sequences

Input Parameters:

Text mode: One-Dimensional (1D)
Video mode: Three-Dimensional (3D)

Other Properties Related to Input:

Text input will be truncated or padded to 128 text tokens. When used for text-to-video (T2V) retrieval, it should contain a short description of the object, scene, or action of interest.
Arbitrary, non-square resolutions are supported. Each variant defaults to its optimized resolution (224×224, 336×336, or 448×448).
The model architecture supports input videos of varying lengths, but it has been optimized for 8 frames, sampled at 1–2 frames per second (FPS).
Video tensor shape: [batch_size, channels, num_frames, height, width]

Output(s)

For each forward pass, the model produces a single 1D embedding — either a text embedding or a video embedding, depending on the input mode. The two embedding types live in a shared vector space, enabling cross-modal similarity (e.g., ranking videos by relevance to a text query via cosine similarity).

Output Type(s): Text embedding or Video embedding (one per forward pass)
Output Format(s):

Text mode: Floating-point normalized embedding vector
Video mode: Floating-point normalized embedding vector

Output Parameters:

Text mode: One-Dimensional (1D)
Video mode: One-Dimensional (1D)

Other Properties Related to Output: Continuous-valued L2-normalized feature vectors with a dimensionality of 256 (224p variant) or 768 (336p and 448p variants). A distance can be calculated between embeddings using cosine similarity. Dense intermediate feature maps are also provided for convenience.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

PyTorch
Transformer Engine (optional, for faster inference)
ONNX Runtime (via ONNX export)
NVIDIA TensorRT (via ONNX export)

Supported Hardware Microarchitecture Compatibility:

on NVIDIA Ampere
NVIDIA Hopper
NVIDIA Blackwell

Note: Cosmos-Embed1 has been tested with BF16 precision on NVIDIA Ampere and Hopper GPUs. Older GPU architectures (e.g., NVIDIA Volta) may require switching to FP32 precision.

Preferred/Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

Cosmos-Embed1-224p — 224×224, 256-dim embeddings
Cosmos-Embed1-336p — 336×336, 768-dim embeddings
Cosmos-Embed1-448p — 448×448, 768-dim embeddings
Cosmos-Embed1-448p-anomaly-detection — 448×448, 768-dim embeddings, fine-tuned on Vad-Reasoning for anomaly detection

Training, Testing, and Evaluation Datasets:

Dataset Overview

** Total Size: 2,193 videos used (1,755 training + 438 test), from the Vad-Reasoning dataset (8,641 total)
** Total Number of Datasets: 7 source datasets (UCF-Crime, XD-Violence, TAD, ShanghaiTech, UBnormal, ECVA, and internet videos) aggregated into the Vad-Reasoning dataset

The Cosmos-Embed1-448p-anomaly-detection variant is fine-tuned and evaluated on the Vad-Reasoning dataset, introduced in Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought (Huang et al., 2025). The dataset covers diverse real-world traffic, campus, urban, and similar scenarios with a fine-grained anomaly taxonomy: Human Activity Anomaly, Environments Anomaly, and Objects Anomaly — each further divided into subcategories.

Training Dataset:

** Data Modality

Video

** Video Training Data Size

Less than 10,000 Hours

Vad-Reasoning-SFT Training Set: 1,755 videos with high-quality Chain-of-Thought annotations, sourced from multiple public video anomaly detection datasets (only the SFT subset was used; the 6,448-video RL subset was not used). The original videos are split into 5-second chunks for training. Each chunk is paired with a text caption derived from the anomaly_type field in the dataset annotations.

Source Dataset	Scenario
UCF-Crime	Real-world crime and anomaly footage
XD-Violence	Violent events
TAD	Traffic anomalies
ShanghaiTech	Campus scenes
UBnormal	Urban/city scenes
ECVA	Multi-scene benchmark
Internet videos	Additional anomaly categories

The dataset contains 24 unique anomaly categories used as text captions for contrastive training and zero-shot evaluation. Video chunks that do not contain an anomaly event are labeled as "Normal" during fine-tuning:

#	Anomaly Type	#	Anomaly Type	#	Anomaly Type
1	Abuse	9	Fighting	17	Riot
2	Animals Obstructing Traffic	10	Fire	18	Robbery
3	Arson	11	Flooding or Tsunami	19	Shooting
4	Avalanche and Landslide	12	Illegal Lane Changing	20	Stealing
5	Dangerous Items	13	Illegal Parking	21	Tornado
6	Explosion	14	Obstacles on Road	22	Traffic Accidents
7	Falling	15	Pedestrian Jaywalking	23	Vandalism
8	Falling Objects	16	Red Light Violation	24	Wrong-Way Driving

** Data Collection Method by dataset

Hybrid: Human, Automated

** Labeling Method by dataset

Hybrid: Human, Automated (multi-stage annotation pipeline using proprietary models)

Properties: 1,755 video clips (Vad-Reasoning-SFT subset) covering public spaces, traffic, campus, and urban environments. Content includes anomaly categories (Human Activity, Environments, Objects) with fine-grained subcategories. No personal data or copyright-protected content is directly used; source datasets are publicly available research datasets.

Testing Dataset:

Data Collection Method by dataset:

Hybrid: Human, Automated

Labeling Method by dataset:

Hybrid: Human, Automated

Properties: The Vad-Reasoning test set was used for testing during fine-tuning. See Evaluation Dataset below for details.

Evaluation Dataset:

1. Vad-Reasoning Test Set

Vad-Reasoning Test Set: 438 videos from the same distribution as the training set, used for evaluating anomaly classification and retrieval performance. As with training, the original videos are split into 5-second chunks, and each chunk's ground-truth label is the anomaly_type from the dataset annotations.

Data Collection Method by dataset:

Hybrid: Human, Automated

Labeling Method by dataset:

Hybrid: Human, Automated

Properties: 438 video clips from public cameras and general-purpose cameras for anomaly classification and retrieval tasks. Same anomaly taxonomy as training set (Human Activity Anomaly, Environments Anomaly, Objects Anomaly).

2. Kinetics-400 Validation Set

Kinetics-400: A large-scale human action recognition dataset containing ~400 human action classes sourced from publicly available internet scale data. Each clip is approximately 10 seconds long and annotated with a single action class. The validation set contains 19,877 video clips and is used to evaluate the base Cosmos-Embed1-448p model on general-domain zero-shot action classification.

Data Collection Method by dataset:

Human

Labeling Method by dataset:

Human

Properties: 19,877 video clips (~10s each) covering 400 human action classes including human-object interactions (e.g., playing instruments) and human-human interactions (e.g., shaking hands, hugging).

Benchmark Scores:

Vad-Reasoning Test Set

Evaluated on the Vad-Reasoning test set (438 videos) using zero-shot classification via video-text similarity.

The key performance indicators are:

Top-K Hit Rate for zero-shot anomaly classification (K = 1, 5, 10) — measures whether the correct anomaly class label appears among the K nearest text embeddings for each video
MRR (Mean Reciprocal Rank) — the average reciprocal of the rank at which the correct label first appears
F1 (macro) — average per-class F1 score, the harmonic mean of per-class precision and recall
Recall (macro) — average per-class recall, treating all classes equally regardless of sample count

Metric	Cosmos-Embed1-448p	Cosmos-Embed1-448p-anomaly-detection
Top-1 Hit Rate	23.21%	46.44%
Top-3 Hit Rate	34.81%	73.95%
Top-5 Hit Rate	45.98%	83.71%
Top-10 Hit Rate	67.24%	94.50%
MRR	0.3557	0.6299
Macro F1	19.51%	38.94%

Kinetics-400 Validation Set

Evaluated on the Kinetics-400 validation set (19,877 videos) using zero-shot action classification via video-text similarity.

The key performance indicators are:

Top-K Accuracy (K = 1, 5) — fraction of videos whose ground-truth action class appears in the top-K predictions
Recall (macro) and F1 (macro) — as defined above

Metric	Cosmos-Embed1-448p	Cosmos-Embed1-448p-anomaly-detection
Top-1 Accuracy	87.96%	85.18%
Top-5 Accuracy	97.58%	96.47%
Recall (macro)	87.95%	85.17%
F1 (macro)	87.89%	85.36%

Inference:

Acceleration Engine: PyTorch, ONNX Runtime
Test Hardware:

H100
A100

How to use this model

Using HuggingFace Transformers

import decord
import numpy as np
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Cosmos-Embed1-448p", trust_remote_code=True
).to("cuda", dtype=torch.bfloat16)
preprocess = AutoProcessor.from_pretrained(
    "nvidia/Cosmos-Embed1-448p", trust_remote_code=True
)

# Load video frames
reader = decord.VideoReader("/path/to/video.mp4")
frame_ids = np.linspace(0, len(reader) - 1, 8, dtype=int).tolist()
frames = reader.get_batch(frame_ids).asnumpy()
batch = np.transpose(np.expand_dims(frames, 0), (0, 1, 4, 2, 3))  # BTCHW

captions = ["a person walking", "a car driving on a highway"]

# Compute embeddings
video_inputs = preprocess(videos=batch).to("cuda", dtype=torch.bfloat16)
video_out = model.get_video_embeddings(**video_inputs)

text_inputs = preprocess(text=captions).to("cuda", dtype=torch.bfloat16)
text_out = model.get_text_embeddings(**text_inputs)

# Rank captions by similarity
probs = torch.softmax(
    model.logit_scale.exp() * video_out.visual_proj @ text_out.text_proj.T,
    dim=-1,
)[0]
print(captions[probs.argmax()])

Fine-tuning with TAO

Cosmos-Embed1 supports fine-tuning via the TAO Toolkit. The model can be fine-tuned using full training or LoRA (Low-Rank Adaptation) for parameter-efficient training. TAO also supports:

Evaluation — top-K classification metrics (hit rate, MRR, macro precision) and embedding visualization (UMAP/t-SNE). Supports embedding caching via pkl files for fast repeated evaluation.
Inference — text-to-video and video-to-video similarity search. Given text or video queries, retrieves the K most similar items from a video corpus with similarity scores.
ONNX Export — export to ONNX for deployment in video, text, or combined mode.
HuggingFace Export — export to HuggingFace format (sharded safetensors, config, tokenizer) for sharing and reuse with the transformers library.

A Jupyter notebook with end-to-end workflow is available at

Example training spec:

model:
  pretrained_model_path: /model/Cosmos-Embed1-224p
  precision: bf16
  network:
    embed_dim: 256
    num_video_frames: 8
train:
  num_gpus: 1
  max_iter: 1000
  freeze_visual_encoder: true
  optim:
    lr: 1.0e-05

ONNX Export

The model can be exported to ONNX in three modes:

Mode	ONNX Inputs	ONNX Outputs
`video`	`videos (B,3,T,H,W)`	`video_embedding (B, embed_dim)`
`text`	`input_ids (B,seq_len)`, `attention_mask (B,seq_len)`	`text_embedding (B, embed_dim)`
`combined`	all of above	both of above

export:
  checkpoint: /results/train/cosmos_embed1_model_latest.pth
  mode: combined # video + text
  onnx_file: /results/export/cosmos_embed1_video.onnx
  batch_size: -1
  simplify: true

The exported ONNX file can be further converted to a TensorRT engine for optimized inference on NVIDIA GPUs using trtexec or the TensorRT Python API.

HuggingFace Export

The trained model can be exported to HuggingFace format, producing a self-contained directory with sharded safetensors weights, config.json compatible with AutoModel.from_pretrained(), and tokenizer files. This allows loading with the transformers library or uploading to the HuggingFace Hub.

export:
  checkpoint: /results/train/cosmos_embed1_model_latest.pth
  mode: huggingface
  hf_output_dir: /results/export_hf/cosmos_embed1_hf
  on_cpu: true

Once exported, the model can be loaded directly:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/results/export_hf/cosmos_embed1_hf",
    trust_remote_code=True,
)

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Bias

Field	Response
Participation considerations from adversely impacted groups protected classes in model design and testing:	None
Measures taken to mitigate against unwanted bias:	None

Explainability

Field	Response
Intended Task/Domain:	Joint video-text embedding for physical AI applications (robotics, autonomous vehicles (AV), search, general video understanding) and video anomaly detection and classification.
Model Type:	Transformer
Intended Users:	Physical AI developers working on robotics, autonomous vehicles (AV), search, video understanding, and video anomaly detection tasks.
Output:	L2-normalized text and video embedding vectors (256 dimensions for 224p inputs; 768 dimensions for 336p/448p inputs).
Describe how the model works:	Video frames are individually processed by a ViT backbone, temporally augmented, and compressed via QFormer cross-attention into a single video embedding. Text is processed by the QFormer self-attention branch into a text embedding. Both embeddings are aligned via contrastive learning.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:	Not Applicable
Technical Limitations & Mitigation:	Due to the training datasets being predominantly composed of short action-focused English captions, different text prompts may not be properly aligned to video data, producing suboptimal matching. The model has been optimized for 8 frames at 1–2 FPS; significantly different frame rates or video lengths may degrade performance. Fine-tuning on domain-specific data can mitigate these limitations.
Verified to have met prescribed NVIDIA quality standards:	Yes
Performance Metrics:	Retrieval metrics (top-K hit rate, mean reciprocal rank) and classification metrics (precision, recall, F1)
Potential Known Risks:	The model may produce suboptimal embeddings to unsafe text prompts.
Licensing:	GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License. Additional Information: Apache 2.0 and MIT.

Privacy

Field	Response
Generatable or reverse engineerable personal data?	No
Personal data used to create this model?	Yes.
Was consent obtained for any personal data used?	Unknown
How often is dataset reviewed?	Before Release
Is a mechanism in place to honor data subject right of access or deletion of personal data?	No, not possible with externally sourced-data
If personal data was collected for the development of the model, was it collected directly by NVIDIA?	No
If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects?	Not Applicable
If personal data was collected for the development of this AI model, was it minimized to only what was required?	Not Applicable
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model?	No
Is there provenance for all datasets used in training?	Yes
Does data labeling (annotation, metadata) comply with privacy laws?	Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made?	No, not possible with externally-sourced data.
Applicable Privacy Policy	https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Safety

Field	Response
Model Application Field(s):	Industrial/Machinery and Robotics, Video Understanding
Describe the life critical impact (if present).	This model produces video and text embeddings for tasks such as text-to-video retrieval, video-to-video search, zero-shot classification, and semantic deduplication. It is intended for use in physical AI applications encompassing robotics, autonomous vehicles (AV), and anomaly detection. The fine-tuned variant (Cosmos-Embed1-448p-finetuned_anomaly_detection) is additionally optimized for video anomaly detection and classification.
Use Case Restrictions:	Abide by the NVIDIA Open Model License. Additional Information: Apache 2.0 and MIT.
Model and dataset restrictions:	The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face, and may become available on cloud providers' model catalog.