Cosmos3-Nano-MLX

NVIDIA Cosmos 3 Nano (16B) running natively on Apple Silicon via MLX.

  • What works: text-to-video, image-to-video, text-to-image, and joint video+audio generation
  • Hardware tested: M4 Max 128GB (256p in ~38s, 480p in ~4min, 720p in ~10min; 30 steps, text KV cache)
  • What this is: a full-source MLX implementation with component-level numerical parity against the HuggingFace PyTorch reference, not a converted-weight drop

Important: Physical AI model scope

Cosmos 3 is NVIDIA's Physical AI world foundation model, designed for robotics, autonomous driving, smart spaces, and industrial simulation. It produces strong, physically coherent motion for on-distribution scenes (robot arms, dashcam driving, factory floors) but does not generalize well to arbitrary creative video prompts. This is a characteristic of the model, not a limitation of the port: NVIDIA's own playground restricts users to 9 curated physical-AI demo inputs.

Install and run

git clone https://github.com/lyonsno/cosmos3-mlx.git
cd cosmos3-mlx
uv venv && uv pip install -e ".[dev]"

# Download weights (~32GB BF16)
huggingface-cli download nvidia/Cosmos3-Nano --local-dir weights/Cosmos3-Nano

Text-to-video

from cosmos3_mlx.load import load_transformer, load_tokenizer
from cosmos3_mlx.pipeline import Cosmos3GenerationPipeline

model = load_transformer("weights/Cosmos3-Nano", reasoner_only=False)
tokenizer = load_tokenizer("weights/Cosmos3-Nano")
pipeline = Cosmos3GenerationPipeline(model=model, tokenizer=tokenizer, model_dir="weights/Cosmos3-Nano")

result = pipeline.generate(
    prompt="A car driving through a suburban intersection on a sunny day",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
)

Image-to-video

import numpy as np
from PIL import Image

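# Load the conditioning frame as an RGB uint8 array of shape (H, W, 3).
# `pipeline` is the object built in the text-to-video example above.
img = np.array(Image.open("first_frame.jpg").convert("RGB"))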
result = pipeline.generate(
    prompt="A car driving forward along a winding coastal road",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
    image=img,
)
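The example above passes the conditioning image as a raw array. Whether the pipeline resizes that image internally is not stated here, so the sketch below assumes it does not and brings the image to the target resolution first. `fit_to_resolution` is a hypothetical helper written for illustration; it is not part of cosmos3_mlx.

```python
import numpy as np
from PIL import Image


def fit_to_resolution(img: Image.Image, height: int, width: int) -> np.ndarray:
    """Center-crop to the target aspect ratio, then resize to (height, width).

    Hypothetical helper: assumes the pipeline expects an RGB uint8 array
    whose size matches the requested height and width.
    """
    img = img.convert("RGB")
    src_w, src_h = img.size
    target_ratio = width / height
    if src_w / src_h > target_ratio:
        # Source is too wide: trim equally from left and right.
        new_w = round(src_h * target_ratio)
        left = (src_w - new_w) // 2
        box = (left, 0, left + new_w, src_h)
    else:
        # Source is too tall: trim equally from top and bottom.
        new_h = round(src_w / target_ratio)
        top = (src_h - new_h) // 2
        box = (0, top, src_w, top + new_h)
    resized = img.crop(box).resize((width, height), Image.LANCZOS)
    return np.asarray(resized, dtype=np.uint8)


# Stand-in for Image.open("first_frame.jpg"): a 16:9 solid-color frame.
frame = Image.new("RGB", (640, 360), (10, 20, 30))
img = fit_to_resolution(frame, height=256, width=256)
```

The resulting `img` could then be passed as `image=img` alongside `height=256, width=256`. Center-cropping avoids stretching the scene, at the cost of discarding the edges of wide or tall inputs.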

With audio

result = pipeline.generate(
    prompt="A robot arm picks up an object from a table",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
    enable_audio=True,
)
# result["audio_latents"] → decode with decode_audio()

Numerical parity with HuggingFace PyTorch reference

Every component has been verified against the HF PyTorch implementation:

| Component | Result | Reference |
| --- | --- | --- |
| VAE decoder | Max pixel diff 0.000016, PSNR 122 dB | HF diffusers |
| VAE encoder (single-frame) | Cosine similarity 0.9998 | HF diffusers |
| VAE encoder (chunked multi-frame) | Cosine similarity 0.9999 | HF diffusers |
| Scheduler (UniPC) | Max diff 0.0000019 across 35 steps | HF diffusers |
| Transformer, t2v | Cosine 0.99992 (256p), 0.99984 (720p) | HF diffusers |
| Transformer, i2v | Cosine 0.99981–0.99990 per frame (720p) | HF diffusers |
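The three kinds of metric reported above (cosine similarity, maximum absolute difference, PSNR) can be computed along these lines. This is a minimal NumPy sketch, not the repo's actual test code: the function names are illustrative, the arrays are synthetic stand-ins for MLX and PyTorch outputs, and the PSNR data range of 1.0 is an assumption.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two tensors, flattened to vectors."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Largest element-wise absolute difference."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))


def psnr(a: np.ndarray, b: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB (data_range=1.0 is an assumption)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Synthetic stand-ins: a "reference" output and a slightly perturbed "port" output.
rng = np.random.default_rng(0)
reference = rng.random((4, 64, 64, 3))
candidate = reference + rng.normal(0.0, 1e-6, reference.shape)

print("cosine:", cosine_similarity(reference, candidate))
print("max diff:", max_abs_diff(reference, candidate))
print("PSNR (dB):", psnr(reference, candidate))
```

Cosine similarity is insensitive to overall scale, so pairing it with a maximum-difference or PSNR check guards against outputs that point the same way but differ in magnitude.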

Text KV caching gives a 9.87× speedup at 256p, because the text tokens stay constant across denoising steps and their keys and values only need to be computed once. 105 tests pass.
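The idea behind text KV caching can be illustrated with a toy single-layer attention in NumPy. This is a sketch of the general technique under simplifying assumptions, not the repo's implementation: it assumes the text keys and values depend only on the prompt (not on the latents or the timestep), and every name, shape, and the update rule are illustrative. It demonstrates that caching leaves the result unchanged; it does not reproduce the measured speedup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # toy hidden size
W_q, W_k, W_v = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend_cached(video, text_k, text_v):
    """Video queries attend over precomputed text K/V plus fresh video K/V."""
    q = video @ W_q
    k = np.concatenate([text_k, video @ W_k], axis=0)
    v = np.concatenate([text_v, video @ W_v], axis=0)
    return softmax(q @ k.T / np.sqrt(d)) @ v


def attend_uncached(video, text):
    """Same computation, but re-projects the text tokens on every call."""
    return attend_cached(video, text @ W_k, text @ W_v)


text = rng.normal(size=(77, d))    # prompt tokens: fixed for the whole run
video = rng.normal(size=(256, d))  # noisy latents: change at every step

# Project the text tokens once, before the denoising loop.
text_k, text_v = text @ W_k, text @ W_v

for step in range(30):
    update = attend_cached(video, text_k, text_v)
    video = video - 0.01 * update  # stand-in for a scheduler update
```

Because `text` never changes inside the loop, `attend_cached` and `attend_uncached` return identical results, while the cached path skips the text projections on every step.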

Quantization

| Bits | Model size | Quality |
| --- | --- | --- |
| BF16 | ~32 GB | Reference |
| 8-bit (affine, group_size=64) | ~16 GB | Visually indistinguishable from BF16 |
| 4-bit (affine, group_size=64) | ~8.5 GB | Severe degradation; not viable with standard affine quantization |
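To give a feel for why 4-bit affine quantization loses so much more precision than 8-bit, the following NumPy sketch round-trips random weights through per-group min/max quantization with `group_size=64`. It is a simplified stand-in for the general scheme, not MLX's exact kernel, and Gaussian random weights are an assumption that will not capture the outliers of a real model.

```python
import numpy as np


def affine_roundtrip(w: np.ndarray, bits: int, group_size: int = 64) -> np.ndarray:
    """Quantize and dequantize with one scale and offset per group of weights."""
    groups = w.reshape(-1, group_size).astype(np.float64)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1                       # 255 for 8-bit, 15 for 4-bit
    scale = np.maximum(hi - lo, 1e-12) / levels  # guard against constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, levels)
    return (q * scale + lo).reshape(w.shape)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, (256, 1024))  # synthetic weight matrix

errors = {}
for bits in (8, 4):
    errors[bits] = float(np.abs(affine_roundtrip(w, bits) - w).mean())
    print(f"{bits}-bit mean abs error: {errors[bits]:.2e}")
```

With 15 levels instead of 255, each group's step size is 17 times coarser at 4-bit, so the rounding error grows by roughly the same factor. How that error translates into visible degradation depends on the model.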

Performance

| Resolution | Frames | Time | Memory |
| --- | --- | --- | --- |
| 256×256 | 16 | ~38s | ~32GB (BF16) |
| 256×256 | 32 | ~131s | ~32GB (BF16) |
| 480p (832×480) | 16 | ~252s | ~32GB (BF16) |
| 720p (1280×720) | 16 | ~591s | ~32GB (BF16) |

All timings on M4 Max 128GB with text KV caching enabled. BF16 requires 32GB+ unified memory. 8-bit quantization reduces model size to ~16GB (24GB Mac minimum with VAE + activations).
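The model sizes quoted above follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: it takes the 16B figure at face value and ignores quantization scales and biases, any layers left unquantized, the VAE, and activations, which is presumably why the measured 4-bit size (~8.5 GB) sits above the naive figure.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes), excluding all overheads."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 16e9  # "16B" from the model description

sizes = {}
for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    sizes[label] = weight_gb(NUM_PARAMS, bits)
    print(f"{label}: ~{sizes[label]:.0f} GB of weights")
```

Headroom for the VAE and activations comes on top of these numbers, which is consistent with the 24GB minimum stated for 8-bit.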

Prior art and attribution

Prior Cosmos3-Nano MLX/quantized conversions exist on Hugging Face (e.g., Reza2kn/Cosmos3-Nano-MLX-8bit). This repo focuses on a full-source MLX implementation with reproducible component parity receipts, end-to-end generation examples across all modalities (text/image/video/audio), and explicit Apple Silicon performance and hardware bounds.

Model weights are from nvidia/Cosmos3-Nano under the OpenMDW 1.1 license.

Limitations

  • Physical AI distribution only: produces near-static output for off-distribution creative prompts (e.g., object turntables, abstract scenes)
  • 32GB+ memory at BF16: does not fit 16GB base Macs; 8-bit quantization lowers the minimum to a 24GB Mac
  • 4-bit quantization not viable: standard affine quantization degrades severely; NF4 or calibrated quantization needed for sub-16GB
  • Audio: joint denoising produces temporally synchronized sound; prompt adherence is model-dependent and best for on-distribution physical scenes

Source

Full implementation: github.com/lyonsno/cosmos3-mlx

Published by BasinShapers: maintained local-inference routes with receipts.
