Cosmos3-Nano-MLX

NVIDIA Cosmos 3 Nano (16B) running natively on Apple Silicon via MLX.

What works: text-to-video, image-to-video, text-to-image, and joint video+audio generation
Hardware tested: M4 Max 128GB. 16 frames at 256p in ~38s, 480p in ~4min, 720p in ~10min (30 steps, text KV cache)
What this is: a full-source MLX implementation with component-level numerical parity against the HuggingFace PyTorch reference, not a converted-weight drop

Important: Physical AI model scope

Cosmos 3 is NVIDIA's Physical AI world foundation model, designed for robotics, autonomous driving, smart spaces, and industrial simulation. It produces strong, physically coherent motion for on-distribution scenes (robot arms, dashcam driving, factory floors) but does not generalize well to arbitrary creative video prompts. This is a model characteristic, not a port limitation: NVIDIA's own playground restricts inputs to 9 curated physical-AI demos.

Install and run

git clone https://github.com/lyonsno/cosmos3-mlx.git
cd cosmos3-mlx
uv venv && uv pip install -e ".[dev]"

# Download weights (~32GB BF16)
huggingface-cli download nvidia/Cosmos3-Nano --local-dir weights/Cosmos3-Nano

Text-to-video

from cosmos3_mlx.load import load_transformer, load_tokenizer
from cosmos3_mlx.pipeline import Cosmos3GenerationPipeline

model = load_transformer("weights/Cosmos3-Nano", reasoner_only=False)
tokenizer = load_tokenizer("weights/Cosmos3-Nano")
pipeline = Cosmos3GenerationPipeline(model=model, tokenizer=tokenizer, model_dir="weights/Cosmos3-Nano")

result = pipeline.generate(
    prompt="A car driving through a suburban intersection on a sunny day",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
)

Image-to-video

import numpy as np
from PIL import Image

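# Reuses `pipeline` from the text-to-video example.
# The conditioning frame is passed as an RGB uint8 array of shape (H, W, 3).
img = np.array(Image.open("first_frame.jpg").convert("RGB"))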
result = pipeline.generate(
    prompt="A car driving forward along a winding coastal road",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
    image=img,
)
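The example above requests 256×256 output, while a real first frame will usually have a different size and aspect ratio. Whether the pipeline resizes the conditioning image internally is not stated here, so the following is a minimal sketch that assumes it helps to match the target resolution beforehand. The helper name `load_conditioning_frame` is hypothetical and not part of the repo; it uses Pillow's `ImageOps.fit` to center-crop and resize.

```python
import numpy as np
from PIL import Image, ImageOps


def load_conditioning_frame(image, width, height):
    """Center-crop and resize a PIL image; return an RGB uint8 array (H, W, 3).

    Hypothetical helper: assumes the pipeline wants the image at the
    same resolution as the requested output.
    """
    fitted = ImageOps.fit(image.convert("RGB"), (width, height))
    return np.array(fitted)


# Stand-in for Image.open("first_frame.jpg"): a synthetic 640x360 frame.
frame = Image.new("RGB", (640, 360), color=(30, 90, 160))
img = load_conditioning_frame(frame, width=256, height=256)
print(img.shape, img.dtype)
```

The resulting `img` can be passed as `image=img` in the call above.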

With audio

# Reuses `pipeline` from the text-to-video example.
result = pipeline.generate(
    prompt="A robot arm picks up an object from a table",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
    enable_audio=True,
)
# result["audio_latents"] → decode with decode_audio()

Numerical parity with HuggingFace PyTorch reference

Every component has been verified against the HF PyTorch implementation:

Component	Result	Reference
VAE decoder	Max pixel diff 0.000016, PSNR 122 dB	HF diffusers
VAE encoder (single-frame)	Cosine similarity 0.9998	HF diffusers
VAE encoder (chunked multi-frame)	Cosine similarity 0.9999	HF diffusers
Scheduler (UniPC)	Max diff 0.0000019 across 35 steps	HF diffusers
Transformer, t2v	Cosine similarity 0.99992 (256p), 0.99984 (720p)	HF diffusers
Transformer, i2v	Cosine similarity 0.99981–0.99990 per frame (720p)	HF diffusers

Text KV caching gives a 9.87× speedup at 256p: the text tokens are constant across denoising steps, so their keys and values are computed once and reused. 105 tests passing.
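The idea behind text KV caching can be illustrated with a small NumPy sketch. It is not the repo's MLX code. It assumes a single attention layer in which video queries attend to text keys and values, and the class name `TextKVCache` is hypothetical. The text projections are computed on the first denoising step only, and the cached result matches a fresh computation at every later step.

```python
import numpy as np


class TextKVCache:
    """Caches key/value projections of the (constant) text tokens."""

    def __init__(self, w_k, w_v):
        self.w_k = w_k
        self.w_v = w_v
        self._kv = None
        self.projection_calls = 0  # counts how often K/V are actually computed

    def get(self, text_emb):
        if self._kv is None:
            self.projection_calls += 1
            self._kv = (text_emb @ self.w_k, text_emb @ self.w_v)
        return self._kv


def attend(video_q, k, v):
    """Softmax attention of video queries over text keys/values."""
    scores = video_q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
dim, n_text, n_video, steps = 8, 5, 12, 30
w_k = rng.standard_normal((dim, dim))
w_v = rng.standard_normal((dim, dim))
text_emb = rng.standard_normal((n_text, dim))  # fixed for the whole run

cache = TextKVCache(w_k, w_v)
for _ in range(steps):
    video_q = rng.standard_normal((n_video, dim))  # changes every step
    k, v = cache.get(text_emb)
    out_cached = attend(video_q, k, v)
    out_fresh = attend(video_q, text_emb @ w_k, text_emb @ w_v)
    assert np.allclose(out_cached, out_fresh)

print(cache.projection_calls)
```

The saving in a real model depends on how much of each step is spent on the text tokens, so this sketch shows only the mechanism, not the 9.87× figure.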

Quantization

Precision	Model size	Quality
BF16	~32 GB	Reference
8-bit (affine, group_size=64)	~16 GB	Visually indistinguishable from BF16
4-bit (affine, group_size=64)	~8.5 GB	Severe degradation — not viable with standard affine quantization
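For intuition about the gap between the 8-bit and 4-bit rows, here is a pure-Python sketch of per-group min/max affine quantization with `group_size=64`. It is an illustration under assumptions, not the repo's or MLX's implementation: the function names are hypothetical, the weights are random Gaussian values, and no bit packing is done. Going from 255 to 15 quantization levels makes the step size about 17 times coarser, and the round-trip error grows accordingly.

```python
import random


def affine_roundtrip(values, bits, group_size=64):
    """Quantize and dequantize values with per-group affine quantization."""
    levels = 2**bits - 1
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against constant groups
        for x in group:
            code = round((x - lo) / scale)      # integer code in [0, levels]
            restored.append(code * scale + lo)  # dequantized value
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(64 * 64)]
err8 = mean_abs_error(weights, affine_roundtrip(weights, bits=8))
err4 = mean_abs_error(weights, affine_roundtrip(weights, bits=4))
print(f"8-bit: {err8:.5f}  4-bit: {err4:.5f}  ratio: {err4 / err8:.1f}x")
```

Per-weight error alone does not predict visual quality, so the Quality column above remains the reference for what is usable.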

Performance

Resolution	Frames	Time	Memory
256×256	16	~38s	~32GB (BF16)
256×256	32	~131s	~32GB (BF16)
480p (832×480)	16	~252s	~32GB (BF16)
720p (1280×720)	16	~591s	~32GB (BF16)

All timings were measured on an M4 Max 128GB with text KV caching enabled. BF16 requires 32GB+ unified memory. 8-bit quantization reduces the model to ~16GB, so a 24GB Mac is the practical minimum once the VAE and activations are included.
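The model sizes quoted above can be sanity-checked with a back-of-envelope estimate. The sketch below makes three assumptions: exactly 16e9 parameters (taken from the "16B" label), all of them quantized, and 32 bits of overhead per group of 64 weights for one 16-bit scale and one 16-bit bias. The function `model_size_gb` is hypothetical. The measured sizes in the Quantization table are the reference; the estimates for the quantized variants come out slightly higher, which suggests that at least one of these assumptions does not hold exactly.

```python
def model_size_gb(num_params, bits_per_weight, group_size=None,
                  overhead_bits_per_group=32):
    """Rough weight-storage estimate in GB (1 GB = 1e9 bytes)."""
    bits = bits_per_weight
    if group_size is not None:
        # Assumed overhead: one 16-bit scale + one 16-bit bias per group.
        bits += overhead_bits_per_group / group_size
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 16e9  # approximate, from the "16B" label

for label, bits, group in [("BF16", 16, None), ("8-bit", 8, 64), ("4-bit", 4, 64)]:
    print(f"{label:>5}: ~{model_size_gb(NUM_PARAMS, bits, group):.1f} GB")
```

These figures cover weights only; the VAE, activations, and the operating system need additional unified memory on top.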

Prior art and attribution

Prior Cosmos3-Nano MLX/quantized conversions exist on Hugging Face (e.g., Reza2kn/Cosmos3-Nano-MLX-8bit). This repo focuses on a full-source MLX implementation with reproducible component parity receipts, end-to-end generation examples across all modalities (text/image/video/audio), and explicit Apple Silicon performance and hardware bounds.

Model weights are from nvidia/Cosmos3-Nano under the OpenMDW 1.1 license.

Limitations

Physical AI distribution only: produces near-static output for off-distribution creative prompts (e.g., object turntables, abstract scenes)
32GB+ memory at BF16: 8-bit quantization needs a 24GB Mac, so 16GB base Macs are not supported at any currently viable precision
4-bit quantization not viable: standard affine quantization degrades severely; a non-uniform (e.g., NF4) or calibrated scheme would be needed to get below ~16GB
Audio: joint denoising produces temporally synchronized sound; prompt adherence is model-dependent and best for on-distribution physical scenes

Source

Full implementation: github.com/lyonsno/cosmos3-mlx

Published by BasinShapers — maintained local-inference routes with receipts.
