Cosmos3-Nano-MLX
NVIDIA Cosmos 3 Nano (16B) running natively on Apple Silicon via MLX.
- What works: text-to-video, image-to-video, text-to-image, and joint video+audio generation
- Hardware tested: M4 Max 128GB, with 256p in ~38s, 480p in ~4min, and 720p in ~10min (30 steps, text KV cache)
- What this is: a full-source MLX implementation with component-level numerical parity against the HuggingFace PyTorch reference, not a converted-weight drop
Important: Physical AI model scope
Cosmos 3 is NVIDIA's Physical AI world foundation model, designed for robotics, autonomous driving, smart spaces, and industrial simulation. It produces strong, physically coherent motion for on-distribution scenes (robot arms, dashcam driving, factory floors) but does not generalize well to arbitrary creative video prompts. This is a model characteristic, not a port limitation: NVIDIA's own playground limits users to 9 curated physical-AI demo inputs.
Install and run
git clone https://github.com/lyonsno/cosmos3-mlx.git
cd cosmos3-mlx
uv venv && uv pip install -e ".[dev]"
# Download weights (~32GB BF16)
huggingface-cli download nvidia/Cosmos3-Nano --local-dir weights/Cosmos3-Nano
Text-to-video
from cosmos3_mlx.load import load_transformer, load_tokenizer
from cosmos3_mlx.pipeline import Cosmos3GenerationPipeline
model = load_transformer("weights/Cosmos3-Nano", reasoner_only=False)
tokenizer = load_tokenizer("weights/Cosmos3-Nano")
pipeline = Cosmos3GenerationPipeline(model=model, tokenizer=tokenizer, model_dir="weights/Cosmos3-Nano")
result = pipeline.generate(
    prompt="A car driving through a suburban intersection on a sunny day",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
)
Image-to-video
import numpy as np
from PIL import Image
img = np.array(Image.open("first_frame.jpg").convert("RGB"))
result = pipeline.generate(
    prompt="A car driving forward along a winding coastal road",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
    image=img,
)
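The call above passes the decoded image straight to `generate`. Whether the pipeline resizes the frame internally is not stated here, so a cautious option is to center-crop and resize it to the requested `height` and `width` first. The sketch below assumes Pillow 9.1+ and NumPy are installed; `center_crop_box` and `load_first_frame` are hypothetical helper names, not part of `cosmos3_mlx`.

```python
def center_crop_box(src_w: int, src_h: int, dst_w: int, dst_h: int) -> tuple:
    """Return a (left, upper, right, lower) box with the target aspect ratio."""
    if src_w * dst_h > dst_w * src_h:
        # Source is wider than the target: trim the left and right edges.
        new_w = round(src_h * dst_w / dst_h)
        left = (src_w - new_w) // 2
        return (left, 0, left + new_w, src_h)
    # Source is taller than (or matches) the target: trim the top and bottom.
    new_h = round(src_w * dst_h / dst_w)
    top = (src_h - new_h) // 2
    return (0, top, src_w, top + new_h)


def load_first_frame(path: str, height: int, width: int):
    """Load an image as an RGB uint8 array of shape (height, width, 3)."""
    # Imported lazily so the box helper above stays dependency-free.
    import numpy as np
    from PIL import Image

    img = Image.open(path).convert("RGB")
    box = center_crop_box(img.width, img.height, width, height)
    img = img.crop(box).resize((width, height), Image.Resampling.LANCZOS)
    return np.array(img)
```

With this helper, `img = load_first_frame("first_frame.jpg", height=256, width=256)` would replace the `np.array(Image.open(...))` line. Center-cropping is only one choice; padding or a different crop may suit some scenes better.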
With audio
result = pipeline.generate(
    prompt="A robot arm picks up an object from a table",
    num_frames=16, height=256, width=256,
    num_inference_steps=30, guidance_scale=6.0, seed=42,
    enable_audio=True,
)
# result["audio_latents"] holds the audio latents; decode them with decode_audio()
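The three calls above share the same arguments and differ only in `image` and `enable_audio`. One way to keep them consistent is a small settings object that builds the keyword arguments. This is a sketch: `GenerationConfig` is a hypothetical class, not part of `cosmos3_mlx`, and the positivity checks are assumptions; only the argument names and default values come from the examples above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class GenerationConfig:
    """Hypothetical container for the keyword arguments used in the examples."""

    prompt: str
    num_frames: int = 16
    height: int = 256
    width: int = 256
    num_inference_steps: int = 30
    guidance_scale: float = 6.0
    seed: int = 42
    image: Optional[Any] = None   # RGB array for image-to-video
    enable_audio: bool = False    # joint video+audio generation

    def to_kwargs(self) -> dict:
        """Build kwargs for pipeline.generate, omitting unused optional arguments."""
        for name in ("num_frames", "height", "width", "num_inference_steps"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        kwargs = {
            "prompt": self.prompt,
            "num_frames": self.num_frames,
            "height": self.height,
            "width": self.width,
            "num_inference_steps": self.num_inference_steps,
            "guidance_scale": self.guidance_scale,
            "seed": self.seed,
        }
        if self.image is not None:
            kwargs["image"] = self.image
        if self.enable_audio:
            kwargs["enable_audio"] = True
        return kwargs


cfg = GenerationConfig(prompt="A robot arm picks up an object from a table",
                       enable_audio=True)
kwargs = cfg.to_kwargs()
```

The result would then be passed as `pipeline.generate(**kwargs)`.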
Numerical parity with HuggingFace PyTorch reference
Every component has been verified against the HF PyTorch implementation:
| Component | Result | Reference |
|---|---|---|
| VAE decoder | Max pixel diff 0.000016, PSNR 122 dB | HF diffusers |
| VAE encoder (single-frame) | Cosine similarity 0.9998 | HF diffusers |
| VAE encoder (chunked multi-frame) | Cosine similarity 0.9999 | HF diffusers |
| Scheduler (UniPC) | Max diff 0.0000019 across 35 steps | HF diffusers |
| Transformer, t2v | Cosine 0.99992 (256p), 0.99984 (720p) | HF diffusers |
| Transformer, i2v | Cosine 0.99981 to 0.99990 per frame (720p) | HF diffusers |
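The table reports three kinds of metric: maximum absolute difference, cosine similarity, and PSNR. As a reminder of what each one measures, here is a minimal pure-Python sketch on flattened outputs. The exact definitions used by the parity tests (the PSNR peak value, per-frame versus global flattening) are not given in this card, so `peak=1.0` and the function names are assumptions.

```python
import math
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened tensors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def psnr_db(a: Sequence[float], b: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; `peak` is the assumed signal range."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# Toy data: the candidate differs from the reference in one element by 1e-5.
reference = [0.10, 0.50, 0.90, 0.30]
candidate = [0.10, 0.50, 0.90, 0.30 + 1e-5]

diff = max_abs_diff(reference, candidate)
cos = cosine_similarity(reference, candidate)
psnr = psnr_db(reference, candidate)
```

PSNR depends on the mean squared error rather than the worst-case element, which is how a maximum pixel difference of 0.000016 can coexist with 122 dB: under the assumed peak of 1.0, 122 dB corresponds to an RMS error of roughly 8e-7.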
Text KV caching gives a 9.87× speedup at 256p: the text tokens stay constant across denoising steps, so their keys and values only need to be computed once. 105 tests pass.
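The idea behind the text KV cache can be illustrated with a toy example: project the text tokens to keys and values on the first denoising step, then reuse the result on every later step. This is not the `cosmos3_mlx` implementation. `CountingProjector` and `TextKVCache` are hypothetical names, and the "projection" is a stand-in that only makes the call count visible.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
KV = Tuple[List[Vector], List[Vector]]


class CountingProjector:
    """Toy stand-in for the text key/value projections; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, tokens: List[Vector]) -> KV:
        self.calls += 1
        keys = [[2.0 * x for x in token] for token in tokens]
        values = [[x + 1.0 for x in token] for token in tokens]
        return keys, values


class TextKVCache:
    """Compute text keys/values on first use and reuse them afterwards."""

    def __init__(self, project: Callable[[List[Vector]], KV]) -> None:
        self._project = project
        self._cached: Optional[KV] = None

    def get(self, text_tokens: List[Vector]) -> KV:
        if self._cached is None:
            self._cached = self._project(text_tokens)
        return self._cached


projector = CountingProjector()
cache = TextKVCache(projector)
text_tokens = [[0.1, 0.2], [0.3, 0.4]]

for step in range(30):
    keys, values = cache.get(text_tokens)
    # A real step would attend from the changing video latents
    # to these fixed text keys/values.
```

After 30 steps the projection has run once instead of 30 times. The cache is only valid while the prompt is unchanged; a new prompt needs a fresh cache.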
Quantization
| Bits | Model size | Quality |
|---|---|---|
| BF16 | ~32 GB | Reference |
| 8-bit (affine, group_size=64) | ~16 GB | Visually indistinguishable from BF16 |
| 4-bit (affine, group_size=64) | ~8.5 GB | Severe degradation; not viable with standard affine quantization |
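To see why dropping from 8 to 4 bits is such a large step, the sketch below applies plain group-wise affine quantization (one scale and one minimum per group of 64 weights) to toy data and measures the round-trip error. It is written in pure Python for illustration and does not call MLX; MLX's actual rounding and packing may differ, and the toy weights are not model weights.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Affine-quantize one group: w ~ q * scale + w_min, with q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # constant group: any scale works, avoid dividing by zero
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate weights."""
    return [q * scale + w_min for q in codes]


def max_roundtrip_error(weights: List[float], bits: int, group_size: int = 64) -> float:
    """Worst absolute error after quantizing and dequantizing every group."""
    worst = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, w_min = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, w_min)
        worst = max(worst, max(abs(w - r) for w, r in zip(group, restored)))
    return worst


# Deterministic toy weights: 128 distinct values spread over [-1, 1].
weights = [((i * 37) % 128) / 63.5 - 1.0 for i in range(128)]
err8 = max_roundtrip_error(weights, bits=8)
err4 = max_roundtrip_error(weights, bits=4)
```

For the same group range, 4-bit quantization has 15 steps where 8-bit has 255, so its step size is 17 times coarser. That matches the direction of the results in the table, although round-trip error on toy data says nothing by itself about visual quality.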
Performance
| Resolution | Frames | Time | Memory |
|---|---|---|---|
| 256×256 | 16 | ~38s | ~32GB (BF16) |
| 256×256 | 32 | ~131s | ~32GB (BF16) |
| 480p (832×480) | 16 | ~252s | ~32GB (BF16) |
| 720p (1280×720) | 16 | ~591s | ~32GB (BF16) |
All timings were measured on an M4 Max 128GB with text KV caching enabled. BF16 requires 32GB+ of unified memory. 8-bit quantization reduces the model size to ~16GB, which makes a 24GB Mac the minimum once the VAE and activations are included.
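The timings in the table can be turned into per-step cost and scaling ratios with a few lines of arithmetic. The sketch below assumes all four rows used 30 denoising steps (stated above for the 256p, 480p, and 720p headline figures, assumed for the 32-frame row) and treats the whole wall-clock time as denoising time, so the per-step numbers are upper bounds. `Run` is a hypothetical helper class.

```python
from dataclasses import dataclass

STEPS = 30  # assumed number of denoising steps for every row


@dataclass(frozen=True)
class Run:
    label: str
    width: int
    height: int
    frames: int
    seconds: float  # approximate wall-clock time from the table

    @property
    def seconds_per_step(self) -> float:
        return self.seconds / STEPS

    @property
    def megapixels(self) -> float:
        """Total generated pixels across all frames, in millions."""
        return self.width * self.height * self.frames / 1e6


runs = [
    Run("256x256, 16 frames", 256, 256, 16, 38.0),
    Run("256x256, 32 frames", 256, 256, 32, 131.0),
    Run("480p, 16 frames", 832, 480, 16, 252.0),
    Run("720p, 16 frames", 1280, 720, 16, 591.0),
]

baseline = runs[0]
for run in runs:
    print(
        f"{run.label}: {run.seconds_per_step:.2f} s/step, "
        f"{run.seconds / baseline.seconds:.1f}x time for "
        f"{run.megapixels / baseline.megapixels:.1f}x pixels"
    )
```

Comparing the time ratio with the pixel ratio gives a rough sense of how cost grows with resolution and frame count on this hardware; the inputs are the approximate figures from the table, so the ratios are approximate too.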
Prior art and attribution
Prior Cosmos3-Nano MLX/quantized conversions exist on Hugging Face (e.g., Reza2kn/Cosmos3-Nano-MLX-8bit). This repo focuses on a full-source MLX implementation with reproducible component parity receipts, end-to-end generation examples across all modalities (text/image/video/audio), and explicit Apple Silicon performance and hardware bounds.
Model weights are from nvidia/Cosmos3-Nano under the OpenMDW 1.1 license.
Limitations
- Physical AI distribution only: produces near-static output for off-distribution creative prompts (e.g., object turntables, abstract scenes)
- 32GB+ memory at BF16: does not fit 16GB base Macs without quantization
- 4-bit quantization not viable: standard affine quantization degrades severely; NF4 or calibrated quantization needed for sub-16GB
- Audio: joint denoising produces temporally synchronized sound; prompt adherence is model-dependent and best for on-distribution physical scenes
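The memory limits above can be expressed as a simple lookup when deciding which weights to load on a given Mac. This is a sketch using only the thresholds quoted in this card (32GB+ for BF16, 24GB minimum for 8-bit); `pick_precision` and its return labels are hypothetical and not part of `cosmos3_mlx`.

```python
from typing import Optional

# Thresholds taken from the figures quoted in this card (unified memory, in GB).
BF16_MIN_GB = 32
INT8_MIN_GB = 24


def pick_precision(unified_memory_gb: float) -> Optional[str]:
    """Return a precision label for the given unified memory, or None if none fits."""
    if unified_memory_gb >= BF16_MIN_GB:
        return "bf16"
    if unified_memory_gb >= INT8_MIN_GB:
        return "8bit"
    # 4-bit affine quantization is reported as not viable, so there is no fallback.
    return None


for memory_gb in (16, 24, 32, 128):
    print(memory_gb, pick_precision(memory_gb))
```

The card does not quantify how much headroom the VAE and activations need at each resolution, so a machine right at a threshold may still run out of memory at 720p.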
Source
Full implementation: github.com/lyonsno/cosmos3-mlx
Published by BasinShapers: maintained local-inference routes with receipts.