Cosmos3-Nano - INT4-AWQ (4-bit weight-only)
A 4-bit (INT4) weight-only, activation-aware (AWQ) quantization of
nvidia/Cosmos3-Nano, produced with NVIDIA
TensorRT Model Optimizer. The transformer's attention + FFN linears (~11.8 B params, 77.6%
of the transformer) are quantized to INT4 with AWQ scaling; embeddings, all norms, the
diffusion time-embedder, and modality I/O adapters stay in BF16. Activations remain BF16.
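To make the idea concrete, here is a minimal pure-Python sketch of weight-only 4-bit quantization with an activation-aware per-channel scale. It is illustrative only: the symmetric rounding scheme, the single group, the function names, and the example numbers are assumptions for this sketch, not the exact recipe TensorRT Model Optimizer applies.

```python
def quantize_group(weights, n_bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (n_bits - 1) - 1              # 7 for signed 4-bit codes
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0  # one shared scale per group
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to floating-point weights."""
    return [c * scale for c in codes]


def awq_style_roundtrip(weights, channel_scales):
    """Quantize w * s, then divide by s again (activation-aware scaling)."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    codes, scale = quantize_group(scaled)
    restored = dequantize_group(codes, scale)
    return [r / s for r, s in zip(restored, channel_scales)]


# Hypothetical weight row; values are made up for illustration.
row = [0.02, -0.71, 0.33, 0.05]
codes, scale = quantize_group(row)
plain = dequantize_group(codes, scale)

# Assume input channel 0 sees large activations, so it gets a larger scale.
aware = awq_style_roundtrip(row, [4.0, 1.0, 1.0, 1.0])

print("plain error on channel 0:", abs(plain[0] - row[0]))
print("aware error on channel 0:", abs(aware[0] - row[0]))
```

In this toy row the small weight on channel 0 rounds to zero under plain quantization, while scaling it up first lets it survive rounding. That is the intuition behind protecting channels that carry large activations.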
Derivative of
nvidia/Cosmos3-Nano. © NVIDIA. Distributed under OpenMDW-1.1 (license text and NVIDIA's original copyright/origin notices retained, per the license). Not affiliated with, nor endorsed by, NVIDIA.
Quality vs BF16 (96-prompt anatomy-weighted sweep)
Diffusion sampling is chaotic: a quantized model yields a different but equally-valid sample at a given seed, so pixel-identity is not a meaningful target. We measure preference quality and distribution match instead.
| Metric | BF16 | INT4-AWQ |
|---|---|---|
| PickScore (human preference; higher better) | 21.85 | 21.88 (Δ +0.04) |
| FID vs BF16 (lower=closer) | n/a | 95.8 |
| Functional fidelity (velocity cosine, identical inputs) | 1.000 | ~0.998 |
| Worst-case PickScore drop over 96 prompts | n/a | −0.97 (tightest of all recipes tested) |
Context for FID: BF16-vs-BF16 at a different seed (same prompts, N=96) scores FID 138.6, so 95.8 is well below the seed-noise floor: the quantized model tracks BF16 more closely than BF16 tracks itself across seeds.
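For readers who want to compute similar summary numbers, the sketch below shows one plausible way to get a velocity cosine and the mean and worst-case per-prompt score deltas. The function names and all input values are hypothetical, and this is not the evaluation script used for the table above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened prediction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def summarize_deltas(reference_scores, quantized_scores):
    """Return (mean delta, worst-case delta) over paired per-prompt scores."""
    deltas = [q - r for r, q in zip(reference_scores, quantized_scores)]
    return sum(deltas) / len(deltas), min(deltas)


# Made-up velocity predictions from the two models on identical inputs.
v_bf16 = [0.50, -1.20, 0.30]
v_int4 = [0.52, -1.18, 0.29]
fidelity = cosine_similarity(v_bf16, v_int4)

# Made-up per-prompt preference scores for three prompts.
bf16_scores = [21.0, 22.5, 20.8]
int4_scores = [21.3, 22.4, 20.9]
mean_delta, worst_drop = summarize_deltas(bf16_scores, int4_scores)

print(f"cosine={fidelity:.4f} mean_delta={mean_delta:+.2f} worst={worst_drop:+.2f}")
```

Reporting the worst-case delta alongside the mean helps because an average can hide a small number of prompts that regress noticeably.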
Honest caveats: quality is preserved on typical content. The hardest cases (dense interlocking
hands and on-image text spelling) can still wobble; these are base-model-difficulty failure modes
present in BF16 too, not introduced by quantization. See worst_case_contact_sheet.png.
Usage
import torch
from huggingface_hub import snapshot_download
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
import modelopt.torch.opt as mto
repo = snapshot_download("Reza2kn/Cosmos3-Nano-INT4-AWQ")
tf = Cosmos3OmniTransformer.from_config(
    Cosmos3OmniTransformer.load_config(f"{repo}/transformer/config.json")).to(torch.bfloat16)
mto.restore(tf, f"{repo}/transformer/modelopt_quantized.pt")  # restores 4-bit weights
pipe = Cosmos3OmniPipeline.from_pretrained(
    repo, transformer=tf, torch_dtype=torch.bfloat16, enable_safety_checker=False).to("cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):  # required (see note)
    img = pipe("A red panda astronaut floating in a nebula", num_frames=1,
               height=480, width=480).video[0][0]
img.save("out.png")
Or use the included helper script: from load_quantized import load; pipe = load()
Requirements: diffusers (git main / ≥0.39, for Cosmos3 support), nvidia-modelopt, torch (cu128
for Blackwell). The autocast is required: a few positional/rotary tensors are computed in float32
on the fly and must be cast to bf16 before hitting the 4-bit linears.
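A quick way to check the requirements above before loading the model is to query the installed package metadata. This is a minimal standard-library sketch. The helper names are hypothetical, and the version parsing is deliberately crude (it only compares major and minor numbers), so treat the result as a hint, not a guarantee.

```python
import re
from importlib import metadata


def major_minor(version_text):
    """Extract (major, minor) from strings like '0.39.0.dev0' or '2.7.0+cu128'."""
    numbers = [int(n) for n in re.findall(r"\d+", version_text)[:2]]
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def installed_versions(packages=("diffusers", "nvidia-modelopt", "torch")):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
diffusers_version = versions["diffusers"]
diffusers_ok = diffusers_version is not None and major_minor(diffusers_version) >= (0, 39)

for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
print("diffusers new enough for Cosmos3:", diffusers_ok)
```

A build installed from git main may report a development version string; the check above treats any 0.39 or later prefix as sufficient.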
Deployment note
4-bit weight packing gives real memory savings now (transformer 30 GB → ~10 GB). Native inference speedups require a runtime with Cosmos3-aware INT4 kernels; with modelopt the model runs in dequant mode, preserving quality at ~BF16 speed.
Method
- modelopt INT4_AWQ_CFG (awq_lite), weight-only; calibrated on a multimodal image+video prompt set through the real denoising loop.
- Quantized: self_attn.{to_q,to_k,to_v,to_out,add_q_proj,add_k_proj,add_v_proj,to_add_out}, mlp.*, mlp_moe_gen.*, lm_head.
- Kept BF16: embeddings, modality embeds, norms (incl. QK-norm), time_embedder, proj_in/proj_out, audio/action adapters.
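The quantized/kept split listed above can be read as a name-based filter over the transformer's submodules. The sketch below expresses that filter with standard-library wildcard matching. The patterns are transcribed from the list, but the full module paths (such as blocks.0.self_attn.to_q) are hypothetical, and this is not the actual modelopt configuration used for this checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns transcribed from the "Quantized" list; a leading * allows any prefix.
QUANTIZED_PATTERNS = [
    "*self_attn.to_q", "*self_attn.to_k", "*self_attn.to_v", "*self_attn.to_out",
    "*self_attn.add_q_proj", "*self_attn.add_k_proj", "*self_attn.add_v_proj",
    "*self_attn.to_add_out", "*mlp.*", "*mlp_moe_gen.*", "*lm_head",
]


def precision_for(module_name):
    """Return 'INT4' if the name matches a quantized pattern, else 'BF16'."""
    if any(fnmatchcase(module_name, pattern) for pattern in QUANTIZED_PATTERNS):
        return "INT4"
    return "BF16"


# Hypothetical module paths, for illustration only.
examples = [
    "blocks.0.self_attn.to_q",
    "blocks.0.self_attn.norm_q",
    "blocks.3.mlp.fc1",
    "time_embedder.linear_1",
    "lm_head",
]
for name in examples:
    print(f"{name:28s} -> {precision_for(name)}")
```

Anything that does not match a pattern defaults to BF16, which mirrors the conservative choice of keeping embeddings, norms, and adapters unquantized.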