Cosmos3-Nano-GPTQ-4bit
DuoNeural | 2026-06-02
Model Description
Cosmos3-Nano-GPTQ-4bit is a 4-bit GPTQ-quantized version of nvidia/Cosmos3-Nano, designed for deployment on consumer hardware with reduced VRAM requirements.
Base model: nvidia/Cosmos3-Nano (15.16B parameters, MoT architecture)
Quantization: GPTQ 4-bit weight-only, group-size 128
Target: transformer linear layers (507 total). VAE, sound tokenizer, and scheduler remain in BF16.
Architecture (Key Findings)
Cosmos3-Nano uses a Mixture-of-Transformers (MoT) architecture:
- und_seq: text AR pathway (conditioning on prompt)
- gen_seq: video DiT pathway (denoising latent frames)
- Both streams cross-attend in each of 36 transformer layers
- No standalone text-only inference mode: text and video are processed jointly
Important: The action prediction head (action_modality_embed, action_proj_in/out) weights are absent from the Nano model's HuggingFace release. Cosmos3-Nano is video generation only.
Quantization Details
| Parameter | Value |
|---|---|
| Method | GPTQ (weight-only) |
| Bits | 4 |
| Group size | 128 |
| Linear layers quantized | 507 |
| Parameters quantized | 14.53B |
| Calibration prompts | 8 (diverse robot action descriptions) |
| Calibration mode | Image-mode (num_frames=1, 256×256, 1 denoising step) |
| Kept in BF16 | VAE, sound tokenizer, scheduler |
Why GPTQ for Cosmos3?
Standard auto-gptq is incompatible with transformers 4.48.3 (no_init_weights import error). This model uses a custom GPTQ implementation that:
- Registers forward hooks on each linear layer to capture calibration inputs
- Computes Hessian estimates (H = X^T X) per layer during image-mode pipeline runs
- Applies column-wise GPTQ quantization with Cholesky-based second-order correction
- Falls back to absmax quantization if Cholesky decomposition fails (numerical stability)
Image-mode calibration is required because Cosmos3's MoT architecture does not support standalone text-only inference: the transformer() method requires joint und_seq + gen_seq inputs.
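The absmax fallback mentioned in the list above can be illustrated with a minimal pure-Python sketch. It is not the custom implementation itself: the mid-point zero of 8 and the helper names `quantize_group_absmax` and `dequantize_group` are assumptions made for illustration. Only the 0 to 15 code range and the `(code - zero) * scale` dequantization form come from this card.

```python
QMAX = 15  # 4-bit codes span 0..15
ZERO = 8   # assumed mid-point zero for a symmetric absmax fallback


def quantize_group_absmax(weights):
    """Quantize one group of floats to 4-bit codes that share a single scale."""
    absmax = max(abs(w) for w in weights)
    # Map [-absmax, +absmax] onto codes 1..15 centred on ZERO; guard all-zero groups.
    scale = absmax / (QMAX - ZERO) if absmax > 0 else 1.0
    codes = [min(QMAX, max(0, round(w / scale) + ZERO)) for w in weights]
    return codes, scale


def dequantize_group(codes, scale, zero=ZERO):
    """Inverse mapping, in the same form the loader uses: (code - zero) * scale."""
    return [(c - zero) * scale for c in codes]


# A short group for readability; the card uses groups of 128 weights.
group = [0.7, -0.35, 0.0, 0.1, -0.7, 0.2, 0.05, -0.15]
codes, scale = quantize_group_absmax(group)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.4f} max_err={max_err:.4f}")
```

Because nothing is clipped in this scheme, the per-weight rounding error stays within about half a scale step. GPTQ's second-order correction aims to do better than this by compensating each column's error in the columns not yet quantized.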
Usage
from diffusers import Cosmos3OmniPipeline
import torch

pipe = Cosmos3OmniPipeline.from_pretrained(
    "DuoNeural/Cosmos3-Nano-GPTQ-4bit",
    torch_dtype=torch.bfloat16,
    enable_safety_checker=False,  # Required: cosmos_guardrail is not bundled
)
pipe.to("cuda")

output = pipe(
    prompt="A robot arm performs a complex manipulation task.",
    num_frames=57,  # ~2.4 seconds at 24fps
    height=480, width=848,
    num_inference_steps=35,
    guidance_scale=6.0,
)
# output.video is a list of PIL frames
Memory Requirements
| Format | VRAM (approx.) |
|---|---|
| BF16 (original) | ~30 GB |
| GPTQ 4-bit (this model) | ~8–10 GB (transformer) + ~2 GB (VAE/other) |
Note: Actual VRAM depends on video resolution and frame count. The diffusion process may require additional VRAM for intermediate activations.
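As a rough cross-check on the table above, weight storage can be estimated from figures stated in this card (15.16B total parameters, 14.53B quantized, group size 128). This back-of-envelope sketch assumes one float16 scale and one int8 zero per group, and assumes that every parameter outside the quantized layers stays in BF16. It counts weights only, so it should be read as a lower bound that ignores activations and allocator overhead.

```python
TOTAL_PARAMS = 15.16e9      # base model size stated in this card
QUANTIZED_PARAMS = 14.53e9  # parameters quantized, from the table above
GROUP_SIZE = 128
GB = 1e9                    # decimal gigabytes

bf16_total = TOTAL_PARAMS * 2 / GB             # 2 bytes per BF16 weight
packed_codes = QUANTIZED_PARAMS * 0.5 / GB     # 4 bits = 0.5 byte per weight
n_groups = QUANTIZED_PARAMS / GROUP_SIZE
scales = n_groups * 2 / GB                     # assumed: one float16 scale per group
zeros = n_groups * 1 / GB                      # assumed: one int8 zero per group
unquantized = (TOTAL_PARAMS - QUANTIZED_PARAMS) * 2 / GB  # assumed to stay BF16

quantized_total = packed_codes + scales + zeros + unquantized
print(f"BF16 weights:       {bf16_total:.2f} GB")
print(f"4-bit codes:        {packed_codes:.2f} GB")
print(f"scales + zeros:     {scales + zeros:.2f} GB")
print(f"unquantized (BF16): {unquantized:.2f} GB")
print(f"quantized total:    {quantized_total:.2f} GB")
```

The per-group metadata adds under 5% on top of the packed codes, which is why a group size of 128 costs little extra memory.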
Dependencies
torch>=2.6.0
diffusers @ git HEAD # requires two patches to attention_dispatch.py
transformers==4.48.3
Required diffusers patches (attention_dispatch.py):
- Comment out @_custom_op("_diffusers_flash_attn_3::_flash_attn_forward", ...) at line 748
- Comment out @_register_fake("_diffusers_flash_attn_3::_flash_attn_forward") at line 792
Packing Verification (2026-06-02)
Integer roundtrip test: EXACT. Nibble packing/unpacking is bit-perfect: the stored qweight tensors faithfully reproduce the quantized integer codes (0–15) when unpacked.
| Test | Result |
|---|---|
| Integer roundtrip (pack → unpack → compare) | EXACT (0 mismatches) |
| Float16 scale precision loss | < 2.85e-5 max |
| Zero int8 precision loss | 0 (values in [0,15] range) |
| Reconstruction error (mean, vs original BF16) | ~0.0017 |
| Max reconstruction error | ~0.046 |
Note on reconstruction error: The pack script re-quantized already-quantized BF16 weights (from the GPTQ pass), which constitutes double quantization. The ~0.046 max error is expected from this double quantization; it is not a packing defect. A future version could avoid this by packing integer codes directly from the GPTQ pass.
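The integer roundtrip test can be sketched in pure Python. The layout (8 codes per 32-bit word, least significant nibble first, stored as signed int32) follows the loader's description of the format elsewhere in this card. The function names `pack_nibbles` and `unpack_nibbles` are hypothetical and this is not the actual pack script.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) into 32-bit words, 8 per word, LSB first."""
    if len(codes) % 8:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            if not 0 <= code <= 15:
                raise ValueError(f"code out of range: {code}")
            word |= code << (4 * j)
        # Reinterpret as signed int32, the dtype used for the stored qweight.
        if word >= 1 << 31:
            word -= 1 << 32
        words.append(word)
    return words


def unpack_nibbles(words):
    """Recover the 4-bit codes; the 0xF mask also handles negative words."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]


codes = [(7 * i + 3) % 16 for i in range(32)]
packed = pack_nibbles(codes)
mismatches = sum(a != b for a, b in zip(codes, unpack_nibbles(packed)))
print(f"{len(packed)} words, {mismatches} mismatches")  # 4 words, 0 mismatches
```

A word whose top nibble is 8 or higher becomes negative as an int32. The arithmetic right shift followed by the mask still returns the original code, so the roundtrip stays exact.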
Limitations
- 4-bit quantization may introduce quality degradation in fine video details
- Custom nibble format: not compatible with auto-gptq, exllama, or standard GPTQ loaders. Manual unpacking required (see packing verification above for the format)
- Double quantization: the pack script re-quantized already-quantized BF16 weights, adding a second rounding step (~5× scale bucket max error vs single-pass). Future versions should pack directly from GPTQ integer codes
- Calibration used only 8 prompts (physical robot action descriptions), so quality on very different domains may vary
- Cosmos3's native cosmos_guardrail safety checker is not bundled; use enable_safety_checker=False
- Action prediction head absent from Nano: video generation only, no robot action outputs
Research Context
This quantization is part of DuoNeural's Cosmos3 analysis portfolio, alongside:
- DuoNeural/Cosmos3-Nano-Abliterated: projected abliteration on the text conditioning pathway
Key findings from our Cosmos3 probe work:
- Cosmos3's safety behavior emerges in the text conditioning pathway (und_seq), not the cosmos_guardrail wrapper
- Late-layer safety representation (direction norms grow from 2 to 24 across layers 15–35)
- MoT architecture is DHP-immune by construction (full-attention, no recurrent state)
- Action head completely absent from Nano HuggingFace release
Citation
@misc{duoneural2026cosmos3gptq,
title={Cosmos3-Nano-GPTQ-4bit: 4-bit Quantization of NVIDIA Cosmos3-Nano},
author={Archon and Jesse and Aura},
year={2026},
publisher={DuoNeural},
url={https://huggingface.co/DuoNeural/Cosmos3-Nano-GPTQ-4bit}
}
DuoNeural Research Lab | archon@agentmail.to | duoneural.com
Papers: Zenodo Community | Models: HuggingFace
Kilonova UMA Inference Test: AMD 780M, 16GB VRAM (2026-06-02)
CONFIRMED WORKING on kilonova (AMD Radeon 780M, 16GB UMA, ROCm).
Measured VRAM
| Stage | VRAM Used |
|---|---|
| After BF16 pipeline load (mmap, CPU) | ~0 GPU |
| After patching 329 Linear → QuantizedLinear | 6.5 GB |
| After pipe.to("cuda") (full pipeline on GPU) | 14.4 GB / 17.2 GB |
| During inference (peak, 20 steps) | ~14.4 GB (stable) |
No offloading, no ZRAM swap. The BF16 version required 30GB and thrashed swap for 18+ hours.
Inference Speed
~14–15 seconds/step at 128×128 resolution. 20 steps ≈ 5 minutes/video on AMD 780M (6.27 TFLOPS).
Resolution Notes
- 128×128, 8 frames: works, no OOM. Output quality degraded (off training distribution)
- 256×256, 16 frames: OOM during VAE decode (transformer uses 13.9GB, VAE decode needs 1.69GB additional)
- Fix for 256×256: use pipe.enable_model_cpu_offload() instead of pipe.to("cuda"), which lowers peak VRAM at the cost of speed
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True recommended
Output Quality at 128×128
- 2/3 prompts produced structured, colorful output (pixel_std 44–52 on 0-255 scale)
- 1/3 prompts produced black frames (NaN latent: a seed-dependent initialization issue, not a model defect)
- Output is blurry at 128×128, which is expected because the model was trained at 720×1280
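A quick way to flag the black-frame case before saving a video is to check each frame for NaNs and near-zero pixel variation. The sketch below assumes that pixel_std means the population standard deviation of all pixel values in a frame on the 0-255 scale. The threshold of 1.0 and the helper names are hypothetical choices, not values taken from the test run.

```python
import math
from statistics import pstdev


def frame_stats(pixels):
    """Return (has_nan, pixel_std) for a flat list of 0-255 pixel values."""
    has_nan = any(math.isnan(p) for p in pixels)
    finite = [p for p in pixels if not math.isnan(p)]
    std = pstdev(finite) if finite else 0.0
    return has_nan, std


def looks_black(pixels, std_threshold=1.0):
    """Flag frames that contain NaNs or show almost no pixel variation."""
    has_nan, std = frame_stats(pixels)
    return has_nan or std < std_threshold


black = [0] * 256
structured = [(i * 37) % 256 for i in range(256)]  # synthetic varied frame
corrupted = [float("nan")] + [128.0] * 255

for name, frame in [("black", black), ("structured", structured), ("corrupted", corrupted)]:
    print(name, looks_black(frame))
```

Since the failure is described as seed-dependent, a flagged video can simply be regenerated with a different seed.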
Correct Loading Approach (Kilonova / Memory-Constrained GPUs)
The GPTQ repo contains transformer weights only (packed shards + config). The pipeline components (VAE, tokenizer, scheduler) must come from a full cached pipeline. Use this pattern:
from diffusers import Cosmos3OmniPipeline
import safetensors.torch as st
import torch, torch.nn as nn, torch.nn.functional as F
from pathlib import Path
from huggingface_hub import snapshot_download
GROUP_SIZE = 128
DTYPE = torch.bfloat16
class QuantizedLinear(nn.Module):
    """DuoNeural int32 nibble-pack format (8 int4 per int32, LSB first)."""

    def __init__(self, qweight, scales, zeros, bias, out_f, in_f):
        super().__init__()
        self.register_buffer('qweight', qweight)  # (out, in//8) int32
        self.register_buffer('scales', scales)    # (out, n_groups) float16
        self.register_buffer('zeros', zeros)      # (out, n_groups) int8
        self.out_features, self.in_features = out_f, in_f
        self.bias = None if bias is None else nn.Parameter(bias.clone())

    def dequantize(self):
        qw = self.qweight
        out_f, packed_in = qw.shape
        in_f = packed_in * 8
        n_groups = in_f // GROUP_SIZE
        # Shift each int32 by 0, 4, ..., 28 bits and mask to extract its 8 nibbles.
        shifts = torch.arange(0, 32, 4, device=qw.device, dtype=torch.int32)
        nibbles = (qw.unsqueeze(-1) >> shifts.view(1, 1, 8)) & 0xF
        w = nibbles.reshape(out_f, n_groups, GROUP_SIZE).float()
        # Per-group affine dequantization: (code - zero) * scale.
        w = (w - self.zeros.float().unsqueeze(-1)) * self.scales.float().unsqueeze(-1)
        return w.reshape(out_f, in_f).to(DTYPE)

    def forward(self, x):
        # Dequantize on the fly so the full-precision weight exists only during this call.
        w = self.dequantize()
        out = F.linear(x, w.to(x.dtype), self.bias)
        del w
        return out
# 1. Download GPTQ packed shards
gptq_dir = snapshot_download(
    "DuoNeural/Cosmos3-Nano-GPTQ-4bit",
    cache_dir="/path/to/cache",  # must be on disk with ~6GB free, NOT tmpfs
)

# 2. Load base pipeline (needs a full cached pipeline, e.g. Cosmos3-Nano-Abliterated)
pipe = Cosmos3OmniPipeline.from_pretrained(
    "DuoNeural/Cosmos3-Nano-Abliterated",  # or any full Cosmos3 pipeline
    torch_dtype=DTYPE, device_map="cpu",
    low_cpu_mem_usage=True, local_files_only=True,
    enable_safety_checker=False,
)

# 3. Load packed shards and patch transformer
packed_state = {}
for pf in sorted(Path(gptq_dir).glob("model-*-packed.safetensors")):
    packed_state.update(st.load_file(str(pf), device="cpu"))
def patch_gptq(model, state):
    # Group the packed tensors by module path, e.g. "layers.0.mlp.down_proj".
    params = {}
    for k, v in state.items():
        for suffix in ['.weight.qweight', '.weight.scales', '.weight.zeros']:
            if k.endswith(suffix):
                path = k[:-len(suffix)]
                # Slice off the '.weight.' prefix; str.lstrip strips a
                # character set, not a prefix, so it is avoided here.
                field = suffix[len('.weight.'):]
                params.setdefault(path, {})[field] = v

    def get_nested(m, parts):
        for p in parts:
            m = getattr(m, p)
        return m

    def set_nested(m, parts, v):
        for p in parts[:-1]:
            m = getattr(m, p)
        setattr(m, parts[-1], v)

    patched = 0
    for path, p in params.items():
        if len(p) < 3:
            continue  # skip modules missing qweight, scales, or zeros
        parts = path.split('.')
        try:
            orig = get_nested(model, parts)
        except AttributeError:
            continue
        if not isinstance(orig, nn.Linear):
            continue
        qw = p['qweight']
        ql = QuantizedLinear(
            qweight=qw.to('cuda').to(torch.int32),
            scales=p['scales'].to('cuda').to(torch.float16),
            zeros=p['zeros'].to('cuda').to(torch.int8),
            bias=orig.bias, out_f=qw.shape[0], in_f=qw.shape[1]*8,
        )
        set_nested(model, parts, ql)
        patched += 1
    return model, patched
_, n = patch_gptq(pipe.transformer, packed_state)
print(f"Patched {n} linear layers")
del packed_state
# 4. Move to GPU
pipe = pipe.to("cuda")  # 14.4 GB VRAM for full pipeline
# OR: pipe.enable_model_cpu_offload()  # slower but handles any resolution

output = pipe(
    prompt="A robotic arm picks red blocks.",
    num_frames=8, height=128, width=128,
    num_inference_steps=20, guidance_scale=7.0,
)
Important: State Dict Key Format
Keys use .weight.qweight not .qweight:
layers.0.mlp.down_proj.weight.qweight  # strip '.weight.qweight' for module path
layers.0.mlp.down_proj.weight.scales
layers.0.mlp.down_proj.weight.zeros
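The key convention can be checked without loading any tensors. This minimal sketch groups key names by module path and reports which modules have all three packed fields. The `layers.0.norm.weight` key is a hypothetical unquantized entry added for illustration, and `group_packed_keys` is not part of the repo.

```python
SUFFIXES = (".weight.qweight", ".weight.scales", ".weight.zeros")


def group_packed_keys(keys):
    """Map each module path to the set of packed fields found for it."""
    grouped = {}
    for key in keys:
        for suffix in SUFFIXES:
            if key.endswith(suffix):
                path = key[: -len(suffix)]
                field = suffix.rsplit(".", 1)[-1]  # "qweight", "scales" or "zeros"
                grouped.setdefault(path, set()).add(field)
    return grouped


keys = [
    "layers.0.mlp.down_proj.weight.qweight",
    "layers.0.mlp.down_proj.weight.scales",
    "layers.0.mlp.down_proj.weight.zeros",
    "layers.0.norm.weight",  # hypothetical unquantized tensor; ignored
]
grouped = group_packed_keys(keys)
complete = [p for p, f in grouped.items() if f == {"qweight", "scales", "zeros"}]
print(complete)  # ['layers.0.mlp.down_proj']
```

Running the same grouping over the real shard keys before patching is a cheap way to spot modules with a missing field.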
Launch Flags (AMD ROCm)
HSA_ENABLE_SDMA=0 GPU_MAX_HW_QUEUES=1 TORCHDYNAMO_DISABLE=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 your_script.py
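If exporting the flags in the shell is inconvenient, the same variables can be set from inside the script, provided this happens before torch is imported. The sketch below is one way to do that; the helper name `apply_rocm_env` is hypothetical. It uses `setdefault` so that values already exported in the shell take precedence.

```python
import os

# Same flags as the shell command above.
ROCM_ENV = {
    "HSA_ENABLE_SDMA": "0",
    "GPU_MAX_HW_QUEUES": "1",
    "TORCHDYNAMO_DISABLE": "1",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
}


def apply_rocm_env(env=os.environ):
    """Set each flag unless the environment already provides a value."""
    for name, value in ROCM_ENV.items():
        env.setdefault(name, value)
    return {name: env[name] for name in ROCM_ENV}


active = apply_rocm_env()
for name, value in active.items():
    print(f"{name}={value}")

# import torch  # import torch only after the environment is prepared
```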