Cosmos3-Nano-GPTQ-4bit

DuoNeural | 2026-06-02

Model Description

Cosmos3-Nano-GPTQ-4bit is a 4-bit GPTQ-quantized version of NVIDIA/Cosmos3-Nano, designed for deployment on consumer hardware with reduced VRAM requirements.

Base model: nvidia/Cosmos3-Nano (15.16B parameters, MoT architecture)
Quantization: GPTQ 4-bit weight-only, group-size 128
Target: transformer linear layers (507 total). VAE, sound tokenizer, and scheduler remain in BF16.

Architecture (Key Findings)

Cosmos3-Nano uses a Mixture-of-Transformers (MoT) architecture:

  • und_seq: text AR pathway (conditioning on prompt)
  • gen_seq: video DiT pathway (denoising latent frames)
  • Both streams cross-attend in each of 36 transformer layers
  • No standalone text-only inference mode: text and video are processed jointly

Important: The action prediction head (action_modality_embed, action_proj_in/out) weights are absent from the Nano model's HuggingFace release. Cosmos3-Nano is video generation only.

Quantization Details

Parameter | Value
Method | GPTQ (weight-only)
Bits | 4
Group size | 128
Linear layers quantized | 507
Parameters quantized | 14.53B
Calibration prompts | 8 (diverse robot action descriptions)
Calibration mode | Image-mode (num_frames=1, 256×256, 1 denoising step)
Kept in BF16 | VAE, sound tokenizer, scheduler

Why GPTQ for Cosmos3?

Standard auto-gptq is incompatible with transformers 4.48.3 (no_init_weights import error). This model uses a custom GPTQ implementation that:

  1. Registers forward hooks on each linear layer to capture calibration inputs
  2. Computes Hessian estimates (H = X^T X) per layer during image-mode pipeline runs
  3. Applies column-wise GPTQ quantization with Cholesky-based second-order correction
  4. Falls back to absmax quantization if Cholesky decomposition fails (numerical stability)
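
Steps 2 to 4 above can be illustrated with a small pure-Python sketch. It is not the DuoNeural implementation: the function names are hypothetical, plain lists stand in for tensors, and the symmetric absmax fallback is only one plausible reading of "absmax quantization". The sketch accumulates H = X^T X from captured inputs, attempts a Cholesky factorization, and falls back when the factorization fails (here because one input channel is always zero).

```python
import math

def accumulate_hessian(H, X):
    """Add X^T X to H in place; X is a list of input rows captured for one layer."""
    n = len(H)
    for row in X:
        for i in range(n):
            for j in range(n):
                H[i][j] += row[i] * row[j]

def cholesky(H):
    """Return lower-triangular L with L L^T = H; raise ValueError if H is not positive definite."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def absmax_quantize(row, bits=4):
    """Assumed fallback: symmetric round-to-nearest, scaled by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero row: any scale works
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
    return codes, scale

def choose_method(H):
    """Use the second-order (GPTQ) path only when the Hessian can be factorized."""
    try:
        cholesky(H)
        return "gptq"
    except ValueError:
        return "absmax"

# Well-conditioned calibration inputs: Cholesky succeeds.
H_ok = [[0.0, 0.0], [0.0, 0.0]]
accumulate_hessian(H_ok, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(choose_method(H_ok))

# Second input channel is always zero: H is singular, so the fallback is chosen.
H_dead = [[0.0, 0.0], [0.0, 0.0]]
accumulate_hessian(H_dead, [[1.0, 0.0], [2.0, 0.0]])
print(choose_method(H_dead))
```

With only 8 calibration prompts, rank-deficient Hessians of this kind are plausible for some layers, which is one reason a fallback path is useful.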

Image-mode calibration is required because Cosmos3's MoT architecture does not support standalone text-only inference: the transformer() method requires joint und_seq + gen_seq inputs.

Usage

from diffusers import Cosmos3OmniPipeline
import torch

pipe = Cosmos3OmniPipeline.from_pretrained(
    "DuoNeural/Cosmos3-Nano-GPTQ-4bit",
    torch_dtype=torch.bfloat16,
    enable_safety_checker=False,  # Required: cosmos_guardrail is not bundled
)
pipe.to("cuda")

output = pipe(
    prompt="A robot arm performs a complex manipulation task.",
    num_frames=57,         # ~2.4 seconds at 24fps
    height=480, width=848,
    num_inference_steps=35,
    guidance_scale=6.0,
)
# output.video is a list of PIL frames

Memory Requirements

Format | VRAM (approx.)
BF16 (original) | ~30 GB
GPTQ 4-bit (this model) | ~8–10 GB (transformer) + ~2 GB (VAE/other)

Note: Actual VRAM depends on video resolution and frame count. The diffusion process may require additional VRAM for intermediate activations.
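
As a rough cross-check of the transformer figure, the weight storage can be estimated from the numbers in this card. The sketch below assumes one float16 scale and one int8 zero point per group of 128 weights (the layout used by the loader later in this card) and assumes the remaining 0.63B parameters stay in BF16. The function name is hypothetical, and activations and allocator overhead are not counted.

```python
def packed_weight_gb(n_params, bits=4, group_size=128, scale_bytes=2, zero_bytes=1):
    """Estimate storage in decimal GB for group-quantized weights plus per-group metadata."""
    code_bytes = n_params * bits / 8
    meta_bytes = (n_params / group_size) * (scale_bytes + zero_bytes)
    return (code_bytes + meta_bytes) / 1e9

quantized_gb = packed_weight_gb(14.53e9)          # 4-bit codes + scales + zeros
residual_gb = (15.16e9 - 14.53e9) * 2 / 1e9       # assumed BF16, 2 bytes per parameter

print(f"quantized linear layers: {quantized_gb:.2f} GB")
print(f"unquantized remainder:   {residual_gb:.2f} GB")
print(f"total weights:           {quantized_gb + residual_gb:.2f} GB")
```

Under these assumptions the weights alone come to a little under 9 GB, which sits inside the approximate range quoted in the table.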

Dependencies

torch>=2.6.0
diffusers @ git HEAD  # requires two patches to attention_dispatch.py
transformers==4.48.3

Required diffusers patches (attention_dispatch.py):

  • Comment out @_custom_op("_diffusers_flash_attn_3::_flash_attn_forward", ...) at line 748
  • Comment out @_register_fake("_diffusers_flash_attn_3::_flash_attn_forward") at line 792

Packing Verification (2026-06-02)

Integer roundtrip test: EXACT. Nibble packing/unpacking is bit-perfect: the stored qweight tensors faithfully reproduce the quantized integer codes (0–15) when unpacked.

Test | Result
Integer roundtrip (pack→unpack→compare) | EXACT (0 mismatches)
Float16 scale precision loss | < 2.85e-5 max
Zero int8 precision loss | 0 (values in [0,15] range)
Reconstruction error (mean, vs original BF16) | ~0.0017
Max reconstruction error | ~0.046
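
The integer roundtrip can be pictured with a minimal pure-Python sketch of the nibble layout described in this card (8 four-bit codes per 32-bit word, lowest nibble first). The helper names are hypothetical and this is not the actual pack script; it only shows why the roundtrip is exact, including when the top nibble makes the stored int32 negative.

```python
def pack_nibbles(codes):
    """Pack 8 four-bit codes (0..15) into one 32-bit word, lowest nibble first."""
    if len(codes) != 8 or any(not 0 <= c <= 15 for c in codes):
        raise ValueError("expected exactly 8 codes in the range 0..15")
    word = 0
    for i, c in enumerate(codes):
        word |= c << (4 * i)
    return word

def unpack_nibbles(word):
    """Recover the 8 four-bit codes from a packed word, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def as_int32(word):
    """Reinterpret an unsigned 32-bit value as signed, the way an int32 tensor stores it."""
    return word - (1 << 32) if word >= (1 << 31) else word

codes = [3, 0, 15, 7, 8, 1, 12, 9]
packed = pack_nibbles(codes)

# Roundtrip is exact for both the unsigned and the signed view of the word.
assert unpack_nibbles(packed) == codes
assert unpack_nibbles(as_int32(packed)) == codes
print(hex(packed), as_int32(packed) < 0)
```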

Note on reconstruction error: The pack script re-quantized already-quantized BF16 weights (from the GPTQ pass), which constitutes double quantization. The ~0.046 max error is expected from this double quantization; it is not a packing defect. A future version could avoid this by packing integer codes directly from the GPTQ pass.

Limitations

  • 4-bit quantization may introduce quality degradation in fine video details
  • Custom nibble format: not compatible with auto-gptq, exllama, or standard GPTQ loaders. Manual unpacking required (see packing verification above for the format)
  • Double quantization: the pack script re-quantized already-quantized BF16 weights, adding a second rounding step (~5× scale bucket max error vs single-pass). Future versions should pack directly from GPTQ integer codes
  • Calibration used only 8 prompts (physical robot action descriptions); quality on very different domains may vary
  • Cosmos3's native cosmos_guardrail safety checker is not bundled; use enable_safety_checker=False
  • Action prediction head absent from Nano: video generation only, no robot action outputs

Research Context

This quantization is part of DuoNeural's Cosmos3 analysis portfolio.

Key findings from our Cosmos3 probe work:

  1. Cosmos3's safety behavior emerges in the text conditioning pathway (und_seq), not the cosmos_guardrail wrapper
  2. Late-layer safety representation (direction norms grow 2→24 across layers 15→35)
  3. MoT architecture is DHP-immune by construction (full-attention, no recurrent state)
  4. Action head completely absent from Nano HuggingFace release

Citation

@misc{duoneural2026cosmos3gptq,
    title={Cosmos3-Nano-GPTQ-4bit: 4-bit Quantization of NVIDIA Cosmos3-Nano},
    author={Archon and Jesse and Aura},
    year={2026},
    publisher={DuoNeural},
    url={https://huggingface.co/DuoNeural/Cosmos3-Nano-GPTQ-4bit}
}

DuoNeural Research Lab | archon@agentmail.to | duoneural.com

Papers: Zenodo Community | Models: HuggingFace

Kilonova UMA Inference Test: AMD 780M, 16GB VRAM (2026-06-02)

CONFIRMED WORKING on kilonova (AMD Radeon 780M, 16GB UMA, ROCm).

Measured VRAM

Stage | VRAM Used
After BF16 pipeline load (mmap, CPU) | ~0 GPU
After patching 329 Linear → QuantizedLinear | 6.5 GB
After pipe.to("cuda") (full pipeline on GPU) | 14.4 GB / 17.2 GB
During inference (peak, 20 steps) | ~14.4 GB (stable)

No offloading, no ZRAM swap. The BF16 version required 30GB and thrashed swap for 18+ hours.

Inference Speed

~14–15 seconds/step at 128×128 resolution. 20 steps ≈ 5 minutes/video on AMD 780M (6.27 TFLOPS).

Resolution Notes

  • 128×128, 8 frames: works, no OOM. Output quality degraded (off training distribution)
  • 256×256, 16 frames: OOM during VAE decode (transformer uses 13.9GB, VAE decode needs 1.69GB additional)
  • Fix for 256×256: use pipe.enable_model_cpu_offload() instead of pipe.to("cuda"); this trades speed for lower VRAM use
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True recommended

Output Quality at 128ร—128

  • 2/3 prompts produced structured, colorful output (pixel_std 44–52 on 0-255 scale)
  • 1/3 prompts produced black frames (NaN latent; seed-dependent initialization issue, not a model defect)
  • Output is blurry at 128×128, as expected: the model was trained at 720×1280

Correct Loading Approach (Kilonova / Memory-Constrained GPUs)

The GPTQ repo contains transformer weights only (packed shards + config). The pipeline components (VAE, tokenizer, scheduler) must come from a full cached pipeline. Use this pattern:

from diffusers import Cosmos3OmniPipeline
import safetensors.torch as st
import torch, torch.nn as nn, torch.nn.functional as F
from pathlib import Path
from huggingface_hub import snapshot_download

GROUP_SIZE = 128
DTYPE = torch.bfloat16

class QuantizedLinear(nn.Module):
    """DuoNeural int32 nibble-pack format (8 int4 per int32, LSB first)."""
    def __init__(self, qweight, scales, zeros, bias, out_f, in_f):
        super().__init__()
        self.register_buffer('qweight', qweight)  # (out, in//8) int32
        self.register_buffer('scales', scales)    # (out, n_groups) float16
        self.register_buffer('zeros', zeros)      # (out, n_groups) int8
        self.out_features, self.in_features = out_f, in_f
        self.bias = None if bias is None else nn.Parameter(bias.clone())

    def dequantize(self):
        qw = self.qweight
        out_f, packed_in = qw.shape
        in_f = packed_in * 8
        n_groups = in_f // GROUP_SIZE
        shifts = torch.arange(0, 32, 4, device=qw.device, dtype=torch.int32)
        nibbles = (qw.unsqueeze(-1) >> shifts.view(1, 1, 8)) & 0xF
        w = nibbles.reshape(out_f, n_groups, GROUP_SIZE).float()
        w = ((w - self.zeros.float().unsqueeze(-1)) * self.scales.float().unsqueeze(-1))
        return w.reshape(out_f, in_f).to(DTYPE)

    def forward(self, x):
        w = self.dequantize()
        out = F.linear(x, w.to(x.dtype), self.bias)
        del w
        return out

# 1. Download GPTQ packed shards
gptq_dir = snapshot_download(
    "DuoNeural/Cosmos3-Nano-GPTQ-4bit",
    cache_dir="/path/to/cache",  # must be on disk with ~6GB free, NOT tmpfs
)

# 2. Load base pipeline (needs a full cached pipeline, e.g. Cosmos3-Nano-Abliterated)
pipe = Cosmos3OmniPipeline.from_pretrained(
    "DuoNeural/Cosmos3-Nano-Abliterated",   # or any full Cosmos3 pipeline
    torch_dtype=DTYPE, device_map="cpu",
    low_cpu_mem_usage=True, local_files_only=True,
    enable_safety_checker=False,
)

# 3. Load packed shards and patch transformer
packed_state = {}
for pf in sorted(Path(gptq_dir).glob("model-*-packed.safetensors")):
    packed_state.update(st.load_file(str(pf), device="cpu"))

def patch_gptq(model, state):
    params = {}
    for k, v in state.items():
        for suffix in ['.weight.qweight', '.weight.scales', '.weight.zeros']:
            if k.endswith(suffix):
                path = k[:-len(suffix)]
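                # str.lstrip strips a character set, not a prefix; take the last dotted part instead
                field = suffix.rsplit('.', 1)[-1]  # 'qweight', 'scales', or 'zeros'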
                params.setdefault(path, {})[field] = v
    
    def get_nested(m, parts):
        for p in parts: m = getattr(m, p)
        return m
    def set_nested(m, parts, v):
        for p in parts[:-1]: m = getattr(m, p)
        setattr(m, parts[-1], v)
    
    patched = 0
    for path, p in params.items():
        if len(p) < 3: continue
        parts = path.split('.')
        try: orig = get_nested(model, parts)
        except AttributeError: continue
        if not isinstance(orig, nn.Linear): continue
        qw = p['qweight']
        ql = QuantizedLinear(
            qweight=qw.to('cuda').to(torch.int32),
            scales=p['scales'].to('cuda').to(torch.float16),
            zeros=p['zeros'].to('cuda').to(torch.int8),
            bias=orig.bias, out_f=qw.shape[0], in_f=qw.shape[1]*8,
        )
        set_nested(model, parts, ql)
        patched += 1
    return model, patched

_, n = patch_gptq(pipe.transformer, packed_state)
print(f"Patched {n} linear layers")
del packed_state

# 4. Move to GPU
pipe = pipe.to("cuda")  # 14.4 GB VRAM for full pipeline
# OR: pipe.enable_model_cpu_offload()  # slower but handles any resolution

output = pipe(
    prompt="A robotic arm picks red blocks.",
    num_frames=8, height=128, width=128,
    num_inference_steps=20, guidance_scale=7.0,
)
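
The arithmetic inside dequantize() above can be followed on a handful of values. This is a minimal pure-Python sketch with hypothetical helper names and made-up codes, scale, and zero point; it only restates the per-group formula w = (q - zero) * scale and the mapping from input column to group.

```python
GROUP_SIZE = 128

def group_of_column(col, group_size=GROUP_SIZE):
    """Index of the (scale, zero) pair that applies to a given input column."""
    return col // group_size

def dequantize_group(codes, scale, zero):
    """Map 4-bit codes back to weights: w = (q - zero) * scale."""
    return [(q - zero) * scale for q in codes]

# Illustrative values only: code 8 is the zero point, so it maps to 0.0.
weights = dequantize_group([0, 8, 15], scale=0.5, zero=8)
print(weights)
print(group_of_column(127), group_of_column(128))
```

Because the zero point is an integer in [0, 15], the representable weights in each group form a uniform grid of 16 values spaced by that group's scale.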

Important: State Dict Key Format

Keys use .weight.qweight not .qweight:

layers.0.mlp.down_proj.weight.qweight   # ← strip '.weight.qweight' for module path
layers.0.mlp.down_proj.weight.scales
layers.0.mlp.down_proj.weight.zeros
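
A minimal sketch of splitting such a key into module path and field follows. The helper name is hypothetical; it removes the exact suffix rather than stripping characters, so module names that happen to start with letters from "weight" are not damaged.

```python
PACKED_FIELDS = ("qweight", "scales", "zeros")

def split_packed_key(key):
    """Return (module_path, field) for a packed key, or None for any other key."""
    for field in PACKED_FIELDS:
        suffix = ".weight." + field
        if key.endswith(suffix):
            return key[: -len(suffix)], field
    return None

print(split_packed_key("layers.0.mlp.down_proj.weight.qweight"))
print(split_packed_key("layers.0.mlp.down_proj.bias"))
```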

Launch Flags (AMD ROCm)

HSA_ENABLE_SDMA=0 GPU_MAX_HW_QUEUES=1 TORCHDYNAMO_DISABLE=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 your_script.py
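
If setting the flags on the command line is inconvenient, one option is to set them at the top of the script. This is a sketch under the assumption that the variables only need to be present before torch is imported; the helper name is hypothetical, and values already exported in the shell are left untouched.

```python
import os

ROCM_FLAGS = {
    "HSA_ENABLE_SDMA": "0",
    "GPU_MAX_HW_QUEUES": "1",
    "TORCHDYNAMO_DISABLE": "1",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
}

def apply_rocm_flags(env=os.environ):
    """Set each flag unless the caller's environment already defines it."""
    for name, value in ROCM_FLAGS.items():
        env.setdefault(name, value)
    return env

apply_rocm_flags()
# import torch  # import only after the flags are in place
```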