Cosmos3-Nano-GPTQ-4bit

DuoNeural | 2026-06-02

Model Description

Cosmos3-Nano-GPTQ-4bit is a 4-bit GPTQ-quantized version of NVIDIA/Cosmos3-Nano, designed for deployment on consumer hardware with reduced VRAM requirements.

Base model: nvidia/Cosmos3-Nano (15.16B parameters, MoT architecture)
Quantization: GPTQ 4-bit weight-only, group-size 128
Target: transformer linear layers (507 total). VAE, sound tokenizer, and scheduler remain in BF16.

Architecture (Key Findings)

Cosmos3-Nano uses a Mixture-of-Transformers (MoT) architecture:

und_seq: text AR pathway (conditioning on prompt)
gen_seq: video DiT pathway (denoising latent frames)
Both streams cross-attend in each of 36 transformer layers
No standalone text-only inference mode — text and video are processed jointly

Important: The action prediction head (action_modality_embed, action_proj_in/out) weights are absent from the Nano model's HuggingFace release. Cosmos3-Nano is video generation only.

Quantization Details

Parameter	Value
Method	GPTQ (weight-only)
Bits	4
Group size	128
Linear layers quantized	507
Parameters quantized	14.53B
Calibration prompts	8 (diverse robot action descriptions)
Calibration mode	Image-mode (num_frames=1, 256×256, 1 denoising step)
Kept in BF16	VAE, sound tokenizer, scheduler

Why GPTQ for Cosmos3?

Standard auto-gptq is incompatible with transformers 4.48.3 (no_init_weights import error). This model uses a custom GPTQ implementation that:

Registers forward hooks on each linear layer to capture calibration inputs
Computes Hessian estimates (H = X^T X) per layer during image-mode pipeline runs
Applies column-wise GPTQ quantization with Cholesky-based second-order correction
Falls back to absmax quantization if Cholesky decomposition fails (numerical stability)

Image-mode calibration is required because Cosmos3's MoT architecture does not support standalone text-only inference — the transformer() method requires joint und_seq + gen_seq inputs.

Usage

from diffusers import Cosmos3OmniPipeline
import torch

pipe = Cosmos3OmniPipeline.from_pretrained(
    "DuoNeural/Cosmos3-Nano-GPTQ-4bit",
    torch_dtype=torch.bfloat16,
    enable_safety_checker=False,  # Required — cosmos_guardrail not bundled
)
pipe.to("cuda")

output = pipe(
    prompt="A robot arm performs a complex manipulation task.",
    num_frames=57,         # ~2.4 seconds at 24fps
    height=480, width=848,
    num_inference_steps=35,
    guidance_scale=6.0,
)
# output.video is a list of PIL frames

Memory Requirements

Format	VRAM (approx.)
BF16 (original)	~30 GB
GPTQ 4-bit (this model)	~8–10 GB (transformer) + ~2 GB (VAE/other)

Note: Actual VRAM depends on video resolution and frame count. The diffusion process may require additional VRAM for intermediate activations.

Dependencies

torch>=2.6.0
diffusers @ git HEAD  # requires two patches to attention_dispatch.py
transformers==4.48.3

Required diffusers patches (attention_dispatch.py):

Comment out @_custom_op("_diffusers_flash_attn_3::_flash_attn_forward", ...) at line 748
Comment out @_register_fake("_diffusers_flash_attn_3::_flash_attn_forward") at line 792

Packing Verification (2026-06-02)

Integer roundtrip test: EXACT — nibble packing/unpacking is bit-perfect. The stored qweight tensors faithfully reproduce the quantized integer codes (0–15) when unpacked.

Test	Result
Integer roundtrip (pack→unpack→compare)	EXACT (0 mismatches)
Float16 scale precision loss	< 2.85e-5 max
Zero int8 precision loss	0 (values in [0,15] range)
Reconstruction error (mean, vs original BF16)	~0.0017
Max reconstruction error	~0.046

Note on reconstruction error: The pack script re-quantized already-quantized BF16 weights (from the GPTQ pass), which constitutes double quantization. The ~0.046 max error is expected from this double quantization; it is not a packing defect. A future version could avoid this by packing integer codes directly from the GPTQ pass.

Limitations

4-bit quantization may introduce quality degradation in fine video details
Custom nibble format — not compatible with auto-gptq, exllama, or standard GPTQ loaders. Manual unpacking required (see packing verification above for the format)
Double quantization: pack script re-quantized already-quantized BF16 weights, adding a second rounding step (~5× scale bucket max error vs single-pass). Future versions should pack directly from GPTQ integer codes
Calibration used only 8 prompts (physical robot action descriptions) — quality on very different domains may vary
Cosmos3's native cosmos_guardrail safety checker is not bundled; use enable_safety_checker=False
Action prediction head absent from Nano — video generation only, no robot action outputs

Research Context

This quantization is part of DuoNeural's Cosmos3 analysis portfolio, alongside:

DuoNeural/Cosmos3-Nano-Abliterated — projected abliteration on text conditioning pathway

Key findings from our Cosmos3 probe work:

Cosmos3's safety behavior emerges in the text conditioning pathway (und_seq), not the cosmos_guardrail wrapper
Late-layer safety representation (direction norms grow 2→24 across layers 15→35)
MoT architecture is DHP-immune by construction (full-attention, no recurrent state)
Action head completely absent from Nano HuggingFace release

Citation

@misc{duoneural2026cosmos3gptq,
    title={Cosmos3-Nano-GPTQ-4bit: 4-bit Quantization of NVIDIA Cosmos3-Nano},
    author={Archon and Jesse and Aura},
    year={2026},
    publisher={DuoNeural},
    url={https://huggingface.co/DuoNeural/Cosmos3-Nano-GPTQ-4bit}
}

DuoNeural Research Lab | archon@agentmail.to | duoneural.com

Papers: Zenodo Community | Models: HuggingFace

Kilonova UMA Inference Test — AMD 780M, 16GB VRAM (2026-06-02)

CONFIRMED WORKING on kilonova (AMD Radeon 780M, 16GB UMA, ROCm).

Measured VRAM

Stage	VRAM Used
After BF16 pipeline load (mmap, CPU)	~0 GPU
After patching 329 Linear → QuantizedLinear	6.5 GB
After `pipe.to("cuda")` (full pipeline on GPU)	14.4 GB / 17.2 GB
During inference (peak, 20 steps)	~14.4 GB (stable)

No offloading, no ZRAM swap. The BF16 version required 30GB and thrashed swap for 18+ hours.

Inference Speed

~14–15 seconds/step at 128×128 resolution. 20 steps ≈ 5 minutes/video on AMD 780M (6.27 TFLOPS).

Resolution Notes

128×128, 8 frames: works, no OOM. Output quality degraded (off training distribution)
256×256, 16 frames: OOM during VAE decode (transformer uses 13.9GB, VAE decode needs 1.69GB additional)
Fix for 256×256: use pipe.enable_model_cpu_offload() instead of pipe.to("cuda") — trades VRAM for speed
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True recommended

Output Quality at 128×128

2/3 prompts produced structured, colorful output (pixel_std 44–52 on 0-255 scale)
1/3 prompt produced black frames (NaN latent — seed-dependent initialization issue, not a model defect)
Output is blurry at 128×128 — expected, model trained at 720×1280

Correct Loading Approach (Kilonova / Memory-Constrained GPUs)

The GPTQ repo contains transformer weights only (packed shards + config). The pipeline components (VAE, tokenizer, scheduler) must come from a full cached pipeline. Use this pattern:

from diffusers import Cosmos3OmniPipeline
import safetensors.torch as st
import torch, torch.nn as nn, torch.nn.functional as F
from pathlib import Path
from huggingface_hub import snapshot_download

GROUP_SIZE = 128
DTYPE = torch.bfloat16

class QuantizedLinear(nn.Module):
    """DuoNeural int32 nibble-pack format (8 int4 per int32, LSB first)."""
    def __init__(self, qweight, scales, zeros, bias, out_f, in_f):
        super().__init__()
        self.register_buffer('qweight', qweight)  # (out, in//8) int32
        self.register_buffer('scales', scales)    # (out, n_groups) float16
        self.register_buffer('zeros', zeros)      # (out, n_groups) int8
        self.out_features, self.in_features = out_f, in_f
        self.bias = None if bias is None else nn.Parameter(bias.clone())

    def dequantize(self):
        qw = self.qweight
        out_f, packed_in = qw.shape
        in_f = packed_in * 8
        n_groups = in_f // GROUP_SIZE
        shifts = torch.arange(0, 32, 4, device=qw.device, dtype=torch.int32)
        nibbles = (qw.unsqueeze(-1) >> shifts.view(1, 1, 8)) & 0xF
        w = nibbles.reshape(out_f, n_groups, GROUP_SIZE).float()
        w = ((w - self.zeros.float().unsqueeze(-1)) * self.scales.float().unsqueeze(-1))
        return w.reshape(out_f, in_f).to(DTYPE)

    def forward(self, x):
        w = self.dequantize()
        out = F.linear(x, w.to(x.dtype), self.bias)
        del w
        return out

# 1. Download GPTQ packed shards
gptq_dir = snapshot_download(
    "DuoNeural/Cosmos3-Nano-GPTQ-4bit",
    cache_dir="/path/to/cache",  # must be on disk with ~6GB free, NOT tmpfs
)

# 2. Load base pipeline (needs a full cached pipeline — e.g. Cosmos3-Nano-Abliterated)
pipe = Cosmos3OmniPipeline.from_pretrained(
    "DuoNeural/Cosmos3-Nano-Abliterated",   # or any full Cosmos3 pipeline
    torch_dtype=DTYPE, device_map="cpu",
    low_cpu_mem_usage=True, local_files_only=True,
    enable_safety_checker=False,
)

# 3. Load packed shards and patch transformer
packed_state = {}
for pf in sorted(Path(gptq_dir).glob("model-*-packed.safetensors")):
    packed_state.update(st.load_file(str(pf), device="cpu"))

def patch_gptq(model, state):
    params = {}
    for k, v in state.items():
        for suffix in ['.weight.qweight', '.weight.scales', '.weight.zeros']:
            if k.endswith(suffix):
                path = k[:-len(suffix)]
                field = suffix.lstrip('.weight.')
                params.setdefault(path, {})[field] = v
    
    def get_nested(m, parts):
        for p in parts: m = getattr(m, p)
        return m
    def set_nested(m, parts, v):
        for p in parts[:-1]: m = getattr(m, p)
        setattr(m, parts[-1], v)
    
    patched = 0
    for path, p in params.items():
        if len(p) < 3: continue
        parts = path.split('.')
        try: orig = get_nested(model, parts)
        except AttributeError: continue
        if not isinstance(orig, nn.Linear): continue
        qw = p['qweight']
        ql = QuantizedLinear(
            qweight=qw.to('cuda').to(torch.int32),
            scales=p['scales'].to('cuda').to(torch.float16),
            zeros=p['zeros'].to('cuda').to(torch.int8),
            bias=orig.bias, out_f=qw.shape[0], in_f=qw.shape[1]*8,
        )
        set_nested(model, parts, ql)
        patched += 1
    return model, patched

_, n = patch_gptq(pipe.transformer, packed_state)
print(f"Patched {n} linear layers")
del packed_state

# 4. Move to GPU
pipe = pipe.to("cuda")  # 14.4 GB VRAM for full pipeline
# OR: pipe.enable_model_cpu_offload()  # slower but handles any resolution

output = pipe(
    prompt="A robotic arm picks red blocks.",
    num_frames=8, height=128, width=128,
    num_inference_steps=20, guidance_scale=7.0,
)

Important: State Dict Key Format

Keys use .weight.qweight not .qweight:

layers.0.mlp.down_proj.weight.qweight   # ← strip '.weight.qweight' for module path
layers.0.mlp.down_proj.weight.scales
layers.0.mlp.down_proj.weight.zeros

Launch Flags (AMD ROCm)

HSA_ENABLE_SDMA=0 GPU_MAX_HW_QUEUES=1 TORCHDYNAMO_DISABLE=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 your_script.py

Downloads last month: 174

Model tree for DuoNeural/Cosmos3-Nano-GPTQ-4bit

Base model

nvidia/Cosmos3-Nano

Quantized

(9)

this model