Krea 2 Turbo SDNQ UINT4

SDNQ UINT4 quantization of krea/Krea-2-Turbo, packaged for use with the Diffusers Krea2Pipeline.

What Is Quantized

Selected recipe: uint4-static-transformer-only.

Quantized components: transformer. Tokenizer, scheduler, and non-selected pipeline components are copied from the original Diffusers pipeline.

The initial smoke sweep also tried SDNQ packing for the text encoder, but standard Diffusers/Transformers loading rejected the packed Qwen3VLModel text-encoder weights. This release therefore keeps the text encoder loadable in bf16 and quantizes the Krea transformer only.

Benchmark Setup

Pipeline: Krea2Pipeline
Resolution: 2048x2048
Steps: 8
Guidance scale: 0.0
Seed base: 61000
Distilled mode: true
Torch dtype: bfloat16
Attention backend: chunked-native query attention, chunk_size=1024
Prompt set: 10 prompts covering simple scenes, public-domain style stress tests, tricky composition, long Latin text, long Cyrillic text, and mixed Latin/Cyrillic diagrams
Hardware: NVIDIA RTX PRO 6000 Blackwell Server Edition on a disposable RunPod pod with local container disk
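
The attention setting above (chunked query attention, chunk_size=1024) can be illustrated with a minimal sketch. This is not the benchmark's actual backend, which is not shown in this card; the function name chunked_query_attention and the tensor layout are assumptions made for illustration. It relies only on torch.nn.functional.scaled_dot_product_attention from PyTorch 2.x.

```python
import torch
import torch.nn.functional as F


def chunked_query_attention(q, k, v, chunk_size=1024):
    """Non-causal attention computed one query chunk at a time.

    q, k, v are assumed to be shaped (batch, heads, seq_len, head_dim).
    Every query chunk attends to the full set of keys and values, so the
    result matches unchunked attention; only the number of query rows
    processed per call is bounded by chunk_size.
    """
    outputs = []
    for start in range(0, q.shape[-2], chunk_size):
        q_chunk = q[..., start:start + chunk_size, :]
        outputs.append(F.scaled_dot_product_attention(q_chunk, k, v))
    # Re-assemble the chunks along the sequence dimension.
    return torch.cat(outputs, dim=-2)


if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(1, 2, 10, 8) for _ in range(3))
    out = chunked_query_attention(q, k, v, chunk_size=4)
    print(out.shape)
```

Because softmax is applied per query row, splitting along the query axis leaves the output unchanged while capping the size of the intermediate score matrix in a non-fused attention path.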

Benchmark Summary

Model	Load	First gen	Hot mean	Hot max	Load GPU peak	Gen GPU peak	Torch peak
original	7.019 s	91.147 s	90.487 s	90.712 s	33553 MB	52313 MB	50135.29 MB
uint4-static-transformer-only	4.792 s	90.243 s	88.561 s	88.776 s	16189 MB	34865 MB	32690.51 MB

Storage size of this release directory: 15.46 GB. Quantized local checkpoint size before packaging: 15.36 GB.

Raw per-prompt metrics are available in benchmark/*.csv and benchmark/*.jsonl. The combined benchmark summary is in benchmark/summary.json.
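
The relative differences between the two rows of the summary table can be worked out with a few lines of Python. The numbers below are copied from the table; the dictionary keys and the helper name relative_reduction are illustrative and do not correspond to fields in benchmark/summary.json.

```python
# Values copied from the benchmark summary table (seconds and MB).
ORIGINAL = {
    "load_s": 7.019,
    "hot_mean_s": 90.487,
    "load_gpu_peak_mb": 33553,
    "gen_gpu_peak_mb": 52313,
}
QUANTIZED = {
    "load_s": 4.792,
    "hot_mean_s": 88.561,
    "load_gpu_peak_mb": 16189,
    "gen_gpu_peak_mb": 34865,
}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


REDUCTIONS = {
    key: relative_reduction(ORIGINAL[key], QUANTIZED[key]) for key in ORIGINAL
}

if __name__ == "__main__":
    for key, pct in REDUCTIONS.items():
        print(f"{key}: {pct:.1f}% lower than original")
```

On these figures the quantized checkpoint lowers the GPU peak by roughly 52% at load time and 33% during generation, while the hot mean generation time changes by only about 2%.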

Usage

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors huggingface_hub sdnq

import torch
from diffusers import Krea2Pipeline
from sdnq.loader import apply_sdnq_options_to_model

repo_id = "WaveCut/Krea-2-Turbo-SDNQ-uint4"
device = "cuda"

pipe = Krea2Pipeline.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    is_distilled=True,
)

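# Apply SDNQ runtime options to each quantized component
# (only the transformer in this release).
for name in ["transformer"]:
    module = getattr(pipe, name, None)
    if module is not None:
        setattr(
            pipe,
            name,
            # Enable SDNQ's quantized matmul path and keep bf16 as the compute dtype.
            apply_sdnq_options_to_model(module, dtype=torch.bfloat16, use_quantized_matmul=True),
        )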

pipe.to(device)
image = pipe(
    prompt="A clean technical poster with readable labels",
    height=2048,
    width=2048,
    num_inference_steps=8,
    guidance_scale=0.0,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("krea2-sdnq.png")

Quantization Recipe

{
  "dynamic_loss_threshold": null,
  "modules": [
    "transformer"
  ],
  "name": "uint4-static-transformer-only",
  "quant_conv": false,
  "quant_embedding": false,
  "svd_rank": 32,
  "svd_steps": 32,
  "use_dynamic_quantization": false,
  "use_svd": false,
  "weights_dtype": "uint4"
}

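The checkpoint was produced by loading the original Diffusers pipeline, applying sdnq_post_load_quant only to the listed pipeline components, and saving with save_sdnq_model(..., is_pipeline=True). With use_svd set to false, the svd_rank and svd_steps entries should have no effect on this checkpoint.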
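
The following sketch shows one way the recipe above could be split into the list of components to quantize and the options passed to the quantizer. The helper split_recipe and the METADATA_KEYS set are hypothetical. The assumption that every remaining key can be forwarded unchanged to sdnq_post_load_quant has not been checked against the script used for this release, so the call itself is left as a comment.

```python
import json

# Recipe copied from the Quantization Recipe section.
RECIPE_JSON = """
{
  "dynamic_loss_threshold": null,
  "modules": ["transformer"],
  "name": "uint4-static-transformer-only",
  "quant_conv": false,
  "quant_embedding": false,
  "svd_rank": 32,
  "svd_steps": 32,
  "use_dynamic_quantization": false,
  "use_svd": false,
  "weights_dtype": "uint4"
}
"""

# Keys that describe the recipe itself rather than quantizer options (assumption).
METADATA_KEYS = {"name", "modules"}


def split_recipe(recipe):
    """Return (component names, quantizer keyword arguments)."""
    modules = list(recipe["modules"])
    kwargs = {k: v for k, v in recipe.items() if k not in METADATA_KEYS}
    return modules, kwargs


recipe = json.loads(RECIPE_JSON)
modules, quant_kwargs = split_recipe(recipe)

# Hypothetical application step; it needs the full pipeline loaded as `pipe`:
# from sdnq import sdnq_post_load_quant
# for name in modules:
#     setattr(pipe, name, sdnq_post_load_quant(getattr(pipe, name), **quant_kwargs))

if __name__ == "__main__":
    print(modules)
    print(quant_kwargs)
```

Keeping the component list separate from the quantizer options makes it straightforward to reuse the same options for a different set of pipeline components.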

Limitations

This is a quantized derivative and inherits the base model behavior, limits, and license terms.
The comparison set is a deployment smoke benchmark, not a preference study or FID evaluation.
Long text, small labels, and mixed Cyrillic/Latin diagrams should be inspected manually before production use.
Benchmark numbers depend on GPU, driver, PyTorch, Diffusers, SDNQ, and CUDA versions.

License

This repository contains a quantized derivative of krea/Krea-2-Turbo. Upstream license material copied during packaging: LICENSE.pdf. Review the upstream Krea model card and license before use or redistribution.
