FLUX.2 Klein 9B SDNQ UINT4 Static

Static UINT4 SDNQ quantization of black-forest-labs/FLUX.2-klein-9B.

This checkpoint was selected as the practical deployment-oriented variant: it was the fastest option in the A40 benchmark and used substantially less VRAM than the original BF16 pipeline, while visual quality differences were minor in the prompt-following stress comparison.

Related checkpoint: for a quality-oriented dynamic SVD alternative with a modest latency and VRAM tradeoff, see WaveCut/FLUX.2-klein-9B-SDNQ-float4_e4m0fnu-dynamic-th0p01-svd-r128-s32.

Full resolution comparison canvas

The image above is a compressed WebP version of a 1:1 comparison canvas. It compares the original FLUX.2 Klein 9B, the previous SDNQ baseline, this uint4-static checkpoint, and a quality-oriented dynamic SVD candidate across text-heavy prompts, including an additional Russian-only chalkboard prompt.

Why This Variant

We compared a broad set of SDNQ 4-bit recipes across speed, VRAM, and visual quality. This uint4-static recipe was chosen because it offers the best deployment tradeoff:

  • Lowest latency among the final candidates in the single-process benchmark.
  • Low runtime VRAM in a 1024x1024, 4-step image-generation pipeline.
  • Much smaller full-pipeline checkpoint footprint than the original BF16 FLUX.2 Klein 9B checkpoint in the measured setup.
  • Visual differences versus the baseline and the original model were small in the stress set, including long text, signs, labels, small details, and a Russian chalkboard prompt.

Benchmark Setup

Measurements below use a single NVIDIA A40 test host and a consistent Flux2KleinPipeline inference harness.

  • GPU: NVIDIA A40 46 GB
  • Resolution: 1024x1024
  • Steps: 4
  • Guidance scale: 0.0
  • Torch dtype: bfloat16
  • Quantized matmul: enabled for SDNQ inference comparisons
  • Batch/concurrency: single process

These are deployment-oriented measurements for one hardware/software setup.
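The "warm avg" figures in the tables below follow the usual repeat-and-discard-warmup pattern. A minimal, GPU-free sketch of that measurement loop (the helper name `warm_average` and the repeat counts are illustrative, not the actual harness):

```python
import time


def warm_average(fn, warmup=1, repeats=3):
    """Run fn a few times, discard the warmup runs, and average the rest.

    For CUDA workloads you would additionally call torch.cuda.synchronize()
    around each timed run and read torch.cuda.max_memory_allocated() for the
    VRAM peak; this sketch keeps only the timing logic.
    """
    for _ in range(warmup):
        fn()  # warmup runs populate caches / trigger kernel compilation
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Usage would be along the lines of `warm_average(lambda: pipe(prompt=prompt, num_inference_steps=4))`.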

Candidate Benchmark

Single-process inference metrics for the final candidate set:

| Variant | Warm avg | GPU peak | CUDA allocated |
|---|---|---|---|
| uint4-static | 3.826 s | 14.8 GB | 14.1 GB |
| int4-dynamic-th0p1-svd-r16-s32-g128 | 4.020 s | 14.3 GB | 13.5 GB |
| uint4-static-svd-r32-s32 | 4.070 s | 14.7 GB | 13.9 GB |
| float4_e4m0fnu-dynamic-th0p1-svd-r16-s32 | 4.116 s | 16.0 GB | 15.3 GB |
| float4_e4m0fnu-dynamic-th0p01-svd-r128-s32 | 4.185 s | 17.2 GB | 16.5 GB |

Stress Comparison

This stress set contains 9 prompts with signs, chalkboards, posters, labels, timetables, small props, and a Russian-only chalkboard prompt. Each row was run twice; the table reports the warm run average.

| Model | Warm avg | GPU peak | CUDA allocated | Prompt count |
|---|---|---|---|---|
| Original FLUX.2-klein-9B BF16 pipeline | 4.244 s | 36.3 GB | 35.6 GB | 9 |
| Previous SDNQ baseline | 4.079 s | 15.2 GB | 14.5 GB | 9 |
| This uint4-static checkpoint | 3.866 s | 14.8 GB | 14.1 GB | 9 |
| Dynamic SVD r128 quality candidate | 4.182 s | 17.2 GB | 16.5 GB | 9 |

The model-card image is a WebP copy optimized from the full-resolution comparison canvas:

| WebP quality | Size | RGB PSNR | Luma SSIM-like score |
|---|---|---|---|
| 85 | 5.72 MB | 46.93 dB | 0.999977 |

The source JPEG canvas was about 13 MB; this WebP version is smaller while remaining visually close to the original artifact.
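The RGB PSNR figure above can be reproduced by comparing the decoded JPEG source and WebP copy pixel-for-pixel. A minimal NumPy sketch of the metric (the function name `rgb_psnr` is illustrative; the actual measurement tooling is not published here):

```python
import numpy as np


def rgb_psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio over all RGB channels, in dB."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mse = np.mean((a - b) ** 2)  # mean squared error over every pixel/channel
    if mse == 0:
        return float("inf")  # identical images
    return 20.0 * np.log10(max_val) - 10.0 * np.log10(mse)
```

With Pillow this would be called as `rgb_psnr(np.asarray(Image.open("canvas.jpg").convert("RGB")), np.asarray(Image.open("canvas.webp").convert("RGB")))`.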

Model Size

Approximate full-pipeline folder sizes in the measured setup:

| Checkpoint | Folder size |
|---|---|
| Original black-forest-labs/FLUX.2-klein-9B | 52.9 GB |
| Previous SDNQ baseline | 12.6 GB |
| This uint4-static checkpoint | 12.2 GB |
| Dynamic SVD r128 candidate | 14.7 GB |
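Folder sizes like those above can be measured by walking the checkpoint directory and summing file sizes. A minimal sketch (the helper name `folder_size_gb` is illustrative):

```python
import os


def folder_size_gb(path: str) -> float:
    """Total size of all regular files under path, in decimal GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # skip symlinks (e.g. HF cache blob links)
                total += os.path.getsize(fp)
    return total / 1e9
```

Note that a Hugging Face cache directory stores snapshots as symlinks into a blob store, so following symlinks there would double-count weights.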

Usage

Install current Diffusers and SDNQ:

```shell
pip install git+https://github.com/huggingface/diffusers.git
pip install sdnq
```

Run with Flux2KleinPipeline:

```python
import torch
from diffusers import Flux2KleinPipeline
from sdnq import SDNQConfig  # registers SDNQ support in diffusers/transformers
from sdnq.common import use_torch_compile as triton_is_available
from sdnq.loader import apply_sdnq_options_to_model

repo_id = "WaveCut/FLUX.2-klein-9B-SDNQ-uint4-static"
device = "cuda"

pipe = Flux2KleinPipeline.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
)

if triton_is_available and torch.cuda.is_available():
    pipe.transformer = apply_sdnq_options_to_model(
        pipe.transformer,
        use_quantized_matmul=True,
    )
    pipe.text_encoder = apply_sdnq_options_to_model(
        pipe.text_encoder,
        use_quantized_matmul=True,
    )

pipe.to(device)

prompt = "A clean editorial poster with large readable text: OPEN SOURCE IMAGE MODEL"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]

image.save("flux2-klein-sdnq-uint4-static.png")
```

The same pipeline also supports image editing:

```python
from diffusers.utils import load_image

input_image = load_image("input.png")
image = pipe(
    image=input_image,
    prompt="Turn the handwritten sign into a clean printed sign while preserving the scene",
    height=1024,
    width=1024,
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.Generator(device=device).manual_seed(1),
).images[0]
image.save("flux2-klein-sdnq-uint4-static-edit.png")
```

If your GPU has less VRAM, replace `pipe.to(device)` with `pipe.enable_model_cpu_offload()`.

Quantization Recipe

This checkpoint was produced with SDNQ post-load quantization over the transformer and text_encoder components of FLUX.2 Klein 9B.

Recipe:

```python
variant = {
    "weights_dtype": "uint4",
    "use_dynamic_quantization": False,
    "dynamic_loss_threshold": None,
    "use_svd": False,
    "svd_rank": 32,   # unused because use_svd is False
    "svd_steps": 8,   # unused because use_svd is False
    "group_size": 0,
    "dequantize_fp32": False,
    "quantized_matmul_dtype": None,
    "use_quantized_matmul": False,
    "use_stochastic_rounding": False,
}
```

Minimal quantization sketch:

```python
import torch
from diffusers import Flux2KleinPipeline
from sdnq import sdnq_post_load_quant
from sdnq.loader import save_sdnq_model

base_model = "black-forest-labs/FLUX.2-klein-9B"
pipe = Flux2KleinPipeline.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
)

common_kwargs = dict(
    weights_dtype="uint4",
    torch_dtype=torch.bfloat16,
    group_size=0,
    svd_rank=32,
    svd_steps=8,
    dynamic_loss_threshold=None,
    use_svd=False,
    quant_conv=False,
    quant_embedding=False,
    use_quantized_matmul=False,
    use_quantized_matmul_conv=False,
    use_dynamic_quantization=False,
    use_stochastic_rounding=False,
    dequantize_fp32=False,
    non_blocking=True,
    add_skip_keys=True,
    quantization_device="cuda",
    return_device="cuda",
)

pipe.transformer = sdnq_post_load_quant(pipe.transformer, **common_kwargs)
pipe.text_encoder = sdnq_post_load_quant(pipe.text_encoder, **common_kwargs)

save_sdnq_model(
    pipe,
    "FLUX.2-klein-9B-SDNQ-uint4-static",
    max_shard_size="5GB",
    is_pipeline=True,
)
```

Limitations

  • This is a quantized derivative of FLUX.2 Klein 9B; it inherits the base model's limitations and acceptable-use requirements.
  • Text rendering can still be inaccurate, especially for long strings or small background text.
  • The quality comparison here is visual prompt-following evaluation, not a large-scale human preference or FID benchmark.
  • Benchmarks were run on an A40 test host and should be validated again for your exact serving stack.

License

This model is a quantized derivative of black-forest-labs/FLUX.2-klein-9B and follows the FLUX Non-Commercial License. Please review LICENSE.md and the Black Forest Labs acceptable-use policy before use.
