# FLUX.2 [klein] 9B (step-distilled) – INT8 (W8A8) Transformer
Quantized transformer checkpoint for FLUX.2 [klein] 9B (step-distilled).
INT8 weight and activation quantization via NVIDIA ModelOpt with calibrated input scales.
Note: This repo contains only the quantized transformer weights. The text encoder, VAE, tokenizer, and scheduler are loaded from the base model:
black-forest-labs/FLUX.2-klein-9B.
## Model Details
| Property | Value |
|---|---|
| Base Model | black-forest-labs/FLUX.2-klein-9B |
| Parameters | 9B |
| Quantization Format | INT8 (W8A8) |
| Quantization Type | Weight + Activation (W8A8) |
| Compression | ~2x vs BF16 |
| Weight dtype | int8 |
| Scale dtype | float32 |
| Key format | Single-file safetensors |
| Checkpoint | flux-2-klein-9b-int8.safetensors |
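The ~2x compression figure follows directly from the dtypes above: BF16 stores 2 bytes per weight, INT8 stores 1, and the float32 scales add well under 1% overhead. A back-of-envelope check (the flat 9e9 parameter count is a round-number assumption for illustration):

```python
# Back-of-envelope checkpoint sizes; 9e9 parameters is an illustrative
# round number, and per-channel float32 scale overhead is ignored (<1%).
params = 9e9
bf16_bytes = params * 2  # 2 bytes per BF16 weight
int8_bytes = params * 1  # 1 byte per INT8 weight
ratio = bf16_bytes / int8_bytes
print(f"BF16 ~{bf16_bytes / 1e9:.0f} GB, INT8 ~{int8_bytes / 1e9:.0f} GB ({ratio:.0f}x smaller)")
```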
## Quantization Details
| Property | Value |
|---|---|
| Framework | NVIDIA TensorRT Model Optimizer (ModelOpt) |
| Calibration Method | NVIDIA ModelOpt max (per-channel max abs) |
| Calibration Dataset | 256 samples drawn from a pool of 768 diverse prompts (256 T2I, 256 editing, 256 composition) |
| Denoising Steps (calibration) | 4 per sample |
| Weight Quantization | Per-channel symmetric (axis=0) |
| Activation Quantization | Per-tensor via baked input_scale / weight_scale tensors |
| Preserved Layers | Embedder layers (time_embed, context_embedder, x_embedder) and output projection kept in BF16 |
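The weight path described above (per-channel symmetric quantization along axis 0 with max-abs calibration) can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the technique, not ModelOpt's actual implementation:

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-channel INT8 quantization along axis 0, with max-abs
    ("max") calibration: an illustrative sketch, not ModelOpt's code."""
    amax = np.abs(w).max(axis=1, keepdims=True)            # one max per output channel
    scale = (np.maximum(amax, 1e-8) / 127.0).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
w_hat = q.astype(np.float32) * scale                       # dequantize
max_err = np.abs(w_hat - w).max()                          # bounded by scale / 2 per channel
```

Because the scale is symmetric and per output channel, the reconstruction error per weight never exceeds half a quantization step of its own channel.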
## Evaluation
Evaluated on 48 prompts (16 text-to-image, 16 editing, 16 composition). Both BF16 baseline and INT8 outputs are generated with identical prompts and seeds, then scored independently.
### Understanding the Metrics
We report two categories of metrics:
**Text-Image Alignment** measures output quality independently:
- **CLIP Score ↑**: Uses OpenAI's CLIP model to score how well each generated image matches its text prompt. Both BF16 and quantized models are evaluated independently against the same prompts: this is not a comparison between the two outputs, but an independent quality measure for each. Higher is better (typical range: 0.25–0.35).

**Fidelity** measures how closely the quantized output matches the BF16 baseline:
- **LPIPS ↓** (Learned Perceptual Image Patch Similarity): Uses a neural network to judge perceptual similarity the way a human would. Unlike pixel-level metrics, LPIPS captures structural and textural differences. 0 = perceptually identical, 1 = completely different. Values below 0.1 indicate very high fidelity.
- **PSNR ↑** (Peak Signal-to-Noise Ratio): Measures pixel-level accuracy in decibels. Higher values mean less error. 20–30 dB is typical for quantized model comparisons; 30+ dB is excellent.
- **FID ↓** (Fréchet Inception Distance): Compares the statistical distribution of all generated images (not individual pairs). Lower means the quantized model produces images from the same visual distribution as BF16. Sensitive to sample size: our 48-image evaluation provides a directional signal rather than a definitive score.
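Of the three fidelity metrics, PSNR is the only one with a closed-form definition (LPIPS and FID require pretrained networks). The standard formula, assuming images scaled to [0, 1]:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # pixel-identical images
    return 10.0 * np.log10(peak**2 / mse)

# A uniform error of 0.01 per pixel gives MSE = 1e-4, i.e. 40 dB.
a = np.zeros((8, 8))
b = a + 0.01
```

Note the logarithmic scale: each 10 dB corresponds to a 10x drop in mean squared error, so the gap between 22 dB and 23 dB is larger than it looks.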
### Text-Image Alignment (CLIP Score ↑)
CLIP score measures how well the generated image matches the text prompt (higher = better). Both models are evaluated independently:
| Model | CLIP Score |
|---|---|
| BF16 (baseline) | 0.6426 |
| INT8 | 0.6422 |
### Fidelity vs BF16 Baseline
These metrics measure how closely the quantized output matches the BF16 reference:
| Metric | Value | Description |
|---|---|---|
| LPIPS ↓ | 0.0615 | Perceptual distance (0 = identical) |
| PSNR ↑ | 22.34 dB | Signal-to-noise ratio |
| FID ↓ | 32.27 | Distribution distance |
### Per-Task Breakdown
| Task | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
|---|---|---|---|
| Text-to-Image | 0.6549 | 0.0450 | 22.71 dB |
| Editing | 0.6279 | 0.0763 | 21.95 dB |
| Composition | 0.6440 | 0.0633 | 22.36 dB |
## Comparison with FP8 (Reference)
Black Forest Labs officially provides FP8 quantized checkpoints for FLUX.2 Klein. However, FP8 (float8_e4m3fn) requires hardware support introduced with NVIDIA Ada Lovelace (RTX 40-series / L4 / L40). INT8 offers a quantized alternative at the same ~2× compression ratio for GPUs that lack native FP8 support (e.g., Ampere, Turing, or non-NVIDIA hardware with INT8 acceleration).
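A quick way to tell which format a given NVIDIA GPU can run natively is its CUDA compute capability: FP8 tensor cores arrived with Hopper (9.0) and Ada (8.9), while INT8 tensor-core paths go back to Turing (7.5). A small helper sketch (the thresholds are our assumption to verify against your software stack; the capability tuple itself comes from `torch.cuda.get_device_capability()` at runtime):

```python
def supports_native_fp8(capability: tuple) -> bool:
    """True when the CUDA compute capability implies FP8 tensor cores.
    FP8 (e4m3/e5m2) hardware arrived with Hopper (9.0) and Ada (8.9);
    these thresholds are assumptions to check against your stack."""
    return tuple(capability) >= (8, 9)

def supports_int8_tensor_cores(capability: tuple) -> bool:
    """INT8 tensor-core matmuls go back to Turing (7.5)."""
    return tuple(capability) >= (7, 5)

# On a CUDA machine the tuple comes from:
#   capability = torch.cuda.get_device_capability()  # e.g. (8, 6) for Ampere A10
```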
The table below compares both formats against the same BF16 baseline (CLIP 0.6426), evaluated with identical prompts and seeds:
| Metric | INT8 | FP8 |
|---|---|---|
| CLIP ↑ | 0.6422 | 0.6419 |
| LPIPS ↓ | 0.0615 | 0.0559 |
| PSNR ↑ | 22.34 dB | 23.14 dB |
| FID ↓ | 32.27 | 28.91 |
### Per-Task Breakdown
| Task | INT8 CLIP ↑ | FP8 CLIP ↑ | INT8 LPIPS ↓ | FP8 LPIPS ↓ | INT8 PSNR ↑ | FP8 PSNR ↑ |
|---|---|---|---|---|---|---|
| Text-to-Image | 0.6549 | 0.6547 | 0.0450 | 0.0452 | 22.71 dB | 22.85 dB |
| Editing | 0.6279 | 0.6297 | 0.0763 | 0.0598 | 21.95 dB | 23.04 dB |
| Composition | 0.6440 | 0.6415 | 0.0633 | 0.0627 | 22.36 dB | 23.53 dB |
## Visual Comparison (BF16 vs INT8)
All images generated with identical prompts and seeds (4 denoising steps, 1024×1024).
### Text-to-Image
"Oil painting of a stormy seascape in the style of J.M.W. Turner, violent waves crashing against rocks, ship barely visible in mist, thick impasto texture"
### Image Editing
Base: "A red sports car parked in a garage" Edit: "Change the car color to yellow and make the garage look like a futuristic space hangar"
### Multi-Reference Composition (2 references)
Ref 1: "A weathered bronze statue of a Greek philosopher" Ref 2: "A lush tropical rainforest canopy" Compose: "The statue is being reclaimed by the jungle, with vines and flowers growing over its features"
## INT8 Performance Benchmarks
Measured on NVIDIA RTX 5090 with PyTorch 2.10.0+cu130 and CUDA 13.0. Full INT8 stack (INT8 transformer + INT8 text encoder). Resolution: 1024×1024.
| Model | Steps | Eager latency | Compiled latency | Throughput | VRAM |
|---|---|---|---|---|---|
| klein-4b | 4 | 1.77s | 0.72s | 1.387 img/s | 11.25 GB |
| klein-base-4b | 50 | 33.25s | 9.92s | 0.101 img/s | 11.26 GB |
| klein-9b (this model) | 4 | 3.04s | 1.09s | 0.917 img/s | 20.15 GB |
| klein-base-9b | 50 | 62.49s | 18.70s | 0.053 img/s | 20.16 GB |
**torch.compile speedup:** 2.5× (klein-4b), 3.4× (klein-base-4b), 2.8× (klein-9b), 3.3× (klein-base-9b)

**Why the large speedup?** Our pipeline loads INT8 weights using TorchAO, which represents linear layers as W8A8 quantized tensors. In eager mode, each quantized matmul dispatches separate CUDA kernels for dequantization and computation. With `torch.compile`, the full graph is traced and these operations are fused into optimized Triton kernels that perform dequantize + matmul in a single pass, eliminating kernel launch overhead and intermediate memory traffic.
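The W8A8 arithmetic being fused is conceptually simple: accumulate the int8 × int8 products in int32, then apply the product of the two scales once at the end. A NumPy sketch of the idea (per-tensor scales for brevity; the actual checkpoint uses per-channel weight scales):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 32)).astype(np.float32)  # weights

# Per-tensor symmetric quantization via max-abs calibration
# (per-channel in the real checkpoint; per-tensor keeps the sketch short).
x_scale = np.abs(x).max() / 127.0
w_scale = np.abs(w).max() / 127.0
x_q = np.round(x / x_scale).astype(np.int8)
w_q = np.round(w / w_scale).astype(np.int8)

# Integer matmul with int32 accumulation, then one rescale at the end:
# the dequantize + matmul pair that torch.compile fuses into one kernel.
y_q = x_q.astype(np.int32) @ w_q.astype(np.int32)
y = y_q.astype(np.float32) * (x_scale * w_scale)

rel_err = np.abs(y - x @ w).max() / np.abs(x @ w).max()
```

In eager mode the dequantize and the matmul are separate kernel launches with an intermediate tensor in between; the fused version never materializes the dequantized weights.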
## Usage
Code release coming soon. A pip-installable loader library is in preparation.
## Compatibility
This checkpoint uses the official FLUX.2 single-file safetensors format: the same key layout and structure used by Black Forest Labs for their official FP8 and NVFP4 quantized models. Any loader that supports quantized FLUX.2 single-file checkpoints can load this INT8 checkpoint.
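The single-file safetensors layout can be inspected without loading any tensors: the file begins with an 8-byte little-endian header length followed by a JSON header mapping each key to its dtype, shape, and byte offsets. A stdlib-only sketch that builds and parses a toy header (the key names below are made up for illustration, not the checkpoint's actual keys):

```python
import json
import struct

def read_safetensors_header(raw: bytes) -> dict:
    """Parse the JSON header of a safetensors file from its raw bytes."""
    (header_len,) = struct.unpack("<Q", raw[:8])  # little-endian u64
    return json.loads(raw[8 : 8 + header_len])

# Toy header mirroring the W8A8 layout: an int8 weight plus its float32
# scales (key names here are hypothetical, for illustration only).
header = {
    "layer.weight": {"dtype": "I8", "shape": [64, 128], "data_offsets": [0, 8192]},
    "layer.weight_scale": {"dtype": "F32", "shape": [64, 1], "data_offsets": [8192, 8448]},
}
blob = json.dumps(header).encode("utf-8")
raw = struct.pack("<Q", len(blob)) + blob  # tensor data would follow here

parsed = read_safetensors_header(raw)
```

Listing the header keys and dtypes this way is a cheap sanity check that a loader and a checkpoint agree on the key layout before committing to a full load.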
## License
This model inherits the license from the base model: FLUX Non-Commercial.
## Acknowledgments
- Black Forest Labs for FLUX.2
- NVIDIA for ModelOpt quantization tools
- TorchAO for quantized tensor runtime