FLUX.2 [klein] 9B (step-distilled) – INT8 (W8A8) Transformer

Quantized transformer checkpoint for FLUX.2 [klein] 9B (step-distilled).

INT8 weight and activation quantization via NVIDIA ModelOpt with calibrated input scales.

Note: This repo contains only the quantized transformer weights. The text encoder, VAE, tokenizer, and scheduler are loaded from the base model: black-forest-labs/FLUX.2-klein-9B.

Model Details

| Property | Value |
| --- | --- |
| Base Model | black-forest-labs/FLUX.2-klein-9B |
| Parameters | 9B |
| Quantization Format | INT8 (W8A8) |
| Quantization Type | Weight + Activation (W8A8) |
| Compression | ~2× vs BF16 |
| Weight dtype | int8 |
| Scale dtype | float32 |
| Key format | Single-file safetensors |
| Checkpoint | flux-2-klein-9b-int8.safetensors |

Quantization Details

| Property | Value |
| --- | --- |
| Framework | NVIDIA TensorRT Model Optimizer (ModelOpt) |
| Calibration Method | NVIDIA ModelOpt max (per-channel max-abs) |
| Calibration Dataset | 256 samples drawn from a pool of 768 diverse prompts (256 T2I, 256 editing, 256 composition) |
| Denoising Steps (calibration) | 4 per sample |
| Weight Quantization | Per-channel symmetric (axis=0) |
| Activation Quantization | Per-tensor, via baked input_scale / weight_scale tensors |
| Preserved Layers | Embedder layers (time_embed, context_embedder, x_embedder) and the output projection, kept in BF16 |
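
For intuition, the weight-side scheme above (per-channel symmetric INT8 with max-abs calibration) can be sketched in a few lines of plain PyTorch. This is an illustration of the math only, not the ModelOpt implementation; the tensor names mirror the baked-scale convention described above:

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Per-channel symmetric INT8 quantization along axis=0 (output channels)."""
    # Max-abs calibration: one float32 scale per output channel.
    scale = (w.abs().amax(dim=1) / 127.0).float()
    q = torch.clamp(torch.round(w.float() / scale[:, None]), -127, 127).to(torch.int8)
    return q, scale  # int8 weights + float32 weight_scale

def quantize_act_int8(x: torch.Tensor, input_scale: torch.Tensor) -> torch.Tensor:
    """Per-tensor activation quantization using a calibrated input_scale."""
    return torch.clamp(torch.round(x / input_scale), -127, 127).to(torch.int8)

# Round-trip check on a random BF16-like weight matrix.
w = torch.randn(3072, 3072, dtype=torch.bfloat16)
q, weight_scale = quantize_weight_int8(w)
w_hat = q.float() * weight_scale[:, None]
print((w.float() - w_hat).abs().max())  # small dequantization error
```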

Evaluation

Evaluated on 48 prompts (16 text-to-image, 16 editing, 16 composition). BF16 baseline and INT8 outputs are generated with identical prompts and seeds, then scored independently.

📊 Understanding the Metrics

We report two categories of metrics:

Text-Image Alignment – measures output quality independently:

  • CLIP Score ↑: Uses OpenAI's CLIP model to score how well each generated image matches its text prompt. Both BF16 and quantized models are evaluated independently against the same prompts: this is not a comparison between the two outputs, but an independent quality measure for each. Higher is better; raw CLIP cosine similarity typically falls in 0.25–0.35, though the absolute scale depends on the CLIP variant and any rescaling applied. A computation sketch follows this list.

Fidelity – measures how closely the quantized output matches the BF16 baseline:

  • LPIPS ↓ (Learned Perceptual Image Patch Similarity): Uses a neural network to judge perceptual similarity the way a human would. Unlike pixel-level metrics, LPIPS captures structural and textural differences. 0 = perceptually identical, 1 = completely different. Values below 0.1 indicate very high fidelity.
  • PSNR ↑ (Peak Signal-to-Noise Ratio): Measures pixel-level accuracy in decibels. Higher values mean less error. 20–30 dB is typical for quantized model comparisons; 30+ dB is excellent.
  • FID ↓ (Fréchet Inception Distance): Compares the statistical distribution of all generated images (not individual pairs). Lower means the quantized model produces images from the same visual distribution as BF16. Sensitive to sample size: our 48-image evaluation provides a directional signal rather than a definitive score.
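
For reference, the per-image metrics can be reproduced with off-the-shelf libraries. Below is a minimal sketch using transformers CLIP, the lpips package, and NumPy; the CLIP checkpoint shown (openai/clip-vit-base-patch32) is an assumption, since the card does not pin the exact variant used for scoring:

```python
import numpy as np
import torch
import lpips  # pip install lpips
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP score: cosine similarity between image and prompt embeddings.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = proc(text=[prompt], images=image, return_tensors="pt",
                  padding=True, truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# LPIPS: perceptual distance between quantized and BF16 outputs.
lpips_net = lpips.LPIPS(net="alex")  # expects NCHW tensors in [-1, 1]

def lpips_dist(a: Image.Image, b: Image.Image) -> float:
    to_t = lambda im: torch.from_numpy(np.asarray(im)).permute(2, 0, 1).float()[None] / 127.5 - 1.0
    with torch.no_grad():
        return float(lpips_net(to_t(a), to_t(b)))

# PSNR over 8-bit RGB pixels.
def psnr(a: Image.Image, b: Image.Image) -> float:
    x, y = np.asarray(a, np.float64), np.asarray(b, np.float64)
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```

FID is computed once over the full image sets (e.g., with torchmetrics' FrechetInceptionDistance) rather than per pair, so it is omitted from this per-image sketch.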

Text-Image Alignment (CLIP Score ↑)

CLIP score measures how well the generated image matches the text prompt (higher = better). Both models are evaluated independently:

| Model | CLIP Score ↑ |
| --- | --- |
| BF16 (baseline) | 0.6426 |
| INT8 | 0.6422 |

Fidelity vs BF16 Baseline

These metrics measure how closely the quantized output matches the BF16 reference:

| Metric | Value | Description |
| --- | --- | --- |
| LPIPS ↓ | 0.0615 | Perceptual distance (0 = identical) |
| PSNR ↑ | 22.34 dB | Signal-to-noise ratio |
| FID ↓ | 32.27 | Distribution distance |

Per-Task Breakdown

| Task | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
| --- | --- | --- | --- |
| Text-to-Image | 0.6549 | 0.0450 | 22.71 dB |
| Editing | 0.6279 | 0.0763 | 21.95 dB |
| Composition | 0.6440 | 0.0633 | 22.36 dB |

Comparison with FP8 (Reference)

Black Forest Labs officially provides FP8 quantized checkpoints for FLUX.2 Klein. However, FP8 (float8_e4m3fn) requires hardware support introduced with NVIDIA Ada Lovelace (RTX 40-series / L4 / L40). INT8 offers a quantized alternative at the same ~2× compression ratio for GPUs that lack native FP8 support (e.g., Ampere, Turing, or non-NVIDIA hardware with INT8 acceleration).
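
At runtime, one way to choose between the two checkpoints is to inspect the GPU's compute capability: native FP8 tensor-core support begins at SM 8.9 (Ada Lovelace) and SM 9.0 (Hopper). A minimal sketch:

```python
import torch

def pick_checkpoint() -> str:
    """Prefer FP8 where the hardware supports it natively; otherwise fall back to INT8."""
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):  # Ada Lovelace (8.9), Hopper (9.0), and newer
            return "fp8"
    return "int8"  # Ampere, Turing, and other INT8-capable hardware

print(pick_checkpoint())
```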

The table below compares both formats against the same BF16 baseline (CLIP 0.6426), evaluated with identical prompts and seeds:

| Metric | INT8 | FP8 |
| --- | --- | --- |
| CLIP ↑ | 0.6422 | 0.6419 |
| LPIPS ↓ | 0.0615 | 0.0559 |
| PSNR ↑ | 22.34 dB | 23.14 dB |
| FID ↓ | 32.27 | 28.91 |

Per-Task Breakdown

| Task | CLIP ↑ (INT8) | CLIP ↑ (FP8) | LPIPS ↓ (INT8) | LPIPS ↓ (FP8) | PSNR ↑ (INT8) | PSNR ↑ (FP8) |
| --- | --- | --- | --- | --- | --- | --- |
| Text-to-Image | 0.6549 | 0.6547 | 0.0450 | 0.0452 | 22.71 dB | 22.85 dB |
| Editing | 0.6279 | 0.6297 | 0.0763 | 0.0598 | 21.95 dB | 23.04 dB |
| Composition | 0.6440 | 0.6415 | 0.0633 | 0.0627 | 22.36 dB | 23.53 dB |

Visual Comparison (BF16 vs INT8)

All images generated with identical prompts and seeds (4 denoising steps, 1024×1024).

Text-to-Image

"Oil painting of a stormy seascape in the style of J.M.W. Turner, violent waves crashing against rocks, ship barely visible in mist, thick impasto texture"

Text-to-Image: BF16 vs INT8

Image Editing

Base: "A red sports car parked in a garage" Edit: "Change the car color to yellow and make the garage look like a futuristic space hangar"

Image Editing: BF16 vs INT8

Multi-Reference Composition (2 references)

Ref 1: "A weathered bronze statue of a Greek philosopher"
Ref 2: "A lush tropical rainforest canopy"
Compose: "The statue is being reclaimed by the jungle, with vines and flowers growing over its features"

Multi-Reference Composition (2 references): BF16 vs INT8

INT8 Performance Benchmarks

Measured on NVIDIA RTX 5090 with PyTorch 2.10.0+cu130 and CUDA 13.0. Full INT8 stack (INT8 transformer + INT8 text encoder). Resolution: 1024×1024.

| Model | Steps | Eager Latency | Compiled Latency | Throughput | VRAM |
| --- | --- | --- | --- | --- | --- |
| klein-4b | 4 | 1.77 s | 0.72 s | 1.387 img/s | 11.25 GB |
| klein-base-4b | 50 | 33.25 s | 9.92 s | 0.101 img/s | 11.26 GB |
| klein-9b (this model) | 4 | 3.04 s | 1.09 s | 0.917 img/s | 20.15 GB |
| klein-base-9b | 50 | 62.49 s | 18.70 s | 0.053 img/s | 20.16 GB |

torch.compile speedup: 2.5× (klein-4b), 3.4× (klein-base-4b), 2.8× (klein-9b), 3.3× (klein-base-9b)

Why the large speedup? Our pipeline loads INT8 weights using TorchAO, which represents linear layers as W8A8 quantized tensors. In eager mode, each quantized matmul dispatches separate CUDA kernels for dequantization and computation. With torch.compile, the full graph is traced and these operations are fused into optimized Triton kernels that perform dequantize + matmul in a single pass, eliminating kernel launch overhead and intermediate memory traffic.
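
The same pattern can be reproduced on any linear-heavy module. Below is a self-contained toy benchmark using TorchAO's quantize_ API; the stock int8 dynamic-activation/int8-weight config stands in for the baked static scales used by this checkpoint, so treat it as a sketch of the mechanism rather than our pipeline (absolute timings will differ by hardware):

```python
import time
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# A linear-heavy stand-in for a transformer block stack (requires a CUDA GPU).
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda().eval()
quantize_(model, int8_dynamic_activation_int8_weight())  # W8A8 quantized linears

x = torch.randn(16, 4096, device="cuda")

@torch.no_grad()
def bench(fn, iters=50):
    for _ in range(5):  # warmup (also triggers compilation for compiled fn)
        fn(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per iteration

eager_ms = bench(model)                                   # separate dequant + matmul kernels
compiled_ms = bench(torch.compile(model, mode="max-autotune"))  # fused Triton kernels
print(f"eager: {eager_ms:.2f} ms  |  compiled: {compiled_ms:.2f} ms")
```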

Usage

🚧 Code release coming soon. A pip-installable loader library is in preparation.

Compatibility

This checkpoint uses the official FLUX.2 single-file safetensors format: the same key layout and structure used by Black Forest Labs for their official FP8 and NVFP4 quantized models. Any loader that supports quantized FLUX.2 single-file checkpoints can load this INT8 checkpoint.
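
Until the loader ships, the checkpoint can already be fetched and inspected with standard tooling. A minimal sketch using huggingface_hub and safetensors (repo and file names as published on this page):

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

path = hf_hub_download(
    repo_id="vistralis/FLUX.2-klein-9b-INT8-transformer",
    filename="flux-2-klein-9b-int8.safetensors",
)

# List a few tensors to see the int8 weights alongside their float32 scales.
with safe_open(path, framework="pt", device="cpu") as f:
    for key in list(f.keys())[:8]:
        t = f.get_tensor(key)
        print(f"{key:60s} {str(t.dtype):12s} {tuple(t.shape)}")
```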

License

This model inherits the license from the base model: FLUX Non-Commercial.
