# FLUX.2 [klein] 9B (step-distilled) – INT8 (W8A8) Transformer
Quantized transformer checkpoint for FLUX.2 [klein] 9B (step-distilled).
INT8 weight and activation quantization via NVIDIA ModelOpt with calibrated input scales.
Note: This repo contains only the quantized transformer weights. The text encoder, VAE, tokenizer, and scheduler are loaded from the base model:
black-forest-labs/FLUX.2-klein-9B.
## Model Details
| Property | Value |
|---|---|
| Base Model | black-forest-labs/FLUX.2-klein-9B |
| Parameters | 9B |
| Quantization Format | INT8 (W8A8) |
| Quantization Type | Weight + Activation (W8A8) |
| Compression | ~2x vs BF16 |
| Weight dtype | int8 |
| Scale dtype | float32 |
| Key format | Single-file safetensors |
| Checkpoint | flux-2-klein-9b-int8.safetensors |
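The ~2x compression figure follows directly from the dtypes above: BF16 stores 2 bytes per weight, INT8 stores 1, and the float32 scales add well under 1% overhead. A back-of-envelope check (the flat 9e9 parameter count is a round-number assumption for illustration):

```python
# Back-of-envelope checkpoint sizes; 9e9 parameters is an illustrative
# round number, and per-channel float32 scale overhead is ignored (<1%).
params = 9e9
bf16_bytes = params * 2  # 2 bytes per BF16 weight
int8_bytes = params * 1  # 1 byte per INT8 weight
ratio = bf16_bytes / int8_bytes
print(f"BF16 ~{bf16_bytes / 1e9:.0f} GB, INT8 ~{int8_bytes / 1e9:.0f} GB ({ratio:.0f}x smaller)")
```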
## Quantization Details
| Property | Value |
|---|---|
| Framework | NVIDIA TensorRT Model Optimizer (ModelOpt) |
| Calibration Method | NVIDIA ModelOpt max (per-channel max abs) |
| Calibration Dataset | 256 samples drawn from a pool of 768 diverse prompts (256 T2I, 256 editing, 256 composition) |
| Denoising Steps (calibration) | 4 per sample |
| Weight Quantization | Per-channel symmetric (axis=0) |
| Activation Quantization | Per-tensor via baked input_scale / weight_scale tensors |
| Preserved Layers | Embedder layers (time_embed, context_embedder, x_embedder) and output projection kept in BF16 |
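The weight path described above (per-channel symmetric quantization along axis 0 with max-abs calibration) can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the technique, not ModelOpt's actual implementation:

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-channel INT8 quantization along axis 0, with max-abs
    ("max") calibration: an illustrative sketch, not ModelOpt's code."""
    amax = np.abs(w).max(axis=1, keepdims=True)            # one max per output channel
    scale = (np.maximum(amax, 1e-8) / 127.0).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
w_hat = q.astype(np.float32) * scale                       # dequantize
max_err = np.abs(w_hat - w).max()                          # bounded by scale / 2 per channel
```

Because the scale is symmetric and per output channel, the reconstruction error per weight never exceeds half a quantization step of its own channel.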
## Evaluation
Evaluated on 48 prompts (16 text-to-image, 16 editing, 16 composition). Both BF16 baseline and INT8 outputs are generated with identical prompts and seeds, then scored independently.
### Understanding the Metrics
We report two categories of metrics:
**Text-Image Alignment** measures output quality independently:
- **CLIP Score ↑**: Uses OpenAI's CLIP model to score how well each generated image matches its text prompt. Both BF16 and quantized models are evaluated independently against the same prompts: this is not a comparison between the two outputs, but an independent quality measure for each. Higher is better (typical range: 0.25–0.35).

**Fidelity** measures how closely the quantized output matches the BF16 baseline:
- **LPIPS ↓** (Learned Perceptual Image Patch Similarity): Uses a neural network to judge perceptual similarity the way a human would. Unlike pixel-level metrics, LPIPS captures structural and textural differences. 0 = perceptually identical, 1 = completely different. Values below 0.1 indicate very high fidelity.
- **PSNR ↑** (Peak Signal-to-Noise Ratio): Measures pixel-level accuracy in decibels. Higher values mean less error. 20–30 dB is typical for quantized model comparisons; 30+ dB is excellent.
- **FID ↓** (Fréchet Inception Distance): Compares the statistical distribution of all generated images (not individual pairs). Lower means the quantized model produces images from the same visual distribution as BF16. Sensitive to sample size: our 48-image evaluation provides a directional signal rather than a definitive score.
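Of the three fidelity metrics, PSNR is the only one with a closed-form definition (LPIPS and FID require pretrained networks). The standard formula, assuming images scaled to [0, 1]:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # pixel-identical images
    return 10.0 * np.log10(peak**2 / mse)

# A uniform error of 0.01 per pixel gives MSE = 1e-4, i.e. 40 dB.
a = np.zeros((8, 8))
b = a + 0.01
```

Note the logarithmic scale: each 10 dB corresponds to a 10x drop in mean squared error, so the gap between 22 dB and 23 dB is larger than it looks.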
### Text-Image Alignment (CLIP Score ↑)
CLIP score measures how well the generated image matches the text prompt (higher = better). Both models are evaluated independently:
| Model | CLIP Score |
|---|---|
| BF16 (baseline) | 0.6426 |
| INT8 | 0.6422 |
### Fidelity vs BF16 Baseline
These metrics measure how closely the quantized output matches the BF16 reference:
| Metric | Value | Description |
|---|---|---|
| LPIPS ↓ | 0.0615 | Perceptual distance (0 = identical) |
| PSNR ↑ | 22.34 dB | Signal-to-noise ratio |
| FID ↓ | 32.27 | Distribution distance |
### Per-Task Breakdown
| Task | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
|---|---|---|---|
| Text-to-Image | 0.6549 | 0.0450 | 22.71 dB |
| Editing | 0.6279 | 0.0763 | 21.95 dB |
| Composition | 0.6440 | 0.0633 | 22.36 dB |
## Comparison with FP8 (Reference)
Black Forest Labs officially provides FP8 quantized checkpoints for FLUX.2 Klein. However, FP8 (float8_e4m3fn) requires hardware support introduced with NVIDIA Ada Lovelace (RTX 40-series / L4 / L40). INT8 offers a quantized alternative at the same ~2× compression ratio for GPUs that lack native FP8 support (e.g., Ampere, Turing, or non-NVIDIA hardware with INT8 acceleration).
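A quick way to tell which format a given NVIDIA GPU can run natively is its CUDA compute capability: FP8 tensor cores arrived with Hopper (9.0) and Ada (8.9), while INT8 tensor-core paths go back to Turing (7.5). A small helper sketch (the thresholds are our assumption to verify against your software stack; the capability tuple itself comes from `torch.cuda.get_device_capability()` at runtime):

```python
def supports_native_fp8(capability: tuple) -> bool:
    """True when the CUDA compute capability implies FP8 tensor cores.
    FP8 (e4m3/e5m2) hardware arrived with Hopper (9.0) and Ada (8.9);
    these thresholds are assumptions to check against your stack."""
    return tuple(capability) >= (8, 9)

def supports_int8_tensor_cores(capability: tuple) -> bool:
    """INT8 tensor-core matmuls go back to Turing (7.5)."""
    return tuple(capability) >= (7, 5)

# On a CUDA machine the tuple comes from:
#   capability = torch.cuda.get_device_capability()  # e.g. (8, 6) for Ampere A10
```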
The table below compares both formats against the same BF16 baseline (CLIP 0.6426), evaluated with identical prompts and seeds:
| Metric | INT8 | FP8 |
|---|---|---|
| CLIP ↑ | 0.6422 | 0.6419 |
| LPIPS ↓ | 0.0615 | 0.0559 |
| PSNR ↑ | 22.34 dB | 23.14 dB |
| FID ↓ | 32.27 | 28.91 |
### Per-Task Breakdown
| Task | INT8 CLIP ↑ | FP8 CLIP ↑ | INT8 LPIPS ↓ | FP8 LPIPS ↓ | INT8 PSNR ↑ | FP8 PSNR ↑ |
|---|---|---|---|---|---|---|
| Text-to-Image | 0.6549 | 0.6547 | 0.0450 | 0.0452 | 22.71 dB | 22.85 dB |
| Editing | 0.6279 | 0.6297 | 0.0763 | 0.0598 | 21.95 dB | 23.04 dB |
| Composition | 0.6440 | 0.6415 | 0.0633 | 0.0627 | 22.36 dB | 23.53 dB |
## Visual Comparison (BF16 vs INT8)
All images generated with identical prompts and seeds (4 denoising steps, 1024×1024).
### Text-to-Image
"Oil painting of a stormy seascape in the style of J.M.W. Turner, violent waves crashing against rocks, ship barely visible in mist, thick impasto texture"
### Image Editing
Base: "A red sports car parked in a garage" Edit: "Change the car color to yellow and make the garage look like a futuristic space hangar"
### Multi-Reference Composition (2 references)
Ref 1: "A weathered bronze statue of a Greek philosopher" Ref 2: "A lush tropical rainforest canopy" Compose: "The statue is being reclaimed by the jungle, with vines and flowers growing over its features"
## INT8 Performance Benchmarks
Measured on NVIDIA RTX 5090 with PyTorch 2.10.0+cu130 and CUDA 13.0. Full INT8 stack (INT8 transformer + INT8 text encoder). Resolution: 1024×1024.
| Model | Steps | Eager latency | Compiled latency | Throughput | VRAM |
|---|---|---|---|---|---|
| klein-4b | 4 | 1.77s | 0.72s | 1.387 img/s | 11.25 GB |
| klein-base-4b | 50 | 33.25s | 9.92s | 0.101 img/s | 11.26 GB |
| klein-9b (this model) | 4 | 3.04s | 1.09s | 0.917 img/s | 20.15 GB |
| klein-base-9b | 50 | 62.49s | 18.70s | 0.053 img/s | 20.16 GB |
**torch.compile speedup:** 2.5× (klein-4b), 3.4× (klein-base-4b), 2.8× (klein-9b), 3.3× (klein-base-9b)

**Why the large speedup?** Our pipeline loads INT8 weights using TorchAO, which represents linear layers as W8A8 quantized tensors. In eager mode, each quantized matmul dispatches separate CUDA kernels for dequantization and computation. With `torch.compile`, the full graph is traced and these operations are fused into optimized Triton kernels that perform dequantize + matmul in a single pass, eliminating kernel launch overhead and intermediate memory traffic.
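The W8A8 arithmetic being fused is conceptually simple: accumulate the int8 × int8 products in int32, then apply the product of the two scales once at the end. A NumPy sketch of the idea (per-tensor scales for brevity; the actual checkpoint uses per-channel weight scales):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 32)).astype(np.float32)  # weights

# Per-tensor symmetric quantization via max-abs calibration
# (per-channel in the real checkpoint; per-tensor keeps the sketch short).
x_scale = np.abs(x).max() / 127.0
w_scale = np.abs(w).max() / 127.0
x_q = np.round(x / x_scale).astype(np.int8)
w_q = np.round(w / w_scale).astype(np.int8)

# Integer matmul with int32 accumulation, then one rescale at the end:
# the dequantize + matmul pair that torch.compile fuses into one kernel.
y_q = x_q.astype(np.int32) @ w_q.astype(np.int32)
y = y_q.astype(np.float32) * (x_scale * w_scale)

rel_err = np.abs(y - x @ w).max() / np.abs(x @ w).max()
```

In eager mode the dequantize and the matmul are separate kernel launches with an intermediate tensor in between; the fused version never materializes the dequantized weights.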
## Usage
Code release coming soon. A pip-installable loader library is in preparation.
## Compatibility
This checkpoint uses the official FLUX.2 single-file safetensors format: the same key layout and structure used by Black Forest Labs for their official FP8 and NVFP4 quantized models. Any loader that supports quantized FLUX.2 single-file checkpoints can load this INT8 checkpoint.
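The single-file safetensors layout can be inspected without loading any tensors: the file begins with an 8-byte little-endian header length followed by a JSON header mapping each key to its dtype, shape, and byte offsets. A stdlib-only sketch that builds and parses a toy header (the key names below are made up for illustration, not the checkpoint's actual keys):

```python
import json
import struct

def read_safetensors_header(raw: bytes) -> dict:
    """Parse the JSON header of a safetensors file from its raw bytes."""
    (header_len,) = struct.unpack("<Q", raw[:8])  # little-endian u64
    return json.loads(raw[8 : 8 + header_len])

# Toy header mirroring the W8A8 layout: an int8 weight plus its float32
# scales (key names here are hypothetical, for illustration only).
header = {
    "layer.weight": {"dtype": "I8", "shape": [64, 128], "data_offsets": [0, 8192]},
    "layer.weight_scale": {"dtype": "F32", "shape": [64, 1], "data_offsets": [8192, 8448]},
}
blob = json.dumps(header).encode("utf-8")
raw = struct.pack("<Q", len(blob)) + blob  # tensor data would follow here

parsed = read_safetensors_header(raw)
```

Listing the header keys and dtypes this way is a cheap sanity check that a loader and a checkpoint agree on the key layout before committing to a full load.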
## License
This model inherits the license from the base model: FLUX Non-Commercial.
## Acknowledgments
- Black Forest Labs for FLUX.2
- NVIDIA for ModelOpt quantization tools
- TorchAO for quantized tensor runtime