FLUX.2-klein-4B Int8 Quantized

This is a weight-only int8 quantization of black-forest-labs/FLUX.2-klein-4B, produced with optimum-quanto.

Both the transformer and the text encoder (Qwen3) are quantized; the VAE is left unquantized.

Works on both CUDA and Apple Silicon (MPS).
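
For reference, checkpoints like this one can be produced with optimum-quanto's quantize/freeze workflow and its state-dict serialization. Below is a minimal sketch for the text encoder; it follows quanto's documented API, but it is an illustration, not the exact script used to build this repository (the subfolder name is assumed from the base repo layout).

import json
import torch
from transformers import Qwen3ForCausalLM
from optimum.quanto import quantize, freeze, qint8, quantization_map
from safetensors.torch import save_file

# Load the original bf16 text encoder (subfolder name assumed)
text_encoder = Qwen3ForCausalLM.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    subfolder="text_encoder",
    torch_dtype=torch.bfloat16,
)

# Replace Linear weights with int8 quantized tensors, then materialize them
quantize(text_encoder, weights=qint8)
freeze(text_encoder)

# quanto restores models from a plain state dict plus a quantization map,
# so both files must be saved (compare the text_encoder/ layout under Files)
save_file(text_encoder.state_dict(), "model.safetensors")
with open("quanto_qmap.json", "w") as f:
    json.dump(quantization_map(text_encoder), f)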

Memory Savings

Component              Original (bf16)   Quantized (int8)
Transformer            7.4 GB            3.6 GB
Text Encoder (Qwen3)   8.0 GB            4.5 GB
VAE                    0.16 GB           0.16 GB
Total                  ~15.6 GB          ~8.3 GB
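
The quantized figures can be cross-checked against the checkpoint files themselves; a small sketch using the file names listed under Files below:

import os
from huggingface_hub import snapshot_download

# Download (or reuse) the local snapshot and report on-disk sizes
path = snapshot_download("aydin99/FLUX.2-klein-4B-int8")
for name in ("diffusion_pytorch_model.safetensors", "text_encoder/model.safetensors"):
    size_gb = os.path.getsize(os.path.join(path, name)) / 1e9
    print(f"{name}: {size_gb:.2f} GB")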

Usage

Installation

pip install torch diffusers transformers accelerate optimum-quanto huggingface-hub safetensors
pip install git+https://github.com/huggingface/diffusers.git@flux2-klein

Note: FLUX.2-klein support requires the flux2-klein branch of diffusers until it's merged into the main release.
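
To confirm the branch install took effect, import the pipeline class; it is absent from stock diffusers releases:

# Raises ImportError on a diffusers release without FLUX.2-klein support
from diffusers import Flux2KleinPipeline
print("FLUX.2-klein support is available")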

Quick Start

import torch
from diffusers import Flux2KleinPipeline
from transformers import Qwen3ForCausalLM, AutoTokenizer, AutoConfig
from optimum.quanto import requantize
from accelerate import init_empty_weights
from safetensors.torch import load_file
from huggingface_hub import snapshot_download, hf_hub_download
import importlib.util
import json

# Download model
model_path = snapshot_download("aydin99/FLUX.2-klein-4B-int8")

# Load wrapper class
spec = importlib.util.spec_from_file_location(
    "quantized_flux2", 
    hf_hub_download("aydin99/FLUX.2-klein-4B-int8", "quantized_flux2.py")
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
QuantizedFlux2Transformer2DModel = module.QuantizedFlux2Transformer2DModel

# Choose device (fall back to CPU if neither CUDA nor MPS is available)
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Load quantized transformer
qtransformer = QuantizedFlux2Transformer2DModel.from_pretrained(model_path)
qtransformer.to(device=device, dtype=torch.bfloat16)

# Load quantized text encoder
config = AutoConfig.from_pretrained(f"{model_path}/text_encoder", trust_remote_code=True)
with init_empty_weights():
    text_encoder = Qwen3ForCausalLM(config)

with open(f"{model_path}/text_encoder/quanto_qmap.json", "r") as f:
    qmap = json.load(f)
state_dict = load_file(f"{model_path}/text_encoder/model.safetensors")
requantize(text_encoder, state_dict=state_dict, quantization_map=qmap)
text_encoder.eval()
text_encoder.to(device, dtype=torch.bfloat16)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(f"{model_path}/tokenizer")

# Load pipeline (VAE + scheduler only)
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    transformer=None,
    text_encoder=None,
    tokenizer=None,
    torch_dtype=torch.bfloat16
)

# Inject quantized components
pipe.transformer = qtransformer._wrapped
pipe.text_encoder = text_encoder
pipe.tokenizer = tokenizer
pipe.to(device)

# Generate!
image = pipe(
    prompt="A futuristic city at sunset, cyberpunk style",
    height=512,
    width=512,
    num_inference_steps=4,
    guidance_scale=0.0
).images[0]

image.save("output.png")
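
For reproducible outputs, pass a seeded torch.Generator, as with other diffusers pipelines (the generator argument is standard; the seed value is arbitrary):

# A fixed seed gives deterministic latents and therefore a reproducible image.
# On older PyTorch builds without MPS generator support, use device="cpu" here.
generator = torch.Generator(device=device).manual_seed(42)
image = pipe(
    prompt="A futuristic city at sunset, cyberpunk style",
    height=512,
    width=512,
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=generator,
).images[0]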

Performance

Hardware                   Speed (512x512, 4 steps)
CUDA (RTX 3090)            ~2-3 seconds
Apple Silicon (M1/M2/M3)   ~8 seconds
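
These numbers exclude model loading and the first warm-up call; steady-state latency can be measured like this (a minimal sketch reusing the pipe from the quick start):

import time

# The first call is slower (kernel compilation and caches); time the second one
_ = pipe(prompt="warm-up", height=512, width=512,
         num_inference_steps=4, guidance_scale=0.0)

start = time.perf_counter()
_ = pipe(prompt="A futuristic city at sunset, cyberpunk style",
         height=512, width=512, num_inference_steps=4, guidance_scale=0.0)
print(f"{time.perf_counter() - start:.1f} s per image")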

Limitations

  • Int4 not supported on MPS: PyTorch's int4 packed matmul is buggy on Apple Silicon, so only int8 is provided; it works on both CUDA and MPS (see the sketch below for choosing a weight dtype per device).
  • Requires the diffusers flux2-klein branch: needed until FLUX.2-klein support is merged into a mainline diffusers release.
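
If you re-quantize the model yourself, picking the quanto weight dtype per device sidesteps the MPS int4 issue. A sketch; `model` stands for any nn.Module you have already loaded, as in the quick start:

import torch
from optimum.quanto import quantize, freeze, qint4, qint8

# Assumption: int4 packed matmul is only reliable on CUDA; use int8 elsewhere
weights = qint4 if torch.cuda.is_available() else qint8
quantize(model, weights=weights)  # `model` is a module you loaded earlier
freeze(model)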

Files

├── diffusion_pytorch_model.safetensors  # Quantized transformer (3.6 GB)
├── quanto_qmap.json                     # Transformer quantization map
├── config.json                          # Transformer config
├── quantized_flux2.py                   # Wrapper class
├── text_encoder/
│   ├── model.safetensors                # Quantized Qwen3 (4.5 GB)
│   ├── quanto_qmap.json                 # Text encoder quantization map
│   └── config.json
├── tokenizer/                           # Tokenizer files
└── vae/                                 # VAE (not quantized)
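
Each quanto_qmap.json maps module names to the weight/activation dtypes chosen for them, which is an easy way to see exactly which layers were quantized:

import json
from huggingface_hub import hf_hub_download

# Fetch and inspect the transformer's quantization map
qmap_path = hf_hub_download("aydin99/FLUX.2-klein-4B-int8", "quanto_qmap.json")
with open(qmap_path) as f:
    qmap = json.load(f)

print(f"{len(qmap)} quantized modules")
name, cfg = next(iter(qmap.items()))
print(name, cfg)  # e.g. a Linear module with weights="qint8"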

Credits

Base model: black-forest-labs/FLUX.2-klein-4B by Black Forest Labs. Quantization performed with optimum-quanto.

License

Apache 2.0 (same as base model)
