# FLUX.2-klein-4B Int8 Quantized

This is a fully int8 weight-only quantized version of black-forest-labs/FLUX.2-klein-4B, produced with optimum-quanto. Both the transformer and the text encoder (Qwen3) are quantized. Works on CUDA and Apple Silicon (MPS).
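The export script is not shipped with this repo, but the int8 weight-only flow in optimum-quanto looks roughly like the sketch below (shown for the text encoder; the `subfolder` layout of the base repo and the output paths are assumptions, and the transformer is handled the same way):

```python
import json

import torch
from optimum.quanto import freeze, qint8, quantization_map, quantize
from safetensors.torch import save_file
from transformers import Qwen3ForCausalLM

# Load the original bf16 text encoder from the base repo
# (subfolder name assumed from the usual diffusers pipeline layout).
text_encoder = Qwen3ForCausalLM.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    subfolder="text_encoder",
    torch_dtype=torch.bfloat16,
)

# Int8 weight-only quantization: replace Linear weights with qint8, then freeze them in place.
quantize(text_encoder, weights=qint8)
freeze(text_encoder)

# Save the quantized weights plus the quantization map that requantize() needs at load time
# (output paths here are illustrative).
save_file(text_encoder.state_dict(), "text_encoder/model.safetensors")
with open("text_encoder/quanto_qmap.json", "w") as f:
    json.dump(quantization_map(text_encoder), f)
```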
## Memory Savings

| Component | Original (bf16) | Quantized (int8) |
|---|---|---|
| Transformer | 7.4 GB | 3.6 GB |
| Text Encoder (Qwen3) | 8.0 GB | 4.5 GB |
| VAE | 0.16 GB | 0.16 GB |
| Total | ~15.6 GB | ~8.3 GB |
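Once the components from the Quick Start below are loaded, a rough way to sanity-check the quantized footprint on CUDA is to diff allocated memory around the transfer to the GPU (a sketch; allocator overhead means it will not match the table exactly):

```python
# Rough footprint check on CUDA: measure allocated memory before/after moving
# the quantized transformer to the GPU (qtransformer comes from the Quick Start below).
before = torch.cuda.memory_allocated()
qtransformer.to(device="cuda", dtype=torch.bfloat16)
after = torch.cuda.memory_allocated()
print(f"Transformer footprint: {(after - before) / 1e9:.2f} GB")
```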
## Usage

### Installation

```bash
pip install torch diffusers transformers accelerate optimum-quanto huggingface-hub safetensors
pip install git+https://github.com/huggingface/diffusers.git@flux2-klein
```

Note: FLUX.2-klein support requires the `flux2-klein` branch of diffusers until it is merged into the main release.
### Quick Start

```python
import importlib.util
import json

import torch
from accelerate import init_empty_weights
from diffusers import Flux2KleinPipeline
from huggingface_hub import hf_hub_download, snapshot_download
from optimum.quanto import requantize
from safetensors.torch import load_file
from transformers import AutoConfig, AutoTokenizer, Qwen3ForCausalLM

# Download the quantized model repo
model_path = snapshot_download("aydin99/FLUX.2-klein-4B-int8")

# Load the wrapper class shipped with the repo
spec = importlib.util.spec_from_file_location(
    "quantized_flux2",
    hf_hub_download("aydin99/FLUX.2-klein-4B-int8", "quantized_flux2.py"),
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
QuantizedFlux2Transformer2DModel = module.QuantizedFlux2Transformer2DModel

# Choose device
device = "cuda" if torch.cuda.is_available() else "mps"

# Load the quantized transformer
qtransformer = QuantizedFlux2Transformer2DModel.from_pretrained(model_path)
qtransformer.to(device=device, dtype=torch.bfloat16)

# Load the quantized text encoder: build an empty Qwen3 skeleton, then requantize it
# from the saved int8 weights and quantization map
config = AutoConfig.from_pretrained(f"{model_path}/text_encoder", trust_remote_code=True)
with init_empty_weights():
    text_encoder = Qwen3ForCausalLM(config)
with open(f"{model_path}/text_encoder/quanto_qmap.json", "r") as f:
    qmap = json.load(f)
state_dict = load_file(f"{model_path}/text_encoder/model.safetensors")
requantize(text_encoder, state_dict=state_dict, quantization_map=qmap)
text_encoder.eval()
text_encoder.to(device, dtype=torch.bfloat16)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(f"{model_path}/tokenizer")

# Load the pipeline shell from the base repo (VAE + scheduler only)
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    transformer=None,
    text_encoder=None,
    tokenizer=None,
    torch_dtype=torch.bfloat16,
)

# Inject the quantized components
pipe.transformer = qtransformer._wrapped
pipe.text_encoder = text_encoder
pipe.tokenizer = tokenizer
pipe.to(device)

# Generate!
image = pipe(
    prompt="A futuristic city at sunset, cyberpunk style",
    height=512,
    width=512,
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("output.png")
```
## Performance

| Hardware | Time (512x512, 4 steps) |
|---|---|
| CUDA (RTX 3090) | ~2-3 seconds |
| Apple Silicon (M1/M2/M3) | ~8 seconds |
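To benchmark on your own hardware, time a generation after a warm-up call (a rough sketch; the first call includes one-time setup and should be discarded, and peak-memory reporting here is CUDA-only):

```python
import time

# Warm-up: the first call is not representative.
_ = pipe(prompt="warm-up", height=512, width=512, num_inference_steps=4, guidance_scale=0.0)

if device == "cuda":
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

start = time.perf_counter()
_ = pipe(
    prompt="A futuristic city at sunset, cyberpunk style",
    height=512,
    width=512,
    num_inference_steps=4,
    guidance_scale=0.0,
)
if device == "cuda":
    torch.cuda.synchronize()
print(f"Generation time: {time.perf_counter() - start:.2f} s")

if device == "cuda":
    print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```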
## Limitations

- Int4 not supported on MPS: PyTorch's int4 packed matmul has bugs on Apple Silicon. Int8 works on both CUDA and MPS.
- Requires the diffusers `flux2-klein` branch: FLUX.2-klein support has not yet been merged into a main diffusers release (a quick import check is shown below).
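A small convenience check (not part of the original pipeline) to verify that your diffusers install exposes FLUX.2-klein support:

```python
# Verify that the installed diffusers build includes the FLUX.2-klein pipeline.
try:
    from diffusers import Flux2KleinPipeline  # noqa: F401
except ImportError:
    raise SystemExit(
        "Flux2KleinPipeline not found. Install diffusers from the flux2-klein branch:\n"
        "  pip install git+https://github.com/huggingface/diffusers.git@flux2-klein"
    )
print("FLUX.2-klein support is available.")
```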
## Files

```
├── diffusion_pytorch_model.safetensors   # Quantized transformer (3.6 GB)
├── quanto_qmap.json                       # Transformer quantization map
├── config.json                            # Transformer config
├── quantized_flux2.py                     # Wrapper class
├── text_encoder/
│   ├── model.safetensors                  # Quantized Qwen3 (4.5 GB)
│   ├── quanto_qmap.json                   # Text encoder quantization map
│   └── config.json
├── tokenizer/                             # Tokenizer files
└── vae/                                   # VAE (not quantized)
```
## Credits

- Original model: Black Forest Labs
- Quantization: optimum-quanto

## License

Apache 2.0 (same as the base model)