Qwen3-4B-FP8

An FP8 (W8A8) quantized version of Qwen/Qwen3-4B, produced with llm-compressor using calibrated quantization.

Overview

Property        Value
------------    ------------------
Base Model      Qwen/Qwen3-4B
Parameters      4.41B
Quantization    FP8 (W8A8)
Format          compressed-tensors
Tool            llm-compressor
Disk Size       ~4.9 GB (2 shards)
VRAM            ~3.96 GB

Intended Use

Quantized text encoder for Flux 2 Klein 4B image generation pipelines. Architecturally identical to the Klein 4B text encoder.
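
For orientation, here is a minimal sketch of loading the checkpoint as a standalone text encoder through transformers' compressed-tensors integration (requires the compressed-tensors package). Taking the last-layer hidden states as the text embedding is an assumption for illustration; how the Klein pipeline actually consumes the encoder outputs may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vistralis/Qwen3-4B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Use the decoder as a text encoder: grab hidden states for the prompt.
prompt = "a watercolor fox in a snowy forest"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last-layer hidden states, shape (1, seq_len, hidden_dim).
text_embeds = out.hidden_states[-1]
print(text_embeds.shape)
```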

Quantization Details

  • Scheme: FP8 — 8-bit float weights and activations (float8_e4m3fn)
  • Targets: All Linear layers (excluding lm_head)
  • Calibration: 256 samples, sequential pipeline with CPU offloading (a recipe sketch follows below)
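
The exact quantization script for this checkpoint isn't published. Below is a minimal llm-compressor sketch that matches the details above; the calibration dataset (open_platypus) and the 2048-token sequence length are assumptions, and the sequential/CPU-offload settings mentioned above are noted rather than reproduced.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Static FP8 (W8A8, float8_e4m3fn) on every Linear layer except lm_head.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",      # assumed calibration set
    num_calibration_samples=256,  # matches the card
    max_seq_length=2048,          # assumed
)

# Save in compressed-tensors format.
model.save_pretrained("Qwen3-4B-FP8", save_compressed=True)
tokenizer.save_pretrained("Qwen3-4B-FP8")
```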

Hardware Requirements

  • Minimum: NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Hopper, or newer) for native FP8 inference
  • Fallback: dequantizes to BF16 on older hardware (see the loading example below)
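
A minimal loading sketch with vLLM, which reads compressed-tensors checkpoints natively; behavior on pre-Ada GPUs follows the BF16 fallback note above.

```python
from vllm import LLM, SamplingParams

# vLLM picks up the compressed-tensors FP8 config from the checkpoint.
llm = LLM(model="vistralis/Qwen3-4B-FP8")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Describe a quiet mountain lake."], params)
print(outputs[0].outputs[0].text)
```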