# Qwen3-4B-FP8
FP8 (W8A8) quantized version of Qwen/Qwen3-4B, created using llm-compressor with calibrated quantization.
## Overview
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B |
| Parameters | 4.41B |
| Quantization | FP8 (W8A8) |
| Format | compressed-tensors |
| Tool | llm-compressor |
| Disk Size | ~4.9 GB (2 shards) |
| VRAM | ~3.96 GB |
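The size figures above can be roughly sanity-checked from the parameter count: FP8 stores one byte per quantized weight, while parameters kept in higher precision (embeddings and other non-`Linear` weights) take two bytes each in BF16. The 4.0B/0.41B split below is an illustrative assumption, not the exact shard contents:

```python
# Rough size estimate for an FP8 (W8A8) quantized 4.41B-parameter model.
# The split between quantized Linear weights and BF16-kept parameters
# (embeddings etc.) is an illustrative assumption.
total_params = 4.41e9
bf16_kept = 0.41e9                    # assumed: embeddings and other non-Linear params
fp8_params = total_params - bf16_kept

bytes_total = fp8_params * 1 + bf16_kept * 2   # 1 B/param FP8, 2 B/param BF16
gib = bytes_total / 2**30
print(f"~{gib:.2f} GiB of weights")            # in the ballpark of the table above
```

On-disk size is slightly larger than the raw weight bytes because the checkpoint also stores per-tensor quantization scales and metadata.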
## Intended Use
Quantized text encoder for Flux 2 Klein 4B image generation pipelines. Architecturally identical to the Klein 4B text encoder.
## Quantization Details
- Scheme: FP8 (W8A8), 8-bit float (`float8_e4m3fn`) weights and activations
- Targets: all `Linear` layers (excluding `lm_head`)
- Calibration: 256 samples, sequential pipeline with CPU offloading
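A recipe along these lines reproduces the settings above with llm-compressor. This is a hedged sketch: the calibration dataset name, sequence length, and output directory are assumptions, not the exact values used to build this checkpoint.

```python
# Sketch of an llm-compressor FP8 (W8A8) recipe matching the settings above.
# Dataset, max_seq_length, and output_dir are assumptions; adjust to your setup.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",        # quantize all Linear layers...
    scheme="FP8",            # ...to static FP8 weights and activations (W8A8)
    ignore=["lm_head"],      # keep the output head in higher precision
)

oneshot(
    model="Qwen/Qwen3-4B",
    dataset="open_platypus",           # assumed calibration dataset
    recipe=recipe,
    num_calibration_samples=256,
    max_seq_length=2048,               # assumed
    output_dir="Qwen3-4B-FP8",
)
```

The static `FP8` scheme needs calibration data to fix activation scales; the calibration-free `FP8_DYNAMIC` variant computes activation scales at runtime instead.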
## Hardware Requirements
- Minimum: NVIDIA Ada Lovelace (CC 8.9) or Hopper (CC 9.0+) for native FP8 inference
- Fallback: Dequantizes to BF16 on older hardware
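The capability cutoff can be expressed as a small helper. On a live system you would read the tuple from `torch.cuda.get_device_capability()`; the GPU examples in the comments are for illustration:

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """True if a CUDA compute capability supports native FP8 matmuls
    (Ada Lovelace is CC 8.9; Hopper is CC 9.0 and above)."""
    return (major, minor) >= (8, 9)

# Ampere A100 is CC 8.0, Ada L40S is CC 8.9, Hopper H100 is CC 9.0.
print(supports_native_fp8(8, 0))  # False -> dequantizes to BF16
print(supports_native_fp8(8, 9))  # True  -> native FP8
print(supports_native_fp8(9, 0))  # True  -> native FP8
```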