# Qwen3-4B-FP8
FP8 (W8A8) quantized version of Qwen/Qwen3-4B, created using llm-compressor with calibrated quantization.
## Overview
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B |
| Parameters | 4.41B |
| Quantization | FP8 (W8A8) |
| Format | compressed-tensors |
| Tool | llm-compressor |
| Disk Size | ~4.9 GB (2 shards) |
| VRAM | ~3.96 GB |
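The size figures above can be roughly sanity-checked from the parameter count: FP8 stores one byte per quantized weight, while parameters kept in higher precision (embeddings and other non-`Linear` weights) take two bytes each in BF16. The 4.0B/0.41B split below is an illustrative assumption, not the exact shard contents:

```python
# Rough size estimate for an FP8 (W8A8) quantized 4.41B-parameter model.
# The split between quantized Linear weights and BF16-kept parameters
# (embeddings etc.) is an illustrative assumption.
total_params = 4.41e9
bf16_kept = 0.41e9                    # assumed: embeddings and other non-Linear params
fp8_params = total_params - bf16_kept

bytes_total = fp8_params * 1 + bf16_kept * 2   # 1 B/param FP8, 2 B/param BF16
gib = bytes_total / 2**30
print(f"~{gib:.2f} GiB of weights")            # in the ballpark of the table above
```

On-disk size is slightly larger than the raw weight bytes because the checkpoint also stores per-tensor quantization scales and metadata.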
## Intended Use
Quantized text encoder for Flux 2 Klein 4B image generation pipelines. Architecturally identical to the Klein 4B text encoder.
## Quantization Details
- Scheme: FP8 (W8A8), 8-bit float (`float8_e4m3fn`) weights and activations
- Targets: all `Linear` layers (excluding `lm_head`)
- Calibration: 256 samples, sequential pipeline with CPU offloading
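A recipe along these lines reproduces the settings above with llm-compressor. This is a hedged sketch: the calibration dataset name, sequence length, and output directory are assumptions, not the exact values used to build this checkpoint.

```python
# Sketch of an llm-compressor FP8 (W8A8) recipe matching the settings above.
# Dataset, max_seq_length, and output_dir are assumptions; adjust to your setup.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",        # quantize all Linear layers...
    scheme="FP8",            # ...to static FP8 weights and activations (W8A8)
    ignore=["lm_head"],      # keep the output head in higher precision
)

oneshot(
    model="Qwen/Qwen3-4B",
    dataset="open_platypus",           # assumed calibration dataset
    recipe=recipe,
    num_calibration_samples=256,
    max_seq_length=2048,               # assumed
    output_dir="Qwen3-4B-FP8",
)
```

The static `FP8` scheme needs calibration data to fix activation scales; the calibration-free `FP8_DYNAMIC` variant computes activation scales at runtime instead.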
## Hardware Requirements
- Minimum: NVIDIA Ada Lovelace (CC 8.9) or Hopper (CC 9.0+) for native FP8 inference
- Fallback: Dequantizes to BF16 on older hardware
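The capability cutoff can be expressed as a small helper. On a live system you would read the tuple from `torch.cuda.get_device_capability()`; the GPU examples in the comments are for illustration:

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """True if a CUDA compute capability supports native FP8 matmuls
    (Ada Lovelace is CC 8.9; Hopper is CC 9.0 and above)."""
    return (major, minor) >= (8, 9)

# Ampere A100 is CC 8.0, Ada L40S is CC 8.9, Hopper H100 is CC 9.0.
print(supports_native_fp8(8, 0))  # False -> dequantizes to BF16
print(supports_native_fp8(8, 9))  # True  -> native FP8
print(supports_native_fp8(9, 0))  # True  -> native FP8
```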