# JoyAI-Image-Edit – FP8 Quantized
FP8 (float8_e4m3fn) quantized version of the JoyAI-Image-Edit model transformer.
## What's Different
| Component | Original | FP8 |
|---|---|---|
| DiT Transformer | 32.5 GB (bf16) | 16.3 GB (fp8_e4m3fn) |
| VAE | 485 MB (bf16) | 485 MB (bf16, unchanged) |
| Text Encoder (Qwen3VL) | ~17 GB (bf16) | ~17 GB (bf16, unchanged) |
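The sizes in the table follow directly from bytes per element: bf16 stores each weight in 2 bytes, float8 in 1, so the quantized transformer is roughly half the size (a hair over half here, since biases and norms stay in bf16). A quick back-of-the-envelope check:

```python
BF16_BYTES, FP8_BYTES = 2, 1

bf16_size_gb = 32.5                            # transformer in bf16
approx_params_b = bf16_size_gb / BF16_BYTES    # ~16.25B parameters
fp8_size_gb = approx_params_b * FP8_BYTES      # ~16.25 GB, close to the 16.3 GB shipped
```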
The transformer's 407 linear-layer weight matrices (2D or higher, ≥1024 elements) are stored in float8_e4m3fn. All biases, normalization weights, and small tensors remain in bfloat16.
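That selection rule can be sketched as a small predicate (a hypothetical helper, not the actual conversion script): a tensor is quantized only if it is a weight matrix of rank ≥ 2 with at least 1024 elements.

```python
import torch

def should_quantize(name: str, t: torch.Tensor) -> bool:
    """Selection rule described above: quantize only linear-layer
    weight matrices; biases, norms, and small tensors stay bf16."""
    return name.endswith(".weight") and t.dim() >= 2 and t.numel() >= 1024
```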
## Why FP8
The full bf16 transformer is 32.5 GB, too large for consumer GPUs such as the RTX 4090 (24 GB VRAM). At FP8, the transformer fits in ~16 GB, leaving headroom for activations and the VAE during inference. The text encoder (Qwen3VL) can be offloaded to CPU after conditioning.
The RTX 4090 (Ada Lovelace, SM89) also has native FP8 hardware support.
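That support can be probed at runtime. The sketch below (a hypothetical helper, assuming torch ≥ 2.1) checks both that the fp8 storage dtype exists and that the GPU's compute capability is SM89 or newer:

```python
import torch

def fp8_support() -> tuple[bool, bool]:
    """Return (can_store_fp8, has_native_fp8_compute).

    GPUs without native FP8 compute can typically still load fp8
    weights (upcasting them for the matmuls); native FP8 tensor
    cores need compute capability 8.9+ (Ada Lovelace / Hopper).
    """
    can_store = hasattr(torch, "float8_e4m3fn")  # torch >= 2.1
    if not (can_store and torch.cuda.is_available()):
        return can_store, False
    return can_store, torch.cuda.get_device_capability() >= (8, 9)
```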
## Inference Tool
A Gradio UI, CLI, and REST API for running inference are available at SanDiegoDude/JoyAI-Image, a handy way to run the model until proper ComfyUI integration lands. Features include auto-download from HuggingFace, multiple memory modes, and bitsandbytes quantization for the text encoder.
### Quick start
```bash
git clone https://github.com/SanDiegoDude/JoyAI-Image.git
cd JoyAI-Image
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Models auto-download on first run (FP8 transformer is the default)
# Default: FP8 transformer + 8-bit text encoder, offload mode
python app.py

# Minimum VRAM (~13 GB active, fits RTX 4090)
python app.py --nf4-dit --4bit-vlm

# CLI inference
python inference.py \
  --prompt "Turn the plate blue" \
  --image test_images/test_1.jpg \
  --output result.png \
  --steps 18 --guidance-scale 4.0 --seed 42

# Headless REST API (with optional ComfyUI connector node)
python app.py --headless-api 7500
```
## Also available
- Full bf16 safetensors: original precision, and the source weights for runtime NF4 quantization (~8 GB)
## Conversion
Quantized using a straightforward per-tensor cast to torch.float8_e4m3fn, with values clamped to the representable range (±448). This matches the approach used by ComfyUI and other diffusion-model tooling.
## Model Details
JoyAI-Image is a unified multimodal foundation model comprising:
- 8B MLLM (Qwen3VL) for text/image understanding
- 16B MMDiT for image generation (this is the quantized component)
- Wan2.1 VAE for encoding/decoding
See the original repository for full documentation.
## Credits
Original model by JD Open Source, released under Apache 2.0. FP8 conversion by SanDiegoDude.
## Model tree
SanDiegoDude/JoyAI-Image-Edit-FP8 is quantized from the base model jdopensource/JoyAI-Image-Edit.