ERNIE-Image-NF4

ERNIE-Image-NF4 is a BitsAndBytes 4-bit NF4 quantized version of ERNIE-Image. The goal is to preserve the original inference workflow as much as possible while significantly reducing model size and deployment cost.

⚠️ Quality Disclaimer: This quantized model has NOT been evaluated for generation quality degradation. No systematic side-by-side comparison (e.g., FID, CLIP Score, human evaluation) has been conducted between the original ERNIE-Image and ERNIE-Image-NF4. NF4 quantization may introduce noticeable artifacts, loss of fine detail, reduced prompt adherence, or other quality regressions — especially in challenging scenarios such as complex compositions, small text rendering, or subtle color gradients. Users are strongly advised to run their own quality evaluations before relying on this model for production or quality-sensitive use cases.

Quantization Setup

This project uses the BitsAndBytes quantization path supported by Hugging Face Diffusers and Transformers:

  • Quantization type: 4-bit NF4
  • Double quantization: bnb_4bit_use_double_quant=True
  • Compute dtype: bfloat16
  • Quantized components: transformer, text_encoder, pe
  • Components kept in original precision: vae, scheduler, tokenizer, pe_tokenizer

Quantization metadata is stored in:

  • quantization_metadata.json
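The settings listed above map onto a standard BitsAndBytes 4-bit configuration. As an illustration of how such a config is expressed (not the exact script used to produce this checkpoint):

```python
import torch
from transformers import BitsAndBytesConfig  # Diffusers ships an equivalent BitsAndBytesConfig for the transformer

# Mirrors the setup above: 4-bit NF4, double quantization, bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

A config like this is passed as `quantization_config=...` to `from_pretrained` when quantizing each component.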

Performance Experiment

Experiment setup:

  • GPU: NVIDIA GeForce RTX 5090
  • Prompt: A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.
  • Resolution: 512x512
  • Inference steps: 50
  • guidance_scale=4.0
  • seed=1234
  • Original model execution: enable_model_cpu_offload()
  • Quantized model execution: quantized transformer, text_encoder, and pe on GPU, with vae kept in bfloat16

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
| --- | --- | --- | --- |
| Load time | 3.60 s | 7.70 s | Quantized model loads slower |
| Inference time | 50.52 s | 23.23 s | Quantized model is about 2.17x faster |
| Total time | 54.12 s | 30.93 s | Quantized model is about 1.75x faster overall |
| Peak reserved VRAM | 15.62 GiB | 9.95 GiB | Peak memory drops by about 36.29% |
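The speedup and memory figures in the table follow directly from the raw measurements; a quick arithmetic check:

```python
# Ratios derived from the single-image benchmark above.
inference_speedup = 50.52 / 23.23            # original vs. NF4 inference time
total_speedup = 54.12 / 30.93                # including load time
vram_drop_pct = (15.62 - 9.95) / 15.62 * 100 # relative drop in peak reserved VRAM

print(f"{inference_speedup:.2f}x, {total_speedup:.2f}x, {vram_drop_pct:.2f}%")
```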

10 Consecutive Inference Benchmark

To better reflect sustained generation throughput, an additional benchmark was run on the current machine with one-time model loading followed by 10 consecutive generations under identical settings.

Experiment setup:

  • GPU: NVIDIA GeForce RTX 5090
  • Prompt: A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.
  • Resolution: 512x512
  • Inference steps: 50
  • guidance_scale=4.0
  • Base seed: 1234, incremented by 1 for each image
  • Original model execution: enable_model_cpu_offload()
  • Quantized model execution: quantized transformer, text_encoder, and pe on GPU, with vae kept in bfloat16
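The seed schedule and timing loop described above can be sketched as follows; `generate` is a hypothetical stand-in for the actual pipeline call, not the benchmark script itself:

```python
import time

def run_benchmark(generate, base_seed=1234, n_images=10):
    """Time n_images consecutive generations; image i uses seed base_seed + i."""
    per_image = []
    for i in range(n_images):
        t0 = time.perf_counter()
        generate(seed=base_seed + i)  # e.g. pipe(prompt, generator=...) in the real run
        per_image.append(time.perf_counter() - t0)
    return sum(per_image), per_image

# Example with a no-op stand-in for the pipeline call:
total, per_image = run_benchmark(lambda seed: None)
print(f"total {total:.2f} s, avg {total / len(per_image):.2f} s/image")
```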

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
| --- | --- | --- | --- |
| Load time | 3.72 s | 7.77 s | Quantized model still loads slower |
| Total time for 10 images | 470.70 s | 241.54 s | Quantized model is about 1.95x faster |
| Average inference time per image | 47.07 s | 24.15 s | Quantized model is about 1.95x faster |
| Peak reserved VRAM | 15.66 GiB | 10.01 GiB | Reserved memory drops by about 36.07% |
| Peak allocated VRAM | 15.48 GiB | 9.25 GiB | Allocated memory drops by about 40.23% |

Interpretation:

  • The original model still needs CPU offload in this setup, so inference is noticeably slower
  • The quantized model has a more complex cold start, so load time remains longer
  • Once sampling starts, the quantized model is still significantly faster on the current machine
  • Lower peak VRAM remains one of the main reasons why the quantized version is easier to run efficiently

Inference Demo

Below is a minimal runnable example:

from pathlib import Path

import torch
from diffusers import AutoModel, ErnieImagePipeline
from transformers import AutoModel as TransformersAutoModel
from transformers import AutoModelForCausalLM

model_dir = Path("ERNIE-Image-NF4")

# NF4-quantized DiT backbone (diffusers); the bitsandbytes settings are
# restored from the config saved in the component directory.
transformer = AutoModel.from_pretrained(
    str(model_dir / "transformer"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# NF4-quantized text encoder (transformers).
text_encoder = TransformersAutoModel.from_pretrained(
    str(model_dir / "text_encoder"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# NF4-quantized pe model (used when use_pe=True below).
pe = AutoModelForCausalLM.from_pretrained(
    str(model_dir / "pe"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# Assemble the pipeline around the pre-loaded quantized components.
pipe = ErnieImagePipeline.from_pretrained(
    str(model_dir),
    transformer=transformer,
    text_encoder=text_encoder,
    pe=pe,
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)
# The VAE stays unquantized; move it to the GPU in bfloat16.
pipe.vae.to("cuda", dtype=torch.bfloat16)

image = pipe(
    prompt="An orange kitten sits on a wooden desk with a white sheet of paper in front of it that says Hello Ernie-Image NF4; the paper and text are centered in the frame, soft natural lighting, realistic style.",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,
    generator=torch.Generator(device="cuda").manual_seed(1234),
).images[0]

image.save(model_dir / "demo_output.png")
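The peak-VRAM numbers reported in the tables above can be reproduced with the standard torch.cuda counters; a sketch, with `to_gib` converting to the GiB units used in the tables:

```python
import torch

def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes), the unit used in the benchmark tables."""
    return num_bytes / 2**30

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run the pipeline call from the demo above ...
    print(f"peak reserved : {to_gib(torch.cuda.max_memory_reserved()):.2f} GiB")
    print(f"peak allocated: {to_gib(torch.cuda.max_memory_allocated()):.2f} GiB")
```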
