ERNIE-Image-NF4

ERNIE-Image-NF4 is a BitsAndBytes 4-bit NF4 quantized version of ERNIE-Image. The goal is to preserve the original inference workflow as much as possible while significantly reducing model size and deployment cost.

⚠️ Quality Disclaimer: This quantized model has NOT been evaluated for generation quality degradation. No systematic side-by-side comparison (e.g., FID, CLIP Score, human evaluation) has been conducted between the original ERNIE-Image and ERNIE-Image-NF4. NF4 quantization may introduce noticeable artifacts, loss of fine detail, reduced prompt adherence, or other quality regressions — especially in challenging scenarios such as complex compositions, small text rendering, or subtle color gradients. Users are strongly advised to run their own quality evaluations before relying on this model for production or quality-sensitive use cases.

Quantization Setup

This project uses the BitsAndBytes quantization path supported by Hugging Face Diffusers and Transformers:

  • Quantization type: 4-bit NF4
  • Double quantization: bnb_4bit_use_double_quant=True
  • Compute dtype: bfloat16
  • Quantized components: transformer, text_encoder, pe
  • Components kept in original precision: vae, scheduler, tokenizer, pe_tokenizer

Quantization metadata is stored in:

  • quantization_metadata.json
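The settings listed above map onto a standard BitsAndBytes 4-bit configuration. As an illustration of how such a config is expressed (not the exact script used to produce this checkpoint):

```python
import torch
from transformers import BitsAndBytesConfig  # Diffusers ships an equivalent BitsAndBytesConfig for the transformer

# Mirrors the setup above: 4-bit NF4, double quantization, bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

A config like this is passed as `quantization_config=...` to `from_pretrained` when quantizing each component.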

Performance Experiment

Experiment setup:

  • GPU: NVIDIA GeForce RTX 5090
  • Prompt: A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.
  • Resolution: 512x512
  • Inference steps: 50
  • guidance_scale=4.0
  • seed=1234
  • Original model execution: enable_model_cpu_offload()
  • Quantized model execution: quantized transformer, text_encoder, and pe on GPU, with vae kept in bfloat16

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
| --- | --- | --- | --- |
| Load time | 3.60 s | 7.70 s | Quantized model loads slower |
| Inference time | 50.52 s | 23.23 s | Quantized model is about 2.17x faster |
| Total time | 54.12 s | 30.93 s | Quantized model is about 1.75x faster overall |
| Peak reserved VRAM | 15.62 GiB | 9.95 GiB | Peak memory drops by about 36.29% |
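The speedup and memory figures in the table follow directly from the raw measurements; a quick arithmetic check:

```python
# Ratios derived from the single-image benchmark above.
inference_speedup = 50.52 / 23.23            # original vs. NF4 inference time
total_speedup = 54.12 / 30.93                # including load time
vram_drop_pct = (15.62 - 9.95) / 15.62 * 100 # relative drop in peak reserved VRAM

print(f"{inference_speedup:.2f}x, {total_speedup:.2f}x, {vram_drop_pct:.2f}%")
```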

10 Consecutive Inference Benchmark

To better reflect sustained generation throughput, an additional benchmark was run on the current machine with one-time model loading followed by 10 consecutive generations under identical settings.

Experiment setup:

  • GPU: NVIDIA GeForce RTX 5090
  • Prompt: A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.
  • Resolution: 512x512
  • Inference steps: 50
  • guidance_scale=4.0
  • Base seed: 1234, incremented by 1 for each image
  • Original model execution: enable_model_cpu_offload()
  • Quantized model execution: quantized transformer, text_encoder, and pe on GPU, with vae kept in bfloat16
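The seed schedule and timing loop described above can be sketched as follows; `generate` is a hypothetical stand-in for the actual pipeline call, not the benchmark script itself:

```python
import time

def run_benchmark(generate, base_seed=1234, n_images=10):
    """Time n_images consecutive generations; image i uses seed base_seed + i."""
    per_image = []
    for i in range(n_images):
        t0 = time.perf_counter()
        generate(seed=base_seed + i)  # e.g. pipe(prompt, generator=...) in the real run
        per_image.append(time.perf_counter() - t0)
    return sum(per_image), per_image

# Example with a no-op stand-in for the pipeline call:
total, per_image = run_benchmark(lambda seed: None)
print(f"total {total:.2f} s, avg {total / len(per_image):.2f} s/image")
```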

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
| --- | --- | --- | --- |
| Load time | 3.72 s | 7.77 s | Quantized model still loads slower |
| Total time for 10 images | 470.70 s | 241.54 s | Quantized model is about 1.95x faster |
| Average inference time per image | 47.07 s | 24.15 s | Quantized model is about 1.95x faster |
| Peak reserved VRAM | 15.66 GiB | 10.01 GiB | Reserved memory drops by about 36.07% |
| Peak allocated VRAM | 15.48 GiB | 9.25 GiB | Allocated memory drops by about 40.23% |

Interpretation:

  • The original model still needs CPU offload in this setup, so inference is noticeably slower
  • The quantized model has a more complex cold start, so load time remains longer
  • Once sampling starts, the quantized model is still significantly faster on the current machine
  • Lower peak VRAM remains one of the main reasons why the quantized version is easier to run efficiently

Inference Demo

Below is a minimal runnable example:

from pathlib import Path

import torch
from diffusers import AutoModel, ErnieImagePipeline
from transformers import AutoModel as TransformersAutoModel
from transformers import AutoModelForCausalLM

model_dir = Path("ERNIE-Image-NF4")

# NF4-quantized DiT backbone (diffusers); the bitsandbytes settings are
# restored from the config saved in the component directory.
transformer = AutoModel.from_pretrained(
    str(model_dir / "transformer"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# NF4-quantized text encoder (transformers).
text_encoder = TransformersAutoModel.from_pretrained(
    str(model_dir / "text_encoder"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# NF4-quantized pe model (used when use_pe=True below).
pe = AutoModelForCausalLM.from_pretrained(
    str(model_dir / "pe"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# Assemble the pipeline around the pre-loaded quantized components.
pipe = ErnieImagePipeline.from_pretrained(
    str(model_dir),
    transformer=transformer,
    text_encoder=text_encoder,
    pe=pe,
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)
# The VAE stays unquantized; move it to the GPU in bfloat16.
pipe.vae.to("cuda", dtype=torch.bfloat16)

image = pipe(
    prompt="An orange kitten sits on a wooden desk with a white sheet of paper in front of it that says Hello Ernie-Image NF4; the paper and text are centered in the frame, soft natural lighting, realistic style.",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,
    generator=torch.Generator(device="cuda").manual_seed(1234),
).images[0]

image.save(model_dir / "demo_output.png")
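The peak-VRAM numbers reported in the tables above can be reproduced with the standard torch.cuda counters; a sketch, with `to_gib` converting to the GiB units used in the tables:

```python
import torch

def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes), the unit used in the benchmark tables."""
    return num_bytes / 2**30

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run the pipeline call from the demo above ...
    print(f"peak reserved : {to_gib(torch.cuda.max_memory_reserved()):.2f} GiB")
    print(f"peak allocated: {to_gib(torch.cuda.max_memory_allocated()):.2f} GiB")
```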
