prompt enhancer

#3
by jimzlf - opened

Would you be willing to make a quantized version of the prompt enhancer for ERNIE-Image, or do you think your Qwen3.5 heretic version is sufficient? And would you be willing to make an NVFP4-mixed version of Qwen3.5-4B heretic? It seems to me that the smaller the model, the faster it runs, and how it's quantized doesn't matter much. However, I'm not sure my observation is correct.

Actually, the BF16 version is much faster than FP8 or NVFP4, and more accurate.
So unless your GPU can't handle BF16, you don't need the FP8 or NVFP4 models. (That's why I didn't upload an NVFP4 version of Qwen3.5-4B.)

Here's a test of Kewk/Heretical-Qwen3.5-9B and its quantized versions.

Model Qwen35TEModel_ prepared for dynamic VRAM loading. 17947MB Staged. 0 patches attached. Force pre-loaded 105 weights: 534 KB.
Generating tokens:  63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰                      | 1297/2048 [00:27<00:15, 47.17it/s] # BF16
0 models unloaded.
Model Qwen35TEModel_ prepared for dynamic VRAM loading. 11355MB Staged. 0 patches attached. Force pre-loaded 105 weights: 534 KB.
Generating tokens:  59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                        | 1211/2048 [00:41<00:28, 29.29it/s] # FP8
0 models unloaded.
Model Qwen35TEModel_ prepared for dynamic VRAM loading. 9143MB Staged. 0 patches attached. Force pre-loaded 105 weights: 534 KB.
Generating tokens:  25%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–                                             | 509/2048 [00:17<00:52, 29.47it/s] # NVFP4
Prompt executed in 86.99 seconds

Then it's hard for me to understand why the prompt enhancer runs so slowly. Isn't the official ComfyUI model FP16? Maybe because it takes over 10 GB of VRAM on my 8 GB device? But why? It's not that big. Would a quantized version work faster for me?

I uploaded FP8 of Prompt Enhancer.
https://huggingface.co/Bedovyy/ERNIE-Image-Quantized/blob/main/text_encoders/ernie-image-prompt-enhancer-fp8.safetensors
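If you want to check what's inside a downloaded checkpoint before wiring it in, the safetensors format stores a JSON header at the front of the file, so you can list the tensor dtypes without loading any weights. A stdlib-only sketch (the helper name is mine, not part of any library):

```python
import json
import struct

def safetensors_dtypes(path):
    """Summarize tensor dtypes from a .safetensors header without
    loading any weights. Returns a {dtype: tensor_count} dict."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional metadata entry, not a tensor
            continue
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts
```

An FP8 file should typically report mostly F8_E4M3 tensors, with a few layers kept in higher precision.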

Edit) It's now merged, so just update ComfyUI.
But before using it, you should apply the fix below (change ErnieTEModel to ErnieTEModel_ in comfy/text_encoders/ernie.py):
https://github.com/Comfy-Org/ComfyUI/pull/13431

--- a/comfy/text_encoders/ernie.py
+++ b/comfy/text_encoders/ernie.py
@@ -35,4 +35,4 @@ def te(dtype_llama=None, llama_quantization_metadata=None):
                 model_options = model_options.copy()
                 model_options["quantization_metadata"] = llama_quantization_metadata
             super().__init__(device=device, dtype=dtype, model_options=model_options)
-    return ErnieTEModel
+    return ErnieTEModel_
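If you'd rather apply that one-line change from a script than edit the file by hand, here's a minimal sketch (it assumes you run it from the root of your ComfyUI checkout; the function name is mine):

```python
import re
from pathlib import Path

def patch_ernie(path="comfy/text_encoders/ernie.py"):
    """Swap the returned class for the underscore variant."""
    p = Path(path)
    # The word boundary \b won't match before the trailing underscore,
    # so an already-patched file is left unchanged (safe to re-run).
    patched = re.sub(r"return ErnieTEModel\b", "return ErnieTEModel_", p.read_text())
    p.write_text(patched)
```

Once the PR is merged upstream, updating ComfyUI makes this unnecessary.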

By the way, the above fix also increases the generation speed of BF16, so try both BF16 and FP8.

Quick test (workflow included)

[Attached image: Ernie-Image-Turbo_00161_]

Thank you so much for your work.

Oh, FP8 seems faster on my device (ERNIE PE). Might be due to VRAM usage.

got prompt
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 6540MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 33%|β–ˆβ–ˆβ–ˆβ–Ž | 673/2048 [01:37<03:18, 6.92it/s] #ComfyUI official PE model
Prompt executed in 102.92 seconds

got prompt
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 3654MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 28%|β–ˆβ–ˆβ–Š | 579/2048 [01:02<02:37, 9.31it/s] #your fp8 PE
Prompt executed in 68.36 seconds

retry:
got prompt
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 6540MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 0%| | 1/2048 [00:03<1:46:35, 3.12s/it]
……
Generating tokens: 33%|β–ˆβ–ˆβ–ˆβ–Ž | 673/2048 [01:24<02:52, 7.97it/s]

Prompt executed in 89.58 seconds #fp16

got prompt
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 3654MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 0%| | 1/2048 [00:01<55:30, 1.63s/it]
……
Generating tokens: 28%|β–ˆβ–ˆβ–Š | 579/2048 [00:59<02:32, 9.66it/s]
Prompt executed in 64.96 seconds #fp8
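To put the two runs on a common scale, the effective decode throughput can be computed directly from the final progress-bar line of each run (token counts and elapsed generation times read off the logs above; the seconds are approximate):

```python
# (tokens generated, elapsed generation seconds), read off the last
# progress-bar line of each run.
runs = {
    "fp16": (673, 84.0),  # 673/2048 at 01:24
    "fp8": (579, 59.0),   # 579/2048 at 00:59
}
for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens / seconds:.1f} tok/s")
```

That works out to roughly 8.0 tok/s for FP16 versus 9.8 tok/s for FP8 on this 8 GB device, consistent with the smaller staged size (3654 MB vs 6540 MB) leaving more VRAM headroom.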
