prompt enhancer
Would you be willing to make a quantized version of the prompt enhancer for ERNIE-Image, or do you think your Qwen3.5 heretic version is sufficient? And would you be willing to make an NVFP4-mixed version of Qwen3.5-4B heretic? I've found that the smaller the model is, the faster it runs, and how it's quantized doesn't seem to matter much. However, I'm not sure if my observation is correct.
Actually, the BF16 version is much faster than FP8 or NVFP4, and more accurate.
So, unless your GPU can't handle BF16, you don't need the FP8 or NVFP4 model. (That's why I didn't upload NVFP4 of Qwen3.5-4B.)
Here's a test of Kewk/Heretical-Qwen3.5-9B and its quantized models.
Model Qwen35TEModel_ prepared for dynamic VRAM loading. 17947MB Staged. 0 patches attached. Force pre-loaded 105 weights: 534 KB.
Generating tokens: 63%|██████████████████████████████████████ | 1297/2048 [00:27<00:15, 47.17it/s] # BF16
0 models unloaded.
Model Qwen35TEModel_ prepared for dynamic VRAM loading. 11355MB Staged. 0 patches attached. Force pre-loaded 105 weights: 534 KB.
Generating tokens: 59%|████████████████████████████████████ | 1211/2048 [00:41<00:28, 29.29it/s] # FP8
0 models unloaded.
Model Qwen35TEModel_ prepared for dynamic VRAM loading. 9143MB Staged. 0 patches attached. Force pre-loaded 105 weights: 534 KB.
Generating tokens: 25%|████████████████ | 509/2048 [00:17<00:52, 29.47it/s] # NVFP4
Prompt executed in 86.99 seconds
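To put numbers on that, here is a minimal sketch that turns the it/s figures from the progress bars above into speedup ratios (the rates are copied from the logs; nothing else is measured):

```python
def bf16_speedup(rates):
    """How many times faster BF16 decodes than each quantized variant (it/s ratio)."""
    base = rates["BF16"]
    return {name: round(base / rate, 2) for name, rate in rates.items() if name != "BF16"}

# it/s values copied from the progress bars above (Heretical-Qwen3.5-9B)
rates = {"BF16": 47.17, "FP8": 29.29, "NVFP4": 29.47}
print(bf16_speedup(rates))  # BF16 decodes ~1.6x faster than either quant here
```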
Then it's hard for me to understand why the prompt enhancer runs so slowly. Isn't the official ComfyUI model FP16? Maybe because it takes over 10 GB of VRAM on my 8 GB device? But why? It's not that big. And would a quantized version work faster for me?
I uploaded FP8 of Prompt Enhancer.
https://huggingface.co/Bedovyy/ERNIE-Image-Quantized/blob/main/text_encoders/ernie-image-prompt-enhancer-fp8.safetensors
Edit) It's now merged, so just update ComfyUI. But before using it, you should check the change below (ErnieTEModel to ErnieTEModel_ in comfy/text_encoders/ernie.py):
https://github.com/Comfy-Org/ComfyUI/pull/13431
--- a/comfy/text_encoders/ernie.py
+++ b/comfy/text_encoders/ernie.py
@@ -35,4 +35,4 @@ def te(dtype_llama=None, llama_quantization_metadata=None):
model_options = model_options.copy()
model_options["quantization_metadata"] = llama_quantization_metadata
super().__init__(device=device, dtype=dtype, model_options=model_options)
- return ErnieTEModel
+ return ErnieTEModel_
By the way, the above fix also increases the generation speed of BF16, so try both BF16 and FP8.
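For context, this is the shape of the bug the diff fixes, sketched in isolation (a simplified stand-in for the real comfy classes, not the actual implementation): te() builds a subclass that injects the quantization metadata into model_options, but the factory originally returned the plain base class, so the metadata never reached the model.

```python
class ErnieTEModel:
    def __init__(self, device="cpu", dtype=None, model_options=None):
        self.model_options = model_options or {}

def te(dtype_llama=None, llama_quantization_metadata=None):
    class ErnieTEModel_(ErnieTEModel):
        def __init__(self, device="cpu", dtype=None, model_options={}):
            if llama_quantization_metadata is not None:
                model_options = model_options.copy()
                model_options["quantization_metadata"] = llama_quantization_metadata
            super().__init__(device=device, dtype=dtype, model_options=model_options)
    return ErnieTEModel_  # the bug: this line returned ErnieTEModel, discarding the subclass

cls = te(llama_quantization_metadata={"version": 1})
print(cls().model_options)  # the metadata now actually reaches the model
```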
Quick test (workflow included)
thank you so much for your work
Oh, it seems FP8 is faster on my device (ERNIE PE). It might be due to VRAM usage.
got prompt
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 6540MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 33%|████ | 673/2048 [01:37<03:18, 6.92it/s] # ComfyUI official PE model
Prompt executed in 102.92 seconds
got prompt
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 3654MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 28%|███ | 579/2048 [01:02<02:37, 9.31it/s] # your FP8 PE
Prompt executed in 68.36 seconds
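That matches the staged sizes in the logs: 6540 MB (FP16) vs 3654 MB (FP8) on an 8 GB card. A back-of-the-envelope check, assuming roughly 2 GB of the card is already taken by the desktop and activations (that overhead figure is a guess, not measured):

```python
def fits_in_vram(staged_mb, vram_mb=8192, overhead_mb=2048):
    """Rough check: do the staged weights fit next to the assumed overhead?"""
    return staged_mb + overhead_mb <= vram_mb

print(fits_in_vram(6540))  # FP16 PE -> False: weights must be streamed from RAM
print(fits_in_vram(3654))  # FP8 PE  -> True: stays resident, hence the higher it/s
```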
retry:
got prompt
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 6540MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 0%| | 0/2048 [00:00<?, ?it/s]
Generating tokens: 0%| | 1/2048 [00:03<1:46:35, 3.12s/it]
Generating tokens: 0%| | 2/2048 [00:03<48:41, 1.43s/it]
Generating tokens: 0%| | 3/2048 [00:03<28:46, 1.18it/s]
Generating tokens: 0%| | 4/2048 [00:03<20:49, 1.64it/s]
Generating tokens: 0%| | 5/2048 [00:03<14:38, 2.33it/s]
Generating tokens: 0%| | 6/2048 [00:03<11:00, 3.09it/s]
Generating tokens: 0%| | 7/2048 [00:04<08:41, 3.91it/s]
Generating tokens: 0%| | 8/2048 [00:04<07:08, 4.76it/s]
Generating tokens: 0%| | 9/2048 [00:04<06:05, 5.57it/s]
Generating tokens: 0%| | 10/2048 [00:04<05:22, 6.32it/s]
Generating tokens: 1%| | 11/2048 [00:04<04:50, 7.02it/s]
Generating tokens: 1%| | 12/2048 [00:04<04:29, 7.57it/s]
Generating tokens: 1%| | 13/2048 [00:04<04:16, 7.92it/s]
Generating tokens: 1%| | 14/2048 [00:04<04:05, 8.29it/s]
Generating tokens: 1%| | 15/2048 [00:04<03:59, 8.49it/s]
Generating tokens: 1%| | 16/2048 [00:05<03:55, 8.64it/s]
Generating tokens: 1%| | 17/2048 [00:05<03:51, 8.79it/s]
Generating tokens: 1%| | 18/2048 [00:05<03:49, 8.85it/s]
Generating tokens: 1%| | 19/2048 [00:05<03:46, 8.97it/s]
……
Generating tokens: 33%|████ | 670/2048 [01:23<02:56, 7.81it/s]
Generating tokens: 33%|████ | 671/2048 [01:24<02:54, 7.87it/s]
Generating tokens: 33%|████ | 672/2048 [01:24<02:54, 7.89it/s]
Generating tokens: 33%|████ | 673/2048 [01:24<02:53, 7.94it/s]
Generating tokens: 33%|████ | 673/2048 [01:24<02:52, 7.97it/s]
Prompt executed in 89.58 seconds # FP16
got prompt
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ErnieTEModel_
Model ErnieTEModel_ prepared for dynamic VRAM loading. 3654MB Staged. 0 patches attached. Force pre-loaded 53 weights: 318 KB.
Generating tokens: 0%| | 0/2048 [00:00<?, ?it/s]
Generating tokens: 0%| | 1/2048 [00:01<55:30, 1.63s/it]
Generating tokens: 0%| | 3/2048 [00:01<16:37, 2.05it/s]
Generating tokens: 0%| | 5/2048 [00:01<09:29, 3.58it/s]
Generating tokens: 0%| | 7/2048 [00:02<06:43, 5.06it/s]
Generating tokens: 0%| | 9/2048 [00:02<05:16, 6.45it/s]
Generating tokens: 1%| | 11/2048 [00:02<04:26, 7.64it/s]
Generating tokens: 1%| | 13/2048 [00:02<03:59, 8.51it/s]
Generating tokens: 1%| | 15/2048 [00:02<03:36, 9.39it/s]
Generating tokens: 1%| | 17/2048 [00:03<03:25, 9.88it/s]
Generating tokens: 1%| | 19/2048 [00:03<03:20, 10.12it/s]
Generating tokens: 1%| | 21/2048 [00:03<03:21, 10.05it/s]
Generating tokens: 1%| | 23/2048 [00:03<03:13, 10.47it/s]
Generating tokens: 1%| | 25/2048 [00:03<03:19, 10.13it/s]
Generating tokens: 1%|▏ | 27/2048 [00:03<03:16, 10.31it/s]
Generating tokens: 1%|▏ | 29/2048 [00:04<03:10, 10.58it/s]
Generating tokens: 2%|▏ | 31/2048 [00:04<03:08, 10.69it/s]
Generating tokens: 2%|▏ | 33/2048 [00:04<03:05, 10.86it/s]
Generating tokens: 2%|▏ | 35/2048 [00:04<03:00, 11.15it/s]
Generating tokens: 2%|▏ | 37/2048 [00:04<03:01, 11.07it/s]
Generating tokens: 2%|▏ | 39/2048 [00:05<02:56, 11.37it/s]
Generating tokens: 2%|▏ | 41/2048 [00:05<02:56, 11.39it/s]
Generating tokens: 2%|▏ | 43/2048 [00:05<02:54, 11.46it/s]
Generating tokens: 2%|▏ | 45/2048 [00:05<03:00, 11.12it/s]
Generating tokens: 2%|▏ | 47/2048 [00:05<03:01, 11.05it/s]
Generating tokens: 2%|▏ | 49/2048 [00:05<03:01, 11.00it/s]
Generating tokens: 2%|▏ | 51/2048 [00:06<02:57, 11.24it/s]
……
Generating tokens: 28%|███ | 576/2048 [00:59<03:04, 7.97it/s]
Generating tokens: 28%|███ | 577/2048 [00:59<03:02, 8.04it/s]
Generating tokens: 28%|███ | 578/2048 [00:59<03:00, 8.13it/s]
Generating tokens: 28%|███ | 579/2048 [00:59<03:01, 8.09it/s]
Generating tokens: 28%|███ | 579/2048 [00:59<02:32, 9.66it/s]
Prompt executed in 64.96 seconds # FP8
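Reading the two runs as ratios (wall-clock times copied from the "Prompt executed" lines above):

```python
def speedup(slow_s, fast_s):
    """Ratio of two wall-clock times, rounded to 2 decimals."""
    return round(slow_s / fast_s, 2)

# first run: official FP16 PE vs the FP8 PE; second pair: the retry of the same models
print(speedup(102.92, 68.36))  # ~1.51x faster with FP8
print(speedup(89.58, 64.96))   # ~1.38x faster on the retry
```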

