Clip Model Comparisons
I've written a bunch of ComfyUI workflows to compare various things like Models, Clips, VAEs, and lots of other things... Here are a few samples of just the Clip model changes. In order, it's the stock Qwen3 4b that Z-Image is paired with, then 2 different versions of Josiefied Qwen3 4b (the first is the version included in JoZiMagic AIO, the 2nd is one I'm testing to see if I want to replace the first with it), and the 4th is Z-Image-Engineer.
All are otherwise the same: bf16 Z-Image model/stock VAE/seed/prompt/steps/sampler/etc/etc.
@BennyDaBall let me know if you want the WF, or any particular images, I have the full sized ones, but only posting a pile of comparisons here (and will likely post some on Civitai...)
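In case anyone wants to script this kind of grid instead of swapping clips by hand, here's a rough sketch against ComfyUI's HTTP API. It assumes the workflow was exported in "API format"; the node id and checkpoint filenames are placeholders for whatever your own export actually contains, so adjust them to your setup:

```python
# Minimal sketch: re-queue the same exported workflow once per clip checkpoint,
# changing only the clip loader's "clip_name" input. Node id "4" and the
# checkpoint filenames below are placeholders, not the actual workflow values.
import copy
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"
CLIP_LOADER_NODE = "4"  # hypothetical node id from the exported API workflow

clip_checkpoints = [
    "qwen3_4b_stock.safetensors",
    "josiefied_qwen3_4b_v1.safetensors",
    "josiefied_qwen3_4b_v2.safetensors",
    "z_image_engineer.safetensors",
]

with open("workflow_api.json", "r", encoding="utf-8") as f:
    base_workflow = json.load(f)

for clip_name in clip_checkpoints:
    wf = copy.deepcopy(base_workflow)
    wf[CLIP_LOADER_NODE]["inputs"]["clip_name"] = clip_name  # only the clip changes
    payload = json.dumps({"prompt": wf}).encode("utf-8")
    req = urllib.request.Request(
        COMFY_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        print(clip_name, "->", resp.read().decode("utf-8"))
```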
This is fascinating! Would love to see more results from your tinkering
Can you show a link to where is the Josiefied Qwen3 4b V2 version to try?
https://huggingface.co/Goekdeniz-Guelmez/Josiefied-Qwen3-4B-abliterated-v2
I did a 50/50 merge of Engineer and JosieV2... testing it now. And it hasn't 'broken' the way Engineer sometimes seems to... image-wise. Haven't tested it as an LLM yet.
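For reference, a 50/50 merge like that can be done by averaging matching tensors from the two checkpoints. A minimal sketch, assuming both files come from the same base model (same keys and shapes) and with placeholder filenames:

```python
# Sketch of a straight 50/50 linear merge of two text-encoder checkpoints.
# Filenames are placeholders; assumes identical keys/shapes in both files.
import torch
from safetensors.torch import load_file, save_file

a = load_file("z_image_engineer.safetensors")
b = load_file("josiefied_qwen3_4b_v2.safetensors")

merged = {}
for key, tensor_a in a.items():
    if key in b and tensor_a.shape == b[key].shape:
        # average in float32, then cast back to the original dtype
        avg = tensor_a.to(torch.float32) * 0.5 + b[key].to(torch.float32) * 0.5
        merged[key] = avg.to(tensor_a.dtype)
    else:
        merged[key] = tensor_a  # fall back to the first model for unmatched keys

save_file(merged, "engineer_josie_v2_5050.safetensors")
```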
The Z-Engineer one feels too creative, adding too much stuff. The "explorer in his 50's" has so many details he looks about 95 and like he'd been in the water too long on top of that. Or the painting standing on the floor.
I know working as the clip is not really its purpose, and for writing prompts that creativity is likely good.
But what's interesting is how underwhelming the default model is in some examples. I didn't realize that. Given the above issue, the Josie v2 seems to be a good compromise.
Also... I looked for this among custom nodes. What would be interesting for a prompt-creation system would be a VLM trained like this, so it can read an image and make the prompt based on it, all in one go. Or see the result and make changes.
Of course the easiest way to do that is just locally; I tried it in LM Studio using a (normal) VLM, not a Z-Image specialist like yours (that's why a VLM with this training would be cool).
I actually had it judge the image result from its own prompt. Qwen3 VL (4b, 8b, 30b) was happy with it (it fits the prompt). But Gemma3 (27b) actually had constructive, in-context ideas for what details to add (to the image and the prompt).
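For anyone who wants to try the same judge-the-result loop, here's a rough sketch against LM Studio's OpenAI-compatible server (default port 1234). The model id is a placeholder for whatever VLM you have loaded:

```python
# Sketch: send a generated image plus its original prompt to a local VLM
# served by LM Studio and ask for concrete improvement suggestions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("output_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

original_prompt = "an ordinary rock sitting on a sidewalk"

response = client.chat.completions.create(
    model="qwen3-vl-8b",  # placeholder: use the model id shown in LM Studio
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"This image was generated from the prompt: '{original_prompt}'. "
                     "Does it match? Suggest concrete details to add to the prompt."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```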
Engineer2 is so much better... right now, it's often at the top of my grid compares. Will be posting more later.
The more tests I can see the better.
Just wanted to drop these here since this discussion is on-going... 😉
Example SIMPLE workflow & SIMPLE input prompt:
(VAE Decode and ConditioningZeroOut nodes are behind the image preview, doh!)
Example output prompt from Z-Engineer-V2 for the above input prompt:
Example of multiple successive generations with fixed ksampler seed, randomized Z-Engineer node seed (this triggers a new prompt to be generated from your input!), and all other settings the same:
(Prompt: "an ordinary rock sitting on a sidewalk" produced the 6 examples - for the image in the ksampler preview I added "mundane and boring 24mm f8 lens really bright day grass next to sidewalk suburan neighborhood")
Z-Engineer-V2 retains instruction following well - for instance, say you have a LoRA that requires a trigger phrase, e.g. "ROCK" - you can instruct the model in your input prompt with something like "Your response must start with the word 'ROCK' unquoted." and it will do it correctly 99% of the time! You can also append those instructions to the system prompt, but I wouldn't recommend that as it may more drastically alter the model's output behavior in unexpected ways.
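The idea is simply to append the constraint to the user prompt and leave the system prompt alone. A tiny sketch (the trigger word and exact wiring are hypothetical; plug the resulting text into whichever node or API you use to call the enhancer):

```python
# Compose the enhancer's user message with a LoRA trigger-word constraint,
# without modifying the shipped system prompt.
LORA_TRIGGER = "ROCK"  # hypothetical LoRA trigger word
user_prompt = "an ordinary rock sitting on a sidewalk"

enhancer_input = (
    f"{user_prompt}\n\n"
    f"Your response must start with the word '{LORA_TRIGGER}' unquoted."
)
print(enhancer_input)  # this is what goes to Z-Engineer-V2 as the user message
```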
Additionally, for those of us with a bit more compute/VRAM available...I have tested the Z-Engineer node with Z-Engineer-V2 system prompt with additional abliterated models to produce some stunning results... Models with advanced logic, reasoning and spatial-visual understanding seem to rise to the top!
In order of success for the larger models (and how they compare to Z-Engineer-V2 at only 4b!) here are my purely subjective rankings:
- Devstral Small 2 - amazing detailed results, great visual storytelling and seems to be the most accurate prompt enhancer/rewriter I've tested. (https://huggingface.co/AliBilge/Huihui-Devstral-Small-2-24B-Instruct-2512-abliterated)
- Z-Engineer-V2 - Punches above its weight class for sure! Hyper-creative, sometimes a bit "extra" if you give it full creative control (see "a rock" example above 🤣)
- GPT OSS 20b (https://huggingface.co/nightmedia/Huihui-gpt-oss-20b-mxfp4-abliterated-v2-qx86-hi-mlx) - Great input adherence and imagination, but it often requires a large context for thinking, adding substantial generation time (40-50 seconds on my setup). Also, some refusals here and there.
- Dark Champion 8x4b (https://huggingface.co/DavidAU/Llama-3.2-8X4B-MOE-V2-Dark-Champion-Instruct-uncensored-abliterated-21B-GGUF) - Very coherent prompts dripping with flamboyant prose. Not great for real-world/photo stuff, but for some reason it produces decent illustration/anime scene prompts.
- Rogue Creative 7b (https://huggingface.co/DavidAU/L3.2-Rogue-Creative-Instruct-Uncensored-Abliterated-7B) - Output prompt quality similar to Dark Champion, less input prompt adherence.
The system prompt seems to be slightly different from the one in the model card. Do you use the default Z-Engineer system prompt or another one? I'm just wondering which parts of the recommended one help trigger the training data, which bits trigger known Z-Image functions, and which are personal taste.
I find that using simple brackets is underrated in a sea of "descriptive natural language prompt" advice. Something like "tom, a man (age 20, tall, slim, white t-shirt, black jeans, shiny black boots) does whatever" works fine and saves tokens and confusion, especially when conveying complicated multi-person scenes (thus the names), where a comma-loaded natural language prompt can get hard for a model to follow.
Instead of THIS example (Mistral 3.1 24b):
He wears a fitted white cotton t-shirt showing subtle breathability texture, high-waisted black denim jeans with visible raw hem stitching, and shiny black leather boots that reflect fluorescent ceiling lights casting sharp directional shadows across the spacious gymnasium...
Qwen models were a bit more concise than that. Maybe this is why Flux2 is so horrible at anatomy and "complicated" stuff like... two people with hands... Mistral (its clip) just confuses it with too much fluff?
The essential "issue" is that Qwen Image, Z-Image and others run a powerful LLM, which (even in clip mode) can deal with a lot of stuff more intelligently than a T5 encoder that needs a certain structure to function well. In other words, there are few fixed rules that you HAVE to obey.
Anyway, I'm gonna have to try your LM Studio-connected node, because all the AIO LLM nodes (Qwen generation, prompt helper) make my ComfyUI UI explode after the first run.
edit: works without crashing. Great job :)
@Andyx1976
It's a tuned instruct model that was trained on synthetic conversation examples of prompt enhancement - the system prompt used in the training/dataset generation is closely aligned with the published system prompt, but I am frequently fiddling with stuff without pushing every little update or change. But I'll double check and maybe add a couple more for specific tasks. Its understanding of how to effectively enhance a prompt is triggered by either explicitly telling it to enhance a prompt or just prompting it without additional context - the system prompt just helps it stay on format.
I really have no idea why the Z-Image-Engineer model produces different results from the base abliterated model and the base instruct model when used as a clip - I haven't investigated it, and using it as a clip was only an afterthought. 4b was a fun model size to finetune locally for a first/second try, and I thought this would be a fun model. The fact that it does anything different than the base abliterated model really surprises me!
I have plans to train another LoRA, the dataset built with structured synthetic image descriptions generated LOCALLY using a VL model and tens of thousands of unpublished professional photos and generated images...stay tuned for V3!
So I think there is confusion here, and it's understandable, because the use of an LLM as a clip model has 'broken new ground' and most people don't understand the implications.
(Benny gets it already, but for others...)
Using Qwen3 as an encoder, by default, means that the text you pass in is tokenized and run through the model, and the resulting embeddings are handed to Z-Image, the image-producing model.
There is no 'thought' or chat-style LLM activity in this; it's a process of 'translation', the same as if I asked someone to translate from one language to another.
But like ALL translations: the translator influences the translation. It's not always EXACTLY the same. I use the Tao Te Ching as the example here, for clarity.
If you look at the Chinese Text of verse/chapter 1:
道可道,非常道。名可名,非常名。無名天地之始;有名萬物之母。故常無欲,以觀其妙;常有欲,以觀其徼。此兩者,同出而異名,同謂之玄。玄之又玄,衆妙之門。
if you ask Google translate:
The Tao that can be spoken of is not the eternal Tao. The name that can be named is not the eternal name. Nameless, it is the origin of Heaven and Earth; named, it is the mother of all things. Therefore, one should always be without desire, so as to observe its mystery; one should always have desire, so as to observe its manifestations. These two are the same in origin but different in name; they are both called profound. Profound and yet more profound, the gateway to all mysteries.
James Legge translated it as:
The Dao that can be trodden is not the enduring and unchanging Dao. The name that can be named is not the enduring and unchanging name. (Conceived of as) having no name, it is the Originator of heaven and earth; (conceived of as) having a name, it is the Mother of all things.
Always without desire we must be found,
If its deep mystery we would sound;
But if desire always within us be,
Its outer fringe is all that we shall see.
Under these two aspects, it is really the same; but as development takes place, it receives the different names. Together we call them the Mystery. Where the Mystery is the deepest is the gate of all that is subtle and wonderful.
Stefan Stenudd translates it as:
The Way that can be walked is not the eternal Way.
The name that can be named is not the eternal name.
The nameless is the beginning of Heaven and Earth.
The named is the mother of all things.
Therefore:
Free from desire you see the mystery.
Full of desire you see the manifestations.
These two have the same origin but differ in name.
That is the secret,
The secret of secrets,
The gate to all mysteries.
Jane English and Gia-Fu Feng:
The Tao that can be told is not the eternal Tao.
The name that can be named is not the eternal name.
The nameless is the beginning of heaven and Earth.
The named is the mother of the ten thousand things.
Ever desireless, one can see the mystery.
Ever desiring, one sees the manifestations.
These two spring from the same source but differ in name;
this appears as darkness.
Darkness within darkness.
The gate to all mystery.
Dozens more translations exist:
https://www.egreenway.com/taoism/ttclztrans3.htm
EVERY ONE OF THEM is a version of the same source.
Think of the changes an LLM undergoes during training as affecting its translation skills.
In other words, we aren't asking a question of them and getting an 'ANSWER'.
An 'Answer' is what happens when you use Z-Engineer [or Claude or ChatGPT or Qwen3VL etc] to craft a prompt.
The above is what happens WITH ANY PROMPT.
Your prompt is translated by the 'clip model' into embeddings which are sent to Z-Image, and THOSE EMBEDDINGS are the map in the 'latent space' to the exact position of your image (in a sense). Notice that EACH of the 'translations' above is similar but different. They will not lead to IDENTICAL places, only similar ones.
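If you want to see that "similar but not identical" effect in numbers, a minimal sketch is to encode the same prompt with two different Qwen3-4b-based encoders and compare pooled hidden states. The model paths are placeholders, and ComfyUI's actual conditioning path differs in detail (layer choice, padding, projection), so treat this purely as an illustration:

```python
# Sketch: encode one prompt with two fine-tunes of the same base encoder and
# measure how close the resulting (crudely pooled) embeddings are.
import torch
from transformers import AutoModel, AutoTokenizer

prompt = "an explorer in his 50s standing on a rocky shore"
model_paths = ["Qwen/Qwen3-4B", "path/to/finetuned-qwen3-4b"]  # placeholders

embeddings = []
for path in model_paths:
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # [1, seq_len, dim]
    embeddings.append(hidden.mean(dim=1).float())    # crude pooled summary

cos = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1])
print(f"cosine similarity between the two 'translations': {cos.item():.4f}")
```

A value near but not exactly 1.0 is the numerical version of "same source, different translation": the two encoders point to nearby, but not identical, spots in latent space.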





