Sorry, I'd like to know if this model is only used as a text encoder.

#2
by shadwosii - opened

Please forgive me for raising this question.

It's mostly for chatting/prompt writing/enhancement!

You CAN use it as a text encoder and it will work about as well as any other abliterated/heretic'd qwen3 4b instruct 2507 will, but that is NOT its primary purpose.

The LoRA I trained improves its ability to take whatever prompt you give it, for example "an elephant riding a bicycle in a supermarket," and embellish it in a positive/additive manner (i.e. adding details or specifying spatial relationships you left out, etc.) into a 120-180 word paragraph. Even without the additional (optional) system prompt, it will respond to you with an image prompt. The Heretic abliteration was also targeted at image-prompt-generation refusals by turning some knobs and tweaking word lists, so unlike a more radically abliterated model it might still refuse certain tasks unrelated to image generation.
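For anyone wondering what that single-turn "instruction -> seed prompt -> enhanced output" usage looks like in code, here is a rough transformers sketch. The model id, system prompt text, and generation settings are placeholders for illustration, not an official recommendation:

```python
# Minimal sketch of single-turn prompt enhancement with a Qwen3-based model.
# The model id and system prompt are placeholders; substitute the actual repo
# and the recommended system prompt from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/z-image-prompt-enhancer"  # hypothetical local path or repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    # Optional: the system prompt is not required, but can steer the style.
    {"role": "system", "content": "Expand the user's idea into a detailed image prompt."},
    {"role": "user", "content": "an elephant riding a bicycle in a supermarket"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Single turn only: the model is trained to return one enhanced paragraph (~120-180 words).
output = model.generate(inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```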

@BennyDaBall I have code to use a Qwen3 model as both the LLM and the text encoder, passing the generated tokens directly onward to Zimage so you can skip the decode/re-encode step.
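Roughly, the idea is something like the sketch below: let the LLM generate the enhanced prompt, then reuse that same model's hidden states as conditioning instead of decoding to text and re-encoding. The model id and the `run_zimage` call are hypothetical stand-ins for illustration, not the actual node code:

```python
# Hedged sketch of the "skip decode/re-encode" idea: generate with the Qwen3
# model, then hand its hidden states to the image model as conditioning.
# `run_zimage` is a hypothetical stand-in for whatever consumes the conditioning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B-Instruct-2507"  # or the fine-tuned prompt-enhancer variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Make me a sandwich at a fancy restaurant, with a beer and curly fries."}],
    add_generation_prompt=True, return_tensors="pt",
).to(llm.device)

# 1) Let the LLM write the enhanced prompt as token ids (never decoded to text).
generated = llm.generate(prompt_ids, max_new_tokens=256)

# 2) Re-run the full sequence once to get hidden states for those same tokens,
#    and hand them to the image model as conditioning embeddings.
with torch.no_grad():
    hidden = llm(generated, output_hidden_states=True).hidden_states[-1]

run_zimage(conditioning=hidden)  # hypothetical: plug into your Zimage sampling code
```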

[images: sandwichimage2, sandwichimage1]

My query:
Make me a sandwich, please. Serve it to me at a fancy restaurant, but I want a beer and curly fries with it.

As generated by Qwen3 (Josiefied Qwen3 4b), this was sent DIRECTLY in the conditioning noodle to Zimage, but is translated here (detokenized) for human understanding:

```json
{
  "scene": "A luxurious, candle-lit dining room with elegant chandeliers and plush velvet seating.",
  "sandwich": {
    "type": "artisanal turkey and avocado sandwich",
    "ingredients": [
      "thinly sliced turkey breast",
      "mature cheddar cheese",
      "sliced avocado",
      "crispy bacon bits",
      "fresh arugula",
      "diced tomatoes",
      "red onion",
      "crumbled goat cheese"
    ],
    "bread": "crusty sourdough",
    "condiments": [
      "mayonnaise",
      "lemon juice",
      "dill",
      "herb butter"
    ],
    "presentation": "served on a large, rustic wooden board with a sprig of fresh rosemary"
  },
  "beer": {
    "type": "craft IPA",
    "glass": "tall, narrow glass with a frothy head",
    "presentation": "placed beside the sandwich on the wooden board"
  },
  "curly_fries": {
    "type": "crispy curly fries",
    "presentation": "served in a small, decorative ceramic bowl with a side of melted parmesan cheese"
  },
  "setting": "The sandwich is served to a well-dressed individual in a formal dining attire, with a view of the city skyline in the background.",
  "atmosphere": "The ambiance is warm and inviting, with soft jazz music playing in the background."
}
```

I need to test this model, but have no reason to suspect it won't work similarly.

Hello, where can I find these nodes on GitHub or in ComfyUI Manager? Thanks.

@BennyDaBall I have code to use a Qwen3 model as both the LLM and the text encoder, passing the generated tokens directly onward to Zimage so you can skip the decode/re-encode step.

My query:
Make me a sandwich, please. Serve it to me at a fancy restaurant, but I want a beer and curly fries with it.

Send it back. Those fries are crinkly!

Is this something you coded? Is it a publicly available node?

Nodes are not yet released, sorry, will post when they are...
and hey, Zimage doesn't always get the order exactly right, but it's darn close. Much better than what a simple "make me a sandwich" prompt gets with the stock Zimage encoder.

I need to test this model, but have no reason to suspect it won't work similarly.

My model is super rough from abliteration and quick training on a messy dataset - don't expect any sort of multi-turn conversation with it, or for it to adhere to any sort of structured/defined output properly, except for the instruction-->seed prompt-->enhanced output single-turn format. To be honest this was my first LoRA; it was more about the process of putting it all together in a day and getting it to do the one thing at least most of the time. 😄

I love what you seem to have done! So it looks like you're using a local model to synthesize output in a coherent JSON format... Is this done by forcing structured JSON output and a system prompt alone? Pretty cool - following for that node ;)

Thank you so much for your work. I've created a simple ComfyUI node to use it for prompt expansion and refinement, and it works exceptionally well.
https://github.com/flybirdxx/ComfyUI-Prompt_Helper

[image: ComfyUI_temp_irapx_00006_]

Thank you so much for your work. I've created a simple ComfyUI node to use it for prompt expansion and refinement, and it works exceptionally well.
https://github.com/flybirdxx/ComfyUI-Prompt_Helper

Excellent job! Works really well. I might borrow some of this code!

@BennyDaBall if you retrain this... 'camera' is a bad word to Zimage... it draws a camera.
example:
A captivating black and white, cinematic still is meticulously crafted, focusing on a single subject: a woman with jet-black hair, dressed in a sleek red dress and long leather gloves, leaning over a bar. She is the primary focus, with patrons subtly blurred in the midground. The background reveals a dimly lit speakeasy, with the smoke from the bar and flickering candlelight creating a dramatic atmosphere. The scene unfolds during the roaring 1920s, with the jazz music subtly playing in the background. The camera, a Nikon Z9, is positioned at eye level, with a 24-70mm f/2.8 lens set at f/2.8, achieving a shallow depth of field. The shot is a medium close-up, maximizing the subject's expression. The lighting rig produces deep shadows and highlights, creating a vintage aesthetic. The image is rendered with clean lines and high resolution, free from artifacts.
[image: Zimage_newtrain_15-12-2025-h-m_00004]

Original prompt:
In a stunning black-and-white, noir-inspired image, the viewer finds themselves peering into a smoky speakeasy nestled within an urban jungle during the roaring 1920s. The focal point of the scene lies on the bar, where a woman with jet-black hair, dressed in a sleek red dress and long leather gloves, leans seductively over the countertop. She exudes confidence and mystery as she expertly mixes cocktails for patrons scattered around the dimly lit room. The speakeasy is filled with smoky haze and flickering candlelight, casting dramatic shadows on the worn wooden walls, antique mirrors, and vintage brass fixtures. Intricate patterns adorn the floor tiles as jazz music plays in the background. Through a glass door at the back of the room, glimpses can be seen of another world - a dancefloor bustling with couples dancing the Charleston beneath colorful streamers and twinkling lights. The viewer cannot help but feel drawn into this vibrant yet dangerous underworld that pulses with the energy of an era gone by.

[image: Zimage_w_Stock_Everything_0021]

Hello,
I absolutely appreciate the creativity introduced in the Qwen3 model - it is genuinely ingenious work. Bravo! 🫡
I have some feedback regarding my experience, a specific technical question, and a request for guidance.

  1. Question: Usage as CLIP Text Encoder
    I tried using your model as a CLIP Text Encoder in my workflow.
  • The Issue: The final images turn out "broken" or fried. Interestingly, the generation looks perfect in the early steps (1→5) but the defects appear in the later steps.
  • The Positive: Despite the broken output, the scene understanding in the CLIP text encoder is exceptional. The lighting interpretation is 10/10 ("oufff" level!). It helps ControlNet understand the complex details of crowded lines in Canny and HED preprocessors significantly better.
  • My Context: I am a jewelry retoucher/creator, so I am specifically referring to intricate details like diamonds, precious stones, prongs, and pavé settings.
  • My Question: Is it possible/intended to use this model as a CLIP text encoder? If so, do you know why the results degrade after the first few steps?
  2. Workflow Context
    I downloaded the "Prompt Helper Engineer" node (created by our wonderful friend) and ran it locally using your model. It has radically changed my workflow.
  • Before: Image Input (x2 Product + Scene) → Qwen3 VL 4b Instruct → Weak Prompt Output = Mediocre results/static scenes (even with different seeds).
  • Now: Image Input (x2 Product + Scene) → Qwen3 VL 4b Instruct → Prompt Helper Engineer (Your Model) → JSON Strong Prompt Output = Incredible results, variety, and creative/realistic scenes (even with one seed).
  3. My Request: Prompt Structure Guidance
    The reason I am specifically asking you for these prompt structures is that I noticed the genius way you write prompts that your model understands (based on your previous comments and the examples you gave). I want to ensure I am using that same logic.
    Could you please provide 3 specific prompt structure guides (system instructions) ideally in JSON format:
  • For Qwen3 VL 4b Instruct (Vision): A structure to make it precisely describe my jewelry from Image 1 (gemstone type, count, positioning, pavé type, metal type) and precisely describe the scene in Image 2 (lighting, angle, color palette, elements). Note: I know you mentioned Zimage hates camera specs, but I need it to output data that your text model can process.
  • For Your Model (Text Generation): A prompt structure ideal for generating prompts for Advertising, Ads, and Commercial Photography.
  • For Your Model (Text Generation): A prompt structure ideal for Commercial Jewelry Retouching (White background with shadow).
    Thank you so much for your work!

I tried using your model as a CLIP Text Encoder in my workflow.

  • The Issue: The final images turn out "broken" or fried. Interestingly, the generation looks perfect in the early steps (1→5) but the defects appear in the later steps.

Yes, I can confirm, sometimes (not always, just sometimes) the image from Engineer is 'crispy', especially compared to the other CLIP models I'm using.
I'll include a few of these in my sample grids in the other post.

The reason is unclear, but based on my extensive playing with Zimage itself, it's likely a shift issue (Zimage has a shift value like Flux, but by default it's mostly set to '3', which is not always correct). Add the AuraFlow shift node and you can adjust it, and I'll be releasing a Zimage-specific Autoshift (similar to a Flux one I built that I also need to release).
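For reference, this is the kind of "simple" timestep shift the Flux/AuraFlow family uses; I'm assuming Zimage follows the same formula (with shift defaulting to 3 as noted above), so treat this as a sketch rather than Zimage's exact scheduler code:

```python
# Hedged sketch: AuraFlow/Flux-style "simple" timestep shift, assuming Zimage
# uses the same scheme. A higher shift pushes more of the sampling schedule
# toward the high-noise end.
def shift_sigma(sigma: float, shift: float = 3.0) -> float:
    """Remap a sigma/timestep in [0, 1] by the given shift factor."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

# Example: with shift=3, the midpoint 0.5 maps to 0.75, i.e. the sampler
# spends more of its steps at higher noise levels.
print([round(shift_sigma(s, 3.0), 3) for s in (0.25, 0.5, 0.75)])
```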

Secondly, yesterday I discovered that 'clip skip' is alive and well with Zimage... This used to be part of the old SD/SDXL 'tricks', and there is a stock node in ComfyUI called "CLIP Set Last Layer" you can use. In the old days a clip skip of -1 or -2 was common... it turns out that with Zimage I've had good results all the way through -24 (the max in the stock node), and when I forced it past that to -30 (out of 36 layers), it was back to noise/broken, so the -24 max isn't awful anyway... Funny enough, this can also help with the crispy problem.
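To illustrate what clip skip means when an LLM is used as the text encoder: you take conditioning from an earlier hidden layer instead of the final one. In ComfyUI the stock CLIP Set Last Layer node does this for you; the snippet below is just a rough transformers illustration with placeholder names, not the node's implementation:

```python
# Rough illustration of "clip skip": use an earlier hidden layer as conditioning
# instead of the final one. Model name is a placeholder.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Qwen/Qwen3-4B-Instruct-2507"  # placeholder text-encoder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id, output_hidden_states=True)

tokens = tokenizer("a sandwich at a fancy restaurant", return_tensors="pt")
with torch.no_grad():
    hidden_states = encoder(**tokens).hidden_states  # tuple: embeddings + one entry per layer

clip_skip = -2  # -1 = final layer; the post above reports usable results down to -24
conditioning = hidden_states[clip_skip]
print(conditioning.shape)
```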

As for the rest of your post, @badraymen, you can (and should) change the system prompt to focus on your own goals. It'll take some learning to figure out how much of what you want the model can understand/do, but that's the first step... Once you identify what the model lacks, you can either add it to the system prompt if possible, OR you can learn to retrain a Qwen3 4B model yourself; there are docs on how to do it. Expecting @BennyDaBall to do that for you is not reasonable. Sorry, but true.

My node above will be released (maybe by Jan) and it will allow similar instructions, perhaps you can use that.


Thank you so much for the quick and insightful response! 🙏

I'm truly relieved to hear that you have experienced the "crispy" output sometimes as well. Knowing this issue is recognized and potentially linked to Zimage/Shift/Clipskip is fantastic.

Regarding the CLIP Encoder Issue
Shift Issue & AuraFlow: I will definitely look into the AuraFlow shift node and your upcoming Zimage-specific Autoshift. This makes a lot of sense, as the scene description is perfect, but the final texture is what breaks down.

ClipSkip / ClipSetLastLayer: This is a crucial piece of information! I hadn't considered that ClipSkip might be so influential with Zimage. I will immediately try using the standard "ClipSetLastLayer" node in ComfyUI and experiment with negative values, especially aiming for the -24 range, to see if it resolves the "crispy" problem.

Regarding the Prompt Structure Request
I completely understand your point about me adjusting the system prompt myself and not relying on you or @BennyDaBall to fine-tune the model for my niche. That is perfectly reasonable and fair.

My request wasn't to ask you to do the work, but rather to ask for guidance on the optimal structure or logic that your model best responds to (i.e., what makes it "genius" at understanding its own prompts).

However, based on your advice:

I will focus my efforts on experimenting with the existing system prompt within the Prompt Helper Engineer node to tailor it to my jewelry/commercial photography needs.

I'm also really looking forward to the release of your new node (hopefully by Jan!), as I'm sure it will offer the structural flexibility I'm looking for.

Thanks again for taking the time to explain the technical details and for the clear guidance on the system prompt strategy. This is extremely helpful!

@scruffynerf Blown away by your comparisons!

I realized the "camera" thing right after generating the dataset and testing the LoRA. 😅 I need to regenerate with an improved system prompt, concepts, modifiers, etc. - but generating useful synthetic data is expensive... I may make a V1.1 after searching for and removing all camera references from the dataset, but I am working on generating a new dataset locally.

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

can you drop a WF to work with please?

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

This is absolutely me!

[image]

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

This is absolutely me!

[image]

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

can you drop a WF to work with please?

Which node are you talking about here? Also, can you check this one?

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

This is absolutely me!

[image]

Thanks for providing the link to the custom node for this model, @shadwosi, it works amazing. Though I don't think you created it, because the author of the custom node is flybirdxx.

can you drop a WF to work with please?

Which node are you talking about here? Also, can you check this one?

[image: workflow]

I dropped my own OpenAI-endpoint-compatible node! Try it out.

https://github.com/BennyDaBall930/ComfyUI-Z-Engineer

@BennyDaBall I just installed it and tried it, but it says "You are a helpful assistant" below the prompt. Am I supposed to change it to something better? Do you have a recommended system prompt?

Edit: I ran it, and it gave a weird image output completely unrelated to my prompt:
[image]

Edit 2: I think I might have misunderstood your custom node - am I responsible for running vLLM or LM Studio to host an LLM for optimizing the prompt? I'll stick with @shadwosii's custom node, which works without my having to do any setup.

@sk0d My node is for people already running a separate instance hosting a local LLM. If you're already running a local LLM server, like LM Studio or Ollama, simply put in your base URL:port and let it rip! You can use this to access an LLM on localhost or another local endpoint on your network.

The recommended system prompt for Z-Engineer is located in the files section of this repo and on the repo main page - I mention it on the github. You can "bring your own" system prompt and local model, or use the Z-Image-Engineer-V2 model and system prompt that has been trained specifically for prompt enhancement.
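If it helps, a plain OpenAI-compatible request to a local server looks roughly like the sketch below. The port, model name, and system prompt text are placeholders; use your own LM Studio/Ollama endpoint and the recommended system prompt from the files section:

```python
# Hedged sketch of the kind of OpenAI-compatible request a node like this makes.
# Base URL/port, model name, and system prompt are placeholders.
import requests

BASE_URL = "http://localhost:1234/v1"  # e.g. LM Studio's default server port

payload = {
    "model": "z-image-engineer-v2",  # whatever name your local server exposes
    "messages": [
        {"role": "system", "content": "You rewrite short ideas into detailed image prompts."},
        {"role": "user", "content": "an elephant riding a bicycle in a supermarket"},
    ],
    "temperature": 0.7,
    "max_tokens": 300,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
enhanced_prompt = resp.json()["choices"][0]["message"]["content"]
print(enhanced_prompt)  # feed this into your Zimage text-encode node
```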

I could have put in a built-in model loader, but I wanted it to be ecosystem- and device-agnostic (i.e. use it on PC (AMD or Nvidia), Mac, or Linux) and lightweight. Right now, for example, I am running the Z-Engineer model in LM Studio on my Mac in MLX format, serving it on my local network, and generating in ComfyUI on my PC with no additional resources being used.

I thought about a combined loader/clip but there isn't really a point, and this way there are MORE variables and knobs to tweak to create different variations.

EDIT: Do you have examples of it producing unrelated images? The GGUF quants are a work in progress and haven't been fully tested by me, so if you're using one of those I recommend the full-weight V2 located in the files section, MLX if on Mac, or the Q8 GGUF (tested).

@BennyDaBall I get it - I thought your node was just the Qwen3 Z-Engineer, and I found out it's a flexible node.

I was using it wrong and didn't have any LLM running on my machine to point it to; that's why it gave the unrelated image output, so it was my mistake. I have been using the other guy's custom node and the F16 GGUF.
