Still... Qwen3.5? But in a different context

#142
by sorryhyun - opened

Hi, I want to open this 'request for integrating Qwen3.5/3.6' as a separate thread from #67 because I think it's a different question. #67 asked whether swapping the text encoder improves current image quality, and tdrussell's experiment answered that fairly conclusively: ~95% recovery on Qwen3.5-2B, no clear win on loss or prompt comprehension, ~2 more weeks of training to fully recover artist/character knowledge. That assessment seems right.

But I think the situation has changed: the Z-Image team recently published the code and paper for their full continuous-tuning recipe for step-distilled models. Anima Turbo (CFG 1, 8-12 steps) is exactly the setting they target. Their Table 1 shows that vanilla SFT on a step-distilled model drops Quality-S from ~3.51 to ~2.42, which is roughly what people in the community have been observing when training LoRAs directly on Turbo. D-OPSD avoids that collapse by self-distilling on the student's own roll-outs, but the method requires the encoder to jointly encode (text, target image) so the teacher branch has a stronger signal than the student. Qwen3 0.6B can't do that.

(screenshot: Table 1 from the Z-Image paper)
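
For anyone skimming the paper, here's how I understand that asymmetry, as a rough PyTorch-style sketch. All names (`dopsd_step`, `mm_encoder`, `teacher_ema`, etc.) are my own placeholders, not the Z-Image team's released code; I'm only trying to illustrate why a text-only encoder can't feed the teacher branch:

```python
# Hypothetical sketch of the self-distillation idea as I read it from the
# paper's description -- NOT the actual Z-Image code. The point is the
# asymmetry: the teacher branch conditions on (text, target image) jointly,
# while the student conditions on text alone, as it will at inference.
import torch
import torch.nn.functional as F

def dopsd_step(student, teacher_ema, mm_encoder, text_tokens, image, sigmas):
    # Student conditioning: text only (what the deployed model sees).
    cond_text = mm_encoder(text_tokens)
    # Teacher conditioning: the multimodal encoder jointly encodes the
    # prompt AND the target image -- this is the step a text-only
    # encoder like Qwen3 0.6B cannot perform.
    cond_joint = mm_encoder(text_tokens, image=image)

    # Roll the student out from noise for a few distilled steps
    # (CFG 1, 8-12 steps in the Anima Turbo setting).
    with torch.no_grad():
        x = torch.randn_like(image)
        for sigma in sigmas:
            x = x - sigma * student(x, sigma, cond_text)

    # Re-noise the student's OWN sample and match the teacher's
    # prediction on it, instead of regressing onto ground-truth data
    # (which is what collapses a step-distilled model under vanilla SFT).
    sigma = sigmas[torch.randint(len(sigmas), (1,)).item()]
    noisy = x + sigma * torch.randn_like(x)
    with torch.no_grad():
        target = teacher_ema(noisy, sigma, cond_joint)
    pred = student(noisy, sigma, cond_text)
    return F.mse_loss(pred, target)
```

If the conditioning encoder can't see the target image, `cond_joint` degenerates into `cond_text`, the teacher has no advantage over the student, and (as I read it) the whole mechanism loses its teaching signal.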

And the authors do actually mention in the paper that the collapse could be alleviated by integrating a native multimodal LLM, rather than re-weighting Qwen3 to Qwen3-VL, so I read that as a fairly strong hint.

Not asking for an immediate decision, especially given how much work the 2B experiment already cost. Mainly wanted to put this framing on the record as a different question from #67, in case it's useful for any future direction-setting.

I'm not simply criticizing! It's actually an interesting topic. Considering the model size, the 0.8B variant might seem like the better choice, but even then, the recovery rate was measured at a time when training was far less advanced than it is now. Recovering to that level again, or reaching the current Preview 3 standard, would likely require significantly more training. On top of that, given that the training pipeline, LoRA setups, and the surrounding tooling have already been considerably developed and solidified, I'm not sure whether restructuring the architecture to accommodate a distillation-based approach would be a worthwhile trade-off.

@nagarago I agree. Moreover, Qwen3.5 is natively multimodal, so its behavior seems quite different from the Qwen3 variants, and I suspect that's the key barrier. But integrating image representations into the current cross-embedding or text-encoder side isn't easy either, so there's a clear trade-off either way, I guess.
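
To make that trade-off concrete, here's roughly what wiring image representations into the text-encoder side would mean. Again, purely hypothetical names and shapes (`JointConditioner`, `vision_tower`, `proj`), not anything from the actual codebase:

```python
# Minimal sketch, assuming a frozen text encoder plus a separate vision
# tower: project image tokens into the text embedding space and
# concatenate them into the conditioning sequence.
import torch
import torch.nn as nn

class JointConditioner(nn.Module):
    def __init__(self, text_encoder, vision_tower, vis_dim, txt_dim):
        super().__init__()
        self.text_encoder = text_encoder   # e.g. a Qwen3-class LM
        self.vision_tower = vision_tower   # e.g. a ViT feature extractor
        self.proj = nn.Linear(vis_dim, txt_dim)  # image tokens -> text space

    def forward(self, text_tokens, image=None):
        txt = self.text_encoder(text_tokens)       # (B, T, txt_dim)
        if image is None:
            return txt                             # text-only path unchanged
        vis = self.proj(self.vision_tower(image))  # (B, I, txt_dim)
        return torch.cat([txt, vis], dim=1)        # (B, T+I, txt_dim)
```

The text-only path stays untouched at inference, but everything downstream (cross-attention shapes, cached conditioning embeddings, existing LoRA targets) would have to handle the longer mixed sequence during training, which is exactly the restructuring cost mentioned above.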
