I think it's a dud
It's nice having all the models in one file, but I think this should have simply been done with WAN 2.1 alone or WAN 2.2 alone, because the WAN 2.2 parts don't show through. It's basically just a mangled WAN 2.1. You can see in comparisons that a lot of things in a scene get garbled, and movement is generally stunted compared to WAN 2.1, and WAN 2.1 is already stunted in comparison to WAN 2.2. So there's simply nothing WAN 2.2 about this model. Maybe a merge without WAN 2.1 would be better.
Maybe you merged 2.1 in because the Light X2V LoRA isn't tuned for WAN 2.2, but the fix is to raise the LoRA strength to something like 1.64 or 1.688 (that's what I found in my testing). It still isn't perfect, but it fixes a lot of the issues you get when running the Light X2V LoRA at a strength of 1. Outside of that, most other LoRAs work fine with real WAN 2.2, especially at a strength of 0.9. Also, there's a Rank256 Light X2V LoRA out there, which does improve image quality. It's noticeably better than the default Rank64 version. Also, I think PUSA might be a little detrimental: it tends to alter things like faces just enough that it doesn't even look like the same person anymore, so it may hurt more than it helps. It's really not needed to add motion to WAN 2.2 anyway, because WAN 2.2 is all about dynamic motion already.
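To make the "strength" number concrete, here's a minimal sketch of what a LoRA strength does to a single layer, assuming the usual formulation where the LoRA adds a low-rank delta scaled by the strength. The function names and toy matrices are made up for illustration, and real loaders usually also fold in an alpha/rank scale that I'm leaving out. This is not how any particular WAN workflow implements it.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, up, down, strength):
    """Return weight + strength * (up @ down) without modifying the inputs."""
    delta = matmul(up, down)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 LoRA (hypothetical values, not real weights).
weight = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]    # shape 2x1
down = [[0.5, -0.5]]   # shape 1x2

default = apply_lora(weight, up, down, 1.0)    # strength of 1
boosted = apply_lora(weight, up, down, 1.688)  # strength from my testing

print(default)
print(boosted)
```

The only point is that the strength linearly scales how far the LoRA pushes the base weights, so 1.688 applies roughly 69% more of the same delta than 1.0 does.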
But after extensive testing, I just don't see it as a successor to WAN 2.1, let alone WAN 2.2. Try focusing purely on merging the 2.2 high and low noise models, with the LoRAs tuned to try to fix issues.
I've been playing with mixing parameters today, trying to get something usable while pulling in more WAN 2.2 features. It is very sensitive. I am using the Lightx2v Rank256 already. Overall, I do think what's available now is a bit better than WAN 2.1 and definitely easier to use as an "all in one", so imho nowhere near a "dud". However, it doesn't have as much WAN 2.2 as I was hoping for, and I hope to do better.
I find using sa_solver/beta AND prompt weighting helps movement, which is worth experimenting with.
Also, you can't do an "all in one" with just WAN 2.2 without some fancy merging to get it down to one model. I haven't been successful just mixing "high" and "low" WAN 2.2 yet... pulling in WAN 2.1 helps bring in a "1 model" foundation (and also better WAN 2.1 LoRA compatibility). So, lots of factors being considered here.
WAN 2.2 low noise is essentially WAN 2.1, but with more training and a slightly different scheme. That's why merging WAN 2.1 is not only futile, but detrimental. It's like if two popular SDXL models were merged, and then the result was re-merged with one of the original models used to create it. Maybe your solution for how to merge high/low is in the sampling steps. Depending on how you merge it, for example, the low noise and high noise parts may need different sampling at different stages, with, say, a dual sampling scheme in a workflow. We've already seen such workflows, where WAN 2.1 gets sampled with X sampler, the preview is noisy, but then it gets sampled further with some scheme that cleans it all up. I haven't seen a workflow use that method for some time, but it's been done. It was also used as a way to speed up generation time, so that could probably even work here to speed things up more. Keep in mind, I'm not talking about the current scheme being used in full WAN 2.2 workflows. I mean the entire video gets sampled in a rough way all the way until all the frames are done, then gets resampled again. That could play into how you decide to manage the blocks when merging, and then you could require that special sampling scheme.
If anything should be merged back into the model, it should probably be the low noise model, re-merged into a merge of the high/low noise models. But I think the answer would probably be more complex than that. The only true way to get as close as possible to WAN 2.2 quality with a merge is to do it without any additional merges (other than LoRAs).
"WAN 2.2 low noise is essentially WAN 2.1, but with more training and a slightly different scheme. That's why merging WAN 2.1 is not only futile, but detrimental."
Have you even tried merging yourself? My first attempts were just using the WAN 2.2 model and I was not getting good results until I started mixing in WAN 2.1. Keep in mind I'm also trying to work with accelerators and LORAs designed for WAN 2.1. It is more of an art than science.
Feel free to try making up some mixes of your own, it isn't very straightforward!
No, I get it, the accelerator is a problem too. I'm just trying to brainstorm so that maybe you find a solution. I know that if, for example, I run the low noise model only, the accelerator takes a strength of 1.688 before it starts to behave like WAN 2.1 instead of just completely altering the scene, but even then it's not perfect and can mess up regularly, just not as much. Any higher or lower and things get wonky. Given that, the accelerator probably needs to at least be run at a different strength for each model, and even that wouldn't be perfect. I think what people may not realize is that the acceleration motion errors start right at the first model, then get compounded/reinforced at the second model. So I would think this would also be an issue in a merge.
Keep in mind, you can get a pretty clean output just from the low noise model alone (just not perfectly coherent with the current acceleration LoRA), but of course the high noise model matters too. However, you can't really run just the high noise model on its own; it's simply too noisy. In essence, one model guides the other: one for overall base motions, the other for finer details. I think they are essentially the same model, just one with a noisier profile than the other, to try to create more dynamic shifts in the output.
If you're going to merge WAN 2.1 in, I'd probably omit the low noise model entirely. That way you get the stability and compatibility of the WAN 2.1 model, but the variety the high noise model delivers. Or I'd just merge WAN 2.1 and WAN 2.2 low noise. But again, the high noise model is, I think, where the "magic" happens in WAN 2.2. By trying to merge all three, I think you essentially run into a three-body problem.
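To spell out the combinations I mean, here's a rough sketch of a plain weighted-average merge over checkpoints, with one recipe per option above. Everything in it is an assumption for illustration: the toy "checkpoints" are dicts of flat float lists, the parameter name and the ratios are placeholders, and I don't know what merge method or per-block weighting the actual model uses.

```python
def merge_checkpoints(models, ratios):
    """Blend checkpoints as a per-parameter weighted average.

    models: {model_name: {param_name: [float, ...]}}
    ratios: {model_name: weight}; only the models named here are merged.
    Ratios are normalized, so they do not need to sum to 1.
    """
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    names = list(ratios)
    reference = models[names[0]]
    for name in names[1:]:
        if set(models[name]) != set(reference):
            raise ValueError(f"{name} has different parameter names")
    merged = {}
    for key, ref_values in reference.items():
        merged[key] = [
            sum(ratios[n] / total * models[n][key][i] for n in names)
            for i in range(len(ref_values))
        ]
    return merged


# Toy stand-ins for the three checkpoints (hypothetical names and values).
models = {
    "wan21": {"blocks.0.weight": [1.0, 2.0]},
    "wan22_high": {"blocks.0.weight": [3.0, 4.0]},
    "wan22_low": {"blocks.0.weight": [5.0, 6.0]},
}

# Placeholder ratios; the real ones would need a lot of trial and error.
recipes = {
    "wan21_plus_high": {"wan21": 0.5, "wan22_high": 0.5},
    "wan21_plus_low": {"wan21": 0.5, "wan22_low": 0.5},
    "all_three": {"wan21": 1.0, "wan22_high": 1.0, "wan22_low": 1.0},
}

results = {name: merge_checkpoints(models, r) for name, r in recipes.items()}
for name, merged in results.items():
    print(name, merged)
```

With a flat average like this, every extra model you add dilutes the others, which is part of why I'd try the two-model recipes before the three-way one.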
FYI I've gone from v2 to now v3, so I think the quality has greatly increased since this discussion.