Question about improving anatomical consistency in SFW video generation model

#51
by folkrock - opened

I finally decided to try this model instead of WAN video, even though I had a bad experience with LTX 1. Honestly, I liked it so much that I'm even ready to completely delete WAN. However, I'm currently unable to solve one particular problem.

The NSFW version generates relatively good character bodies, but these bodies are engaged in sexual acts. Even without a partner, they... well, move suggestively. However, I don't generate content that explicit. My content is simply about bringing QWEN image generations to life, and they are occasionally at an R+ rating level (similar to the rating on Civitai – for context here, R-level implies non-explicit nudity but not pornography).

The SFW model is simply an excellent tool that almost strictly follows my prompts, but it has apparently never heard of human anatomy. And if I try to animate a picture of an already nude body, the first frames are good, but only until the angle changes. Female nipples turn into birthmarks, and the pelvic area becomes as smooth as a Barbie doll's.

My current workaround is to generate several fixed angles/viewpoints and create some intermediate frames that are, so to speak, "interpolated" by the SFW model. The overall result is quite decent. Although in this case, the dialogue audio often doesn't seem to come from the character's mouth but from off-screen, which is a bit disappointing, though not critical.

I've looked through many different LoRAs on Civitai, but they either still handle female breasts poorly or cause the characters to moan and move their hips rhythmically even when clothed. So, my question is: are there more straightforward methods or techniques to improve the SFW model's understanding of female anatomy, without relying on generating multiple fixed angles?

Owner

The "furry" LORA seems to be the main culprit pushing the suggestive movements. I'm trying to work it out of my merges. General nudity is still an issue and I agree that the current LORA landscape isn't great. The best solution for now is using guide frames, particularly the last frame.

The "furry" LORA

Wow, I didn't expect there to be a furry element in this particular merge. That's amusing. However, my personal opinion is that this is quite niche content, and it would make more sense to suggest users employ a specialized Lora for generating furry content, while orienting the NSFW model towards humans. Of course, I don't know the actual demand and could be mistaken—this is just my subjective take. In any case, thank you for the answer. Your merges are truly impressive.

Owner

The "furry" LORA

Wow, I didn't expect there to be a furry element in this particular merge. That's amusing. However, my personal opinion is that this is quite niche content, and it would make more sense to suggest users employ a specialized Lora for generating furry content, while orienting the NSFW model towards humans. Of course, I don't know the actual demand and could be mistaken—this is just my subjective take. In any case, thank you for the answer. Your merges are truly impressive.

Ha, I've got nothing against furries 😀 The "furry" LORA was one of the first NSFW LORAs out for LTX2, and it covered quite a bit of motion beyond furries, which is why it was previously included.

Hmm... Well then, I guess I'm honor-bound to bring a couple of lovestruck Khajiit to life using this model! 😄

Talking about "Furry". I have huge problems with any furry references with both SFW and NSFW checkpoint - it always tries to humanize furry faces ALOT. Fully cartoonish style, or very realistic, it always makes fully human mouth and lips, and even human ears. Now its even more strange, knowing that it has furry lora.
I tried clean LTX-2 in some online services, and it worked much better with faces.

Owner

I don't think LTX2 has as much "furry" in its dataset, and I only add a moderate amount to the NSFW checkpoint. Also, I don't think the "furry" LORA is actually designed to make them speak. You can try adding more of the LORA or switch to "first to last frame" to improve consistency.


But the original LTX2, not Rapid, handles it well. Something specifically in the Rapid version ruins it. There have also been a lot of comments about bad face consistency compared to WAN; maybe it's related.

bad face consistency compared to WAN

Facial (and overall) consistency in WAN (at least in the Rapid editions) also leaves a lot to be desired. And this is where LTX wins—it allows you to input not just one or two files, but essentially as many as you want, spread them across a "timeline" (if you can call it that), and also set the strength of each keyframe's influence.

This feature seemed the most interesting to me. First, LTX 2 is much faster than WAN 14b (about 5-7 times faster on my hardware). Second, it maintains consistency better thanks to keyframes. And third, combined with Qwen Image Edit, you get full-fledged video direction: you envision how and what will move in the video and roughly determine the number of keyframes needed to realize the idea (a skill that came to me after 4-5 failed attempts; it doesn't take long to learn).

Then the workflow is as follows: for example, generate a starting frame in Qwen Image, then in Qwen Image Edit, move the camera or add/remove elements. Version 23 by Phr00t has some consistency issues, but this is fixed by adding the consistence_edit LoRA with a strength of 0.3-0.5 (better to start without it and gradually increase the weight if needed). Then spread these frames across the timeline and write a minimal prompt.

This approach rarely requires more than one video generation attempt. The resulting video turns out exactly as expected, almost frame by frame. The clips become less "unexpected"; here you can genuinely say: "I made this," it's not just the AI generating whatever it wants. There's another issue—voice consistency. Recurring characters in different videos will have different voices each time. But this is solved by voice replacement, for example, using Chatterbox or similar tools.
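For that voice replacement step, one option is the chatterbox-tts package, which can synthesize a line against a short reference clip of the character's voice. The sketch below follows its README example, so treat the exact calls as an assumption and check them against the current release:

```python
# Sketch of voice cloning with Chatterbox (based on its README example;
# verify the API against the current chatterbox-tts release).
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# audio_prompt_path is a short reference recording of the character's
# voice, reused across videos so the recurring character always sounds
# the same. The filenames here are hypothetical.
wav = model.generate(
    "The same character, the same voice, in every clip.",
    audio_prompt_path="character_ref.wav",
)
ta.save("line_01.wav", wav, model.sr)
```

Then you mux the result over the clip in your editor of choice.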

Overall, for me personally, LTX2 is a tool, and a pretty fast one at that. WAN 2.2 5b is a fast toy with limited capabilities. WAN Rapid AIO is a more advanced toy, capable of producing something, but without much control and about 5 times slower than LTX2.

@folkrock What are you using for multiple image insertion? LTXVAddGuideMulti ?


Yes, that's exactly it. I took the basic workflow posted here and modified it a bit, giving it a more organized appearance, although it ended up being even more convoluted than the original and not very suitable for learning. But if a ready-to-use tool is needed, I can share it.

[Attached: screenshot of the modified workflow]

@folkrock - thanks. I assumed that at most two inputs were usable; I never thought I could use more. How do you handle image referencing in the prompt? Are you using something like: "the scene starts exactly like in image 1, then transitions into the action depicted in image 2, smoothly changes according to image 3, and finally ends like in image 4"? Something like that?

@010O11

Not at all, this can even be harmful, and very long prompts tend to make characters literally voice the text you wrote. It looks funny, but it's not what we expect.

I conducted quite a few experiments over the weekend (my wife and kids even wondered if I was still alive, haha). But I can definitively say that using more than 6 images spread across the timeline is overkill and only makes things worse.

Now for the specifics. You input several images. I'll explain using the example of how I made this video from five pictures: https :// t . me/eroset18/6452 (remove the extra spaces).

First, I generated the starting frame in Qwen Image: an empty bed.
Then I changed the angle slightly using Qwen Image Edit (I'll abbreviate it as QIE from now on) and the CameraControl LoRA. I can share the first two frames here; the other two, unfortunately, I cannot.

Next, I "rotated the camera" further in QIE, added a character in the first pose (rummaging through a closet, back/side view). Then, in QIE, I changed the girl's pose to "covering body with hands, facing forward, angry face, mouth slightly open." And finally, I generated the last frame—"hands on hips."

After that, I spread these frames across the timeline. If you watch the video, you can roughly see where I have a frame change—I honestly don't remember exactly which frame numbers I assigned to which photo in the node.

But what interests us for each image are these two fields:

frame_idx — This is the frame number where our keyframe will be placed. The first one is always 0, the last one is your total number of frames minus 1 (you need to remember this, as the count starts from 0 here).

strength — This is how accurately the model will reproduce the frame. The default is 1, but I recommend using a value of 1 only for the first frame. For intermediate frames, it's better to set it between 0.65 and 0.9; otherwise, you might notice some flickering on that frame. It's not critical, but it is unpleasant.

I place the final keyframe on the timeline at around 90% of the total frame count. For some reason, this works better than forcing it right at the last frame. And yes, with this approach, you can lower that compression/quality reduction value I mentioned earlier. I haven't found the optimal value yet, but I've even reduced it to zero compression, and everything generated beautifully.
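To make the arithmetic concrete, here is a tiny sketch in plain Python. It is only an illustration of the rules of thumb above, not the LTXVAddGuideMulti API; the ~90% placement and the 0.8 intermediate strength come from the notes above, and the 0.9 final strength is my own starting guess:

```python
# Illustration of the keyframe placement rules above (not the node's API).
def plan_keyframes(total_frames, num_images):
    """Return one (frame_idx, strength) pair per guide image."""
    assert num_images >= 2, "need at least a first and a last frame"
    last_idx = int(total_frames * 0.9)  # final keyframe at ~90%, not the very end
    plan = []
    for i in range(num_images):
        frame_idx = round(i / (num_images - 1) * last_idx)  # spread evenly
        if i == 0:
            strength = 1.0              # pin the starting frame exactly
        elif i == num_images - 1:
            strength = 0.9              # my guess for the final frame
        else:
            strength = 0.8              # intermediates: 0.65-0.9 to avoid flicker
        plan.append((frame_idx, strength))
    return plan

# 121 total frames, 5 guide images:
# [(0, 1.0), (27, 0.8), (54, 0.8), (81, 0.8), (108, 0.9)]
print(plan_keyframes(121, 5))
```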

One more thing I wanted to add: I intentionally added motion blur to the second frame in Krita. Because if you don't do that, LTX will likely change the scene entirely instead of making a sharp camera turn like in my video. Yes, that's how it works.
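If you'd rather script that blur than open Krita, a horizontal motion-blur pass takes a few lines with Pillow. The filenames are hypothetical; Pillow's ImageFilter.Kernel tops out at 5x5, so the filter is applied a few times to widen the streak:

```python
from PIL import Image, ImageFilter

img = Image.open("keyframe_02.png").convert("RGB")  # hypothetical filename

# 5x5 horizontal motion-blur kernel: ones across the middle row, zeros
# elsewhere; scale=5 keeps the overall brightness unchanged.
kernel = [1.0 if 10 <= i <= 14 else 0.0 for i in range(25)]
blur = ImageFilter.Kernel((5, 5), kernel, scale=5)

# One pass gives a subtle streak; repeat to roughly mimic a stronger pan blur.
for _ in range(3):
    img = img.filter(blur)

img.save("keyframe_02_blurred.png")
```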

And the prompt was something like this:

A thoughtful voice off-screen says: "А сейчас мы будем делать контент" ("And now we're going to make some content"), then the camera makes a sharp turn to the right, and the woman in the frame turns around and shouts indignantly, "Почка, ты совсем конченная? Убери телефон!" ("Pochka, have you completely lost it? Put the phone away!")
(I wrote the prompt entirely in Russian, using phrases and idioms that most likely differ significantly from this translation, so don't mind any errors in English grammar.)

That's essentially all. There's no need to specify what's in which picture; the model will figure it out on its own.

[Attached: keyframe 1]

[Attached: keyframe 2]

After the post-processing in Krita:
[Attached: keyframe 2 with motion blur]

@folkrock Very generous of you, thank you. Unfortunately the Telegram link is no longer available, but your detailed description is more than enough to catch the flow. I need to take a look at 'LTXVAddGuideMulti': how to address/index the input images and what the parameters mean. That doesn't have anything to do with Phr00t's work, though, so it shouldn't be discussed here any further. Once again, thank you ;-)
