Sentinel7
/

ltxv


Sentinel7 committed on Apr 24

Commit

676e761

verified ·

1 Parent(s): 14db0fa

Upload 2340297/2632394/README.md with huggingface_hub


Files changed (1)

2340297/2632394/README.md +166 -3


@@ -1,10 +1,173 @@
 ---
 license: other
 tags:
-- civitai
 ---
 Author: [tenstrip](https://civitai.com/user/tenstrip)
-Model: [https://civitai.com/models/2340297?modelVersionId=2632394](https://civitai.com/models/2340297?modelVersionId=2632394)
-Mirror: [https://civarchive.com/models/2340297?modelVersionId=2632394](https://civarchive.com/models/2340297?modelVersionId=2632394)

 ---
 license: other
 tags:
+- action
 ---
+# LTX2 - Oral Suite - i2v - 2632394
+**Model Type**: LORA
+**Base Model**: LTXV2
+**Trigger Words**: None
+**Tags**: action
+## Gallery
+<table>
+  <tr>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/b86e0e40-bd77-428d-8dde-b0fef23c3a49/original=true/119002789.mp4" width="200" controls muted autoplay loop></video></td>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/a125c07b-492e-453e-95a4-bb9c8851aca7/original=true/119002967.mp4" width="200" controls muted autoplay loop></video></td>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/fdc04a44-7a66-4eed-ae48-bca663b956cc/original=true/118863114.mp4" width="200" controls muted autoplay loop></video></td>
+  </tr>
+  <tr>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/ddfedfa4-1997-4658-8c42-9fe0de467629/original=true/119003486.mp4" width="200" controls muted autoplay loop></video></td>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/8dd7b230-397d-4748-9029-1b7810951b63/original=true/118862999.mp4" width="200" controls muted autoplay loop></video></td>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/eb2dd379-298a-40b9-9a3b-23984e54336e/original=true/119003560.mp4" width="200" controls muted autoplay loop></video></td>
+  </tr>
+  <tr>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6e253f30-0e16-4217-950e-11b749204ff1/original=true/119003901.mp4" width="200" controls muted autoplay loop></video></td>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/4b0b6a94-1523-4e54-8a86-eb2a16deac81/original=true/119019288.mp4" width="200" controls muted autoplay loop></video></td>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/e7597e17-91e0-42cd-8720-cd2b3802767f/original=true/119335010.mp4" width="200" controls muted autoplay loop></video></td>
+  </tr>
+  <tr>
+    <td><video src="https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/9a1600e1-c644-4663-a3bc-4644403ae9a1/original=true/119591156.mp4" width="200" controls muted autoplay loop></video></td>
+  </tr>
+</table>
+## Description
+This works on LTX2.3 as well, sometimes even better. But you'll want to use LTX2LoraLoaderAdvanced and keep the strength of the *audio* and *audio\_to\_video* keys below 0.2-0.3, separate from all the other, higher motion strengths, to keep the improved 2.3 audio details and prompt.
+This is actually just a test run for repeats x clips: at what level of training do you get fine-tune-level fit to the data with very default settings? tl;dr: if you want your lora to move and look like your data, aim for 100+ repeats, probably more like 200, especially if the data is full of variance and multiple concepts. Full training spec is below. **It's by no means a training guide**, just a write-up of what I took away from it. This dataset isn't the best and I knew it would negatively impact audio, but all I wanted to see was whether you can hard-force anything into the model, and whether it could ultimately do nsfw and absorb porn videos. The answer is yes, it can.
+This is only for use on generated adult/fictional characters. It does not introduce nudity, and it can't be used for any kind of smashcut shenanigans, unless you do FirstFrame-LastFrame.
+Also understand that all my examples are insanely cherry-picked. Right now this is just a very high-effort model to use until finetunes and merges ultimately make it more stable for NSFW things like this. If you don't want to mess around for a long time to get good outputs, don't bother with this one yet; just keep an eye on the section. I'm definitely not giving any guarantees. **Feel free to post bloopers and absolute abomination outputs** if they're funny; my ego prevents me from uploading anything that isn't as good as I could get it.
+I have [this i2v workflow](https://huggingface.co/TenStrip/Workflows), based off of [this one](https://civitai.com/models/2320262?modelVersionId=2635789). It includes many small refinement tweaks to plug into different parts to get the most out of it. You can swap back to euler and use the normal custom samplers as well to get better speeds.
+**Update**: LTX added a [new guidance enhancement in their nodes](https://ltx.io/model/model-blog/ltx-2-better-control-for-real-workflows). When you use the skip\_layer feature to skip layer 29, you introduce a ton of motion from the Lora even at low strengths, depending on the STG strength. Still testing it.
+The way you'll want to use this at first, before getting used to its limits and quirks:
+### **Try 0.5 strength first**.
+This is incredibly fit to it's concept. A mid-level strength is much more consistent with outputs, and with prompt. Get an image of a closeup of a woman with an *erect penis* in her face, from the side, overhead, POV. LTX2 doesn't like vertical stuff and square or wide images work better. Then, start each prompt with just one word at first-
+*POV.*
+*blowjob.*
+*handjob.*
+*deepthroat.*
+*facial cumshot.*
+The placement of hands, mouths, penis, tongue all matter a lot to get certain motions. I did manually crop the data to have penis either in the mouth, or near it. As well as having the hands usually at either the top or bottom of the shaft. The kind of images you get from my Pony models with *1girl, POV, blowjob* at 1024x1024 are probably the kind of stuff you're looking for. The blowjob Flux edit loras may be useful too.
+If nothing happens at all, definitely adjust the LTXVPreprocess node upwards, but past 38 it becomes problematic usually. Don't bother embellishing the prompt too much until you find a good seed first or make sure your image is compatible with it at all. The dataset covers quite a few of the usual angles, but idk how much it can stretch.
+*POV, A close-up point-of-view static shot*
+*An angled close-up portrait*
+*A static side profile shot*
+*An overhead shot*
+For T2V, you need loras for penis, and then get it near a face somehow at these angles.
+After you get something decent with very simple prompting, you can try directing and creating more of a prompt around things like *kissing, sucking, opening her mouth wide, the man is thrusting into her mouth, cum runs down her face, she leans her face in to put her mouth on the top of the penis and starts giving oral to the man with a blowjob,* etc. The more prompt you add, the more confused, or better, it might get. This isn't uselessly inconsistent, but **results will vary.**
+Don't expect to get examples like mine without doing the LTX2 thing and refining and cherry-picking heavily. Until there are merges and finetunes for this, it's pretty high-effort. Adjust **Lora strength, img\_compression, distilled Lora strength, and then prompt**, in that order of how impactful they seem to be for fine-tuning the output. To fine-tune the second pass, you can change all of these for the second sampler only, as well as plug in a new audio latent and retrack the audio.
+One issue to look out for is the distinction between the lips, tongue, and end of the penis in the image; that's a Wan2.2 issue though just carried over to this now. The other issue is when you try to get dialogue from the man, it tries to bring him into-frame with terrible body proportions sometimes because of the base model's inclination to do so when prompted with "a man...". None of the data has the guy actually in it, but probably should've to give it a sense of the placement. Refer to male speakers as *an off-camera male voice says:"x"* to bypass that. This also interferes with dialogue at high strength but at around 0.5-0.6 it seems fairly controllable.
+Another big note about prompting LTX2: *girl* is a child in the model's base understanding, t2v-wise. Don't use the *girl* tag when training LTX2.
+If all else fails, you can take these **override prompts** which are straight-up captions that will auto-guide it or copy my example image prompts as well and edit these to fit any idea.
+* Handjob. A close-up static focus on a very cute asian woman's face as she is lying down on her back. She strokes and grips a man's penis in front of her as his large penis is pointed towards her face. She opens her mouth and sticks her tongue out expectantly while she strokes the penis with her hand sliding up and down the penis shaft very rapidly during a fast handjob. In a very cute voice she makes sexual gasps and naughty giggles and sexual moans in her cute voice while the man also groans in pleasure in his deeper voice.
+* POV deepthroat. A close-up point-of-view shot focused on a woman's face as she is between a man's open legs with his erect penis inside her open mouth. She keeps eye contact as she chokes it down and is rapidly bobbing her head up and down on to the penis as it pushes into her mouth and throat deeper making gagging noises while she grips the bottom of the penis with her fingers.
+* POV facial cumshot. A point-of-view static shot directed at a woman on the ground below the camera and positioned in front of a man's crotch with her mouth wide open and tongue out. The man strokes his own penis with his hand with the tip pointed at her face and mouth while she sticks her tongue out and eagerly awaits his load. He breaths heavily while stroking his penis, then he cums: a thick white gooey strand of cum shoots out from the tip and penis towards her face landing on the center of her face making her close her eyes as following squirts land in the same area covering the center of her face in a gooey mess as the last shots of cum land in her open mouth and on her tongue. The man exhales in relief and pleasure as she does cute sexual moans.
+* POV blowjob: An overhead close-up point-of-view static shot of a young woman who is in front of a man's crotch kneeling on the ground with his large penis in her mouth and lips sealed tightly around it. She is crotch-level and positioned in front of him with his huge erect penis in her mouth with her lips pressed tightly around it. She is bobbing her head forward and back giving oral to the man with a blowjob while she moans sexually, inhales, and makes wet throat sounds while she keeps eye contact with the camera.
+---
+For trainers:
+This is i2v, but still probably somewhat relevant for t2v, maybe the captioning part. If none of this makes any sense at all or you haven't trained anything before and want to train LTX2, start off very small and with a very focused dataset. This is a **very hard** model to train for, *at the moment*.
+ai-toolkit runpod template
+30k steps, 25 fps. Square 512 crops with mostly close-ups and some upper body. Static cameras that leave the composition in mostly the same state (good for setting up longer generations).
+Not enough coverage of the actual body was done, especially the male body. You really need to show almost everything to the model, as if it's completely inept at depicting these things. It puts the guy's face on his chest when he's talking, due to denoise restraints and because it really wants talkers to be in the scene. Caption off-camera speakers accordingly, as narrators.
+* **145frame dataset (main with blowjobs/handjobs)**
+134 clips, cropped straight onto the action or upper body. Many are just horizontal flips of data that favored one side. x1 repeat.
+This set was a pretty bad and sloppy mess of clips of BJs and a bunch of random stuff from multiple angles. It took much longer to come through and even with the repeats and high rank it'll still mix up things from angles that it really shouldn't sometimes.
+* **121frame dataset (all facials)**
+19 clips x7 repeats to give it ~40% of the weight.
+This set was all closeups from similar angles of the same thing and importantly, very **similar pacing**. A close up of a penis aimed at her face, and within about 1-2 seconds, jizz shooting out all over it. Because of how uniform that was this set trained way faster and I got results from it at only ~12k steps.
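As a rough check on the weighting above, the share each set gets can be worked out from the clip counts and repeats. This is a minimal sketch, not the trainer's actual sampling logic: it assumes every clip is drawn once per repeat per epoch, and `dataset_share` and the set names are hypothetical. Counted per sample the 121-frame set lands near half of an epoch, and nearer 45% when weighted by frame count, so the ~40% quoted above presumably comes from a different way of counting.

```python
def dataset_share(sets, by_frames=False):
    """Return each set's fraction of one epoch.

    `sets` maps a name to (clips, repeats, frames_per_clip).
    Assumption: each clip is sampled once per repeat per epoch.
    """
    weights = {}
    for name, (clips, repeats, frames) in sets.items():
        samples = clips * repeats
        # Optionally weight each sample by its length in frames.
        weights[name] = samples * frames if by_frames else samples
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Clip counts, repeats, and frame lengths as stated in the write-up.
sets = {
    "set_145f": (134, 1, 145),
    "set_121f": (19, 7, 121),
}

by_samples = dataset_share(sets)
by_frames = dataset_share(sets, by_frames=True)

for name in sets:
    print(f"{name}: {by_samples[name]:.1%} of samples, {by_frames[name]:.1%} of frames")
```

Changing the repeat count in `sets` shows how quickly a small set can dominate an epoch.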
+**Un-quantized encoder and transformer**
+Not fp8. Useless as a data point because I have no comparison. Slow. No idea if this mattered at all. Also at rank 64, because I wanted max fitment and a baseline; what you can expect from different ranks is still really mysterious. This increases VRAM usage by a ton, barely sliding 145 frames at 512 in with 93.5/95 GB - 97.8% card usage. Probably don't do this.
+**No Differential Guidance, balanced weighted timestep**
+I got irrationally suspicious of Differential Guidance, but tbh, just use it. This probably took longer to train because it was off. But then again, maybe this is what made this work, so idk exactly yet.
+If there are timestep settings that fit faster, use them. I see now that training for this model needs to be turbocharged in any way possible when you train brand new concepts or foreign motions into it, because 200 repeats to get this kind of fit is just unrealistic for consumer-grade hardware.
+**How it was captioned:**
+Manually. You shouldn't have to do this. You're actually better off with really lazy tagging, and instead just make your prompts more controllable and unique if you train to fit it this much. A lot of my captioning overlaps too much with different motions and angles.
+The first part of every caption was the slang tag, acting like a keyword. That works, and definitely steers the prompt in the right direction right away on its own.
+The rest of each caption was then an angle and composition description of the first shot. This was still assuming the model would maybe learn better if I trained it by explaining what it was seeing in language it understood. It didn't really matter, because if you train long enough you end up brute-forcing anything in regardless of captioning. All you want from captions is controllable prompting.
+The angles and compositions were made by **generating 512x512 outputs in T2V** with the base fp8 model: I took the description/caption that consistently gave a T2V output very similar to the dataset clip I was describing, e.g.
+*An angled static shot of a naked man lying down on his back that looks down his torso at his open legs where a naked dark-skinned Indian woman with long braided pigtails is sitting between his legs and strokes his erect penis while it's pointing straight up.*
+Obviously there was no penis, but instead a long fleshy cylinder/arm looking thing, but the rest lined up with a similar angle and composition to the data clip.
+Then, the rest of the caption was a freestyled mix of sentence terms I came up with to copy/paste on different clips like:
+*Her stroking makes soft wet noises, She strokes the penis with her hand sliding up and down the penis shaft very rapidly during a fast handjob, The man also groans in pleasure in his deeper voice.*
+The certain motions like this were also tested through T2V to see if the language had adverse qualities or pulled problematic association, and then to see if, hopefully, it's actually understood by the model. The thinking was that training is obviously helped along when it's making sense of the caption. Phrases like *she leans her face in to put her mouth on the top of the penis* were informed this way because the base model will actually do a similar motion with that prompt.
+It may be the case that auto-tagged captioning is worse: if it's not communicating the same meanings as what LTX2 understands, that can certainly get in the way. Just take one of your auto captions and run it through T2V at the same dimensions. If the output isn't even close, that could be an issue if you don't want to train all the way to this level of fit. But then again, sometimes this model seems to ignore prompts a lot anyway.
+This is definitely not a good example of how to train for LTX2, but it is evidence against quite a few misconceptions, such as NSFW not working at all, or just how steep the model is to train for. Idk how many 30k-step loras are around, but this is one. 38 hours on a 6000 too. This model can learn insanely slowly; I'd recommend everyone try quadrupling their rates or finding a much quicker-fitting setup. I haven't tried that myself yet, but will be doing so. Needing 30k steps, ~200 repeats on average, to get it working at a lower usage strength is kind of insane. Test results were pretty bad and inconsistent until ~100-125 repeats.
+With this done, and seeing a lot of others posting bad nsfw results: I feel like you'll always get bad results from using a lora like this at too high a strength. **You probably want the loras to be working around 0.5-0.6**; at that strength they seem to be a lot more respectful of the prompt, more sensible with what happens in the output, and have less of the jank and adverse stuff. If you're not getting any of your concept in that strength range, your lora is probably undertrained. So, long story short, you really do have to train the shit out of this model, especially because right now it usually has no idea what it's looking at with NSFW motions.
+Take this with a grain of salt, but the voices are telling me the current training code seems to be completely oriented towards a **single** 'style' or 'character'.
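The "200 repeats on average" figure is consistent with dividing the step count by the number of unique clips in the two sets (134 + 19). A minimal sketch of that arithmetic, assuming batch size 1 (the batch size is not stated in the write-up); `average_repeats` and `steps_for_repeats` are hypothetical helpers, not part of any trainer:

```python
import math


def average_repeats(total_steps, unique_clips, batch_size=1):
    """Average number of times each unique clip is seen during training."""
    return total_steps * batch_size / unique_clips


def steps_for_repeats(target_repeats, unique_clips, batch_size=1):
    """Steps needed for every clip to be seen `target_repeats` times on average."""
    return math.ceil(target_repeats * unique_clips / batch_size)


unique_clips = 134 + 19  # both datasets from the training spec

avg = average_repeats(30_000, unique_clips)
print(f"30k steps over {unique_clips} clips: ~{avg:.0f} repeats per clip")

for target in (100, 125, 200):
    print(f"{target} repeats -> {steps_for_repeats(target, unique_clips)} steps")
```

Under the same assumption, a concise set of about 30 clips would reach 200 average repeats in roughly 6k steps, which is why a small focused dataset is so much cheaper to fit.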
+That's nice and cute because I guess that's all they want the model to be trained for. These multi-angle multi-action nsfw attempts I do and others try end up being amalgamated into just all the data being thrown on to every output usually because it has no baseline to make sense of what a woman's face covered in cum does with a penis in her mouth. Wan2.2 i2v was much better with having loras that handle multiple things at once sensibly especially when it doesn't have a T2V model underneath it trying to turn the output into a spongebob cartoon because you used the word "ocean" or something. It also had much more of a baseline with human interactions and movements to smooth things out. There are definitely optimizations there that could be done to help train multi-concept loras with different voices and varied data better. You'll get way better results with **a good dataset focused on one thing or action that is extremely concise** under ~30 clips so you can really let just cook in and aim for ~100-200 repeats depending on what it is.
+Test the lora in the 0.4-0.7 range and if nothing's happening, keep training. Try different tagging. Sometimes you need to increase the img\_compression to free it up if you're doing i2v during testing. **Don't test it at 1.0, you'll never get real or good honest results.** Even if it does work at 1.0, and certainly beyond 1.0, it becomes useless and less pliable to prompt changes, disturbs the normal flow of the model, and just isn't that consistent, so don't aim for that.
+---
 Author: [tenstrip](https://civitai.com/user/tenstrip)
+Model: [CivitAI Model Page](https://civitai.com/models/2340297?modelVersionId=2632394)
+Archive: [CivArchive Page](https://civarchive.com/models/2340297?modelVersionId=2632394)
+<!-- Version: 20260424_all_update -->