Long video generation drift

#132

by timtianyang - opened 16 days ago

I think it's expected that using the last frame of the previous video chunk to generate the next chunk will accumulate errors over time, and indeed the long video generation (with audio) shows that. What do you think of using the same reference image for all chunks? For my talking-head use case it's just a static scene with the same character. Although that kind of degrades into first-last frame with a loop workflow...

Or do you have any thoughts on how to improve that? Appreciate it.

RuneXX

Owner 16 days ago

Yes, you are correct: using the last frame over and over will accumulate errors, color drift, etc.

I think the single pass workflow does what you said, using the same ref image all the way.
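
To make the difference concrete, here is a toy simulation of the two conditioning strategies. It is only a sketch under simple assumptions: each chunk is reduced to a single number, every generation step adds a small constant bias (think of a slight color shift) plus random noise, and `simulate_drift` and its parameters are hypothetical names, not part of any workflow here.

```python
import random

def simulate_drift(num_chunks, noise=0.02, bias=0.0, chain_last_frame=True, seed=0):
    """Toy model: each chunk reproduces its conditioning value plus a small error."""
    rng = random.Random(seed)
    reference = 0.0            # stands in for the look of the reference image
    conditioning = reference   # what the current chunk is conditioned on
    errors = []
    for _ in range(num_chunks):
        # The last frame of this chunk deviates slightly from its conditioning.
        last_frame = conditioning + bias + rng.gauss(0.0, noise)
        errors.append(abs(last_frame - reference))
        # Either chain from the last frame or go back to the fixed reference.
        conditioning = last_frame if chain_last_frame else reference
    return errors

chained = simulate_drift(50, bias=0.01, chain_last_frame=True)
anchored = simulate_drift(50, bias=0.01, chain_last_frame=False)
print(f"chained  final error: {chained[-1]:.3f}")
print(f"anchored final error: {anchored[-1]:.3f}")
```

In this model the chained error is a running sum of all previous per-chunk errors, while the anchored error only ever contains the error of the current chunk. Real drift is of course more complicated than a single scalar.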

Actually, I've been meaning to take a look at those workflows again and add some options (like using the same ref image if it's not already there, or saving a file per segment and merging the files at the end instead of accumulating frames, which can potentially work better with lower RAM, etc.).
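
As a rough idea of what "save a file per segment and merge at the end" could look like outside the workflow, here is a small sketch using ffmpeg's concat demuxer. The segment file names and the helper names `concat_list_text` and `build_merge_command` are made up for illustration; stream copy (`-c copy`) also assumes all segments were encoded with the same codec and parameters.

```python
def concat_list_text(segment_paths):
    """Return the contents of an ffmpeg concat-demuxer list file."""
    lines = []
    for path in segment_paths:
        # Single quotes inside a quoted path are escaped as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"

def build_merge_command(list_path, output_path):
    """Build the ffmpeg call that joins the listed segments without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",   # use the concat demuxer
        "-safe", "0",     # allow arbitrary paths in the list file
        "-i", str(list_path),
        "-c", "copy",     # copy streams instead of re-encoding
        str(output_path),
    ]

# Hypothetical segment names, one file written per generated chunk.
segments = [f"segment_{i:03d}.mp4" for i in range(3)]
print(concat_list_text(segments), end="")
print(" ".join(build_merge_command("segments.txt", "long_video.mp4")))
```

To actually merge, the list text would be written to `segments.txt` and the command passed to `subprocess.run(cmd, check=True)`. Only one segment needs to be held in memory at a time, which is where the RAM saving would come from.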

And there are also new things that might benefit the long video ones, at least optionally, like Prompt Relay.

Will take a look asap ;-)

timtianyang

15 days ago

edited 15 days ago

Thanks for all the work. What are your thoughts on using the distilled model with higher steps vs. the official two-stage pipeline (dev model for 20 steps, then distill lora for 4)? Is there any reason you prefer the former in your workflow aside from speed?

I noticed that for a character at close range, some fine details in the eyes or teeth are a bit hit or miss. Going through RTX super resolution doesn't help, and it probably needs something generative like seedvr to reconstruct things properly. Is that something you use for post-processing?

[Image: post rtx]

RuneXX

Owner 15 days ago

"What's your thoughts on using distilled model with higher steps, vs. the official two stage pipeline (dev model for 20 steps then distill lora for 4)"

The workflows here by default have the standard LTX recommended distilled 2-stage setup.
(8 step manual sigma in stage 1, 4 step manual sigma in stage 2, euler_cfg)

But there are definitely times where you might want to bump up the steps more, for example with more complex motion, or when you are not getting the details you want.
So under the manual sigma node in most (if not all) workflows here, there is an alternative sigma node where you can manually set the steps you want/need.
I often use that myself, bumping the steps up to 10-12 or so. It's still quite fast.

The 2-stage pipeline with 20 steps is for the Dev model (although in practice that can often need more steps, in the 30-40 range).

It's a bit of a user choice. Everything works in the workflows here. You can easily choose to load the Dev model, and right below the model loader there is a lora loader that you can activate with the distilled lora.
And under the manual sigmas you will find alternative sigmas with more steps that you can connect to the sampler instead. The CFG node is already there to set a higher CFG (for the dev model).

So it's made to be a bit flexible, and to let the user choose ;-)

timtianyang

15 days ago

"The workflows here by default has the standard LTX recommended distilled 2-stage setup"

This is odd. Looking at the I2V_T2V_long_video_custom_audio workflow, it seems there is only one sampler for the initial frames, and another sampler in the extend video subgraph. If I understand correctly each video chunk is only sampled once. I was looking for where the spatial upscaler was actually used...

I tried to adapt the long video loop to do:
1st pass: dev model, 20 steps at half resolution to make it fast and extendable to 40 steps.
2nd pass: dev model + distill lora + detailer, 4 steps, and spatial upscale back 2x. With the distill lora this is also pretty fast.
Then feed the initial images to the video extension subgraph.
Inside the subgraph, scale the input images back down so we again operate at half res (I'm not sure if this is correct), then do the same 2-pass workflow inside to eventually upscale back 2x.

I feel like I screwed up something. The video has very high visual quality but it's just hallucinating bamboo somehow LOL.

At the same time, there is this awkward model initialization wait as I'm switching between the dev model and dev with the lora added on, and back to dev. But this is the trade-off for higher quality than using the distilled model alone. I wasn't able to get to the same visual quality with the distilled model regardless of how many steps I used.

RuneXX

Owner 15 days ago

It's been a while since I looked at those workflows. Might be time to revisit them and see if things learned since can be applied to make them even better ;-)

The long video workflows come in 2 variants, if I recall correctly: one that is single pass (one sampler), and another that is 2-pass (2 samplers).

The long video workflow that works best is the one-sampler one, though. This is because processing the frames at low res, then upscaling and processing them again, leads to quite some changes.
So when stitching the result with the previous video, it is quite apparent where the stitch is. To remedy this, the other workflows (that use 2 samplers) blend the 2 videos with an overlap, where it crossfades.
This makes the stitch hard to see. Single pass is more seamless, though.
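
For reference, the overlap crossfade idea can be sketched like this. It is a simplified illustration, not the actual node logic in the workflows: frames are plain lists of pixel values, the blend is a linear ramp, and `crossfade_join` is a hypothetical helper name.

```python
def crossfade_join(prev_frames, next_frames, overlap):
    """Join two clips whose last/first `overlap` frames show the same moment."""
    if overlap <= 0:
        return prev_frames + next_frames
    if overlap > min(len(prev_frames), len(next_frames)):
        raise ValueError("overlap is longer than one of the clips")

    blended = []
    start = len(prev_frames) - overlap
    for i in range(overlap):
        # Weight of the new clip ramps up linearly across the overlap.
        w = (i + 1) / (overlap + 1)
        old = prev_frames[start + i]
        new = next_frames[i]
        blended.append([(1.0 - w) * a + w * b for a, b in zip(old, new)])

    # Non-overlapping part of the old clip, the blend, then the rest of the new clip.
    return prev_frames[:start] + blended + next_frames[overlap:]

# Two tiny one-pixel "clips": the old one is black, the new one is white.
old_clip = [[0.0] for _ in range(4)]
new_clip = [[1.0] for _ in range(4)]
joined = crossfade_join(old_clip, new_clip, overlap=3)
print([frame[0] for frame in joined])
```

The crossfade hides the seam by averaging the two versions of the overlapping frames, but it does not make them agree, so strong differences between the two passes can still show up as a brief ghosting or softening.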

Your logic seems correct; not sure where the bamboo comes from. Maybe the wrong sampler. Never seen that before.
But I will take a look at those workflows asap, they might need some love ;-)
