🎬 WanVideo-LowVRAM-TwinEngine-Trainer (v5.7 Stable)

A universal low‑VRAM LoRA trainer for Wan2.1/2.2 — 12GB GPU, full DiT + UMT5 LoRA, zero OOM.

⚠️ This project is an Image-to-Video (I2V) LoRA Trainer for Wan2.1 / Wan2.2. It is NOT a Text-to-Video (T2V) trainer.

This tool trains LoRA using input images or video frames, and outputs LoRA models specifically optimized for Wan2.1/2.2 I2V pipelines.

⚠️ Note: this project is an Image-to-Video (I2V) LoRA trainer for Wan2.1 / Wan2.2. It is not intended for Text-to-Video (T2V).

The inputs are images or video frames,
and the trainer produces LoRA models dedicated to the Wan2.1/2.2 I2V pipeline.


🚀 v5.7.7 TrueFinal — Stable Release

This is the final and fully‑stabilized release of the Twin‑Engine Trainer v5.7 series.
The 5.7 branch received rapid updates due to extensive internal improvements,
but v5.7.7 TrueFinal completes all bug fixes, stability patches, and structural corrections.

This version is recommended for all users.

✔ Rank‑Based Alpha (Correct Scaling per Layer Type)

  • Attention Layers → α = 32
  • FFN Layers → α = 8
  • Text Encoder Layers → α = 16

Ensures proper gradient balance and consistent LoRA behavior.
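The per-type alpha values above can be pictured as a small lookup. This is a minimal sketch, not the trainer's actual code: the name patterns in `classify_layer` are hypothetical, and the `alpha / rank` scaling is the conventional LoRA formula, assumed here rather than confirmed by this README.

```python
# Alpha values taken from the list above.
ALPHA_BY_TYPE = {"attention": 32.0, "ffn": 8.0, "text_encoder": 16.0}


def classify_layer(name: str) -> str:
    """Guess the layer type from a module name (hypothetical naming scheme)."""
    lowered = name.lower()
    # Check the text encoder first, because it also contains attention layers.
    if "text_encoder" in lowered:
        return "text_encoder"
    if "attn" in lowered:
        return "attention"
    if "ffn" in lowered:
        return "ffn"
    raise ValueError(f"unrecognized layer name: {name}")


def lora_scale(name: str, rank: int) -> float:
    """Conventional LoRA scaling factor alpha / rank for the given layer."""
    return ALPHA_BY_TYPE[classify_layer(name)] / rank


print(lora_scale("blocks.0.self_attn.q", rank=32))  # 1.0
print(lora_scale("blocks.0.ffn.0", rank=32))        # 0.25
```

With a fixed alpha per layer type, the effective scale of each LoRA update depends on the rank chosen for that run.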

✔ Adaptive 1D Pooling Projection

Replaces the old linear‑interpolation projection.

  • Prevents geometric distortion
  • Produces stable hidden‑dimension alignment
  • Fully Low‑VRAM compatible
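A minimal sketch of hidden-dimension alignment with adaptive average pooling, assuming the projection acts on the last axis of a token tensor. The helper name `project_hidden` and the widths 4096 and 1536 are illustrative choices, not values taken from the trainer.

```python
import torch
import torch.nn.functional as F


def project_hidden(x: torch.Tensor, target_dim: int) -> torch.Tensor:
    """Resize the last dimension of x to target_dim by adaptive average pooling."""
    lead_shape = x.shape[:-1]
    # adaptive_avg_pool1d expects (N, C, L); treat the hidden axis as L.
    flat = x.reshape(-1, 1, x.shape[-1])
    pooled = F.adaptive_avg_pool1d(flat, target_dim)
    return pooled.reshape(*lead_shape, target_dim)


tokens = torch.randn(2, 7, 4096)        # illustrative source width
aligned = project_hidden(tokens, 1536)  # illustrative target width
print(aligned.shape)                    # torch.Size([2, 7, 1536])
```

Pooling has no learnable parameters, so it adds no weights to keep in VRAM.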

✔ Temporal Collapse Fix

Video inputs now preserve temporal structure via frame‑wise mean pooling. This allows FFN (Low‑Rank) layers to learn motion correctly.
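One way to read "frame‑wise mean pooling" is to average over the spatial axes only, so each frame keeps its own feature vector. The sketch below assumes a `(batch, channels, frames, height, width)` layout; both the layout and the helper name `pool_frames` are assumptions made for illustration.

```python
import torch


def pool_frames(latents: torch.Tensor) -> torch.Tensor:
    """Average over height and width only, keeping one vector per frame.

    latents: (batch, channels, frames, height, width), an assumed layout.
    returns: (batch, frames, channels)
    """
    per_frame = latents.mean(dim=(-2, -1))  # (batch, channels, frames)
    return per_frame.transpose(1, 2)        # (batch, frames, channels)


video = torch.randn(1, 16, 8, 32, 32)
print(pool_frames(video).shape)             # torch.Size([1, 8, 16])
```

Averaging over the frame axis as well would return a single vector per clip, which is the collapse this fix is meant to avoid.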

✔ Dataset Frame Padding Guard

All video/image inputs are padded to a fixed frame count.

  • Prevents DataLoader crashes
  • Ensures consistent batch shapes
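The padding guard can be sketched in a few lines. The README does not say how short clips are filled, so repeating the last frame is an assumption here, as is the helper name `pad_frames`.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def pad_frames(frames: List[Frame], target: int) -> List[Frame]:
    """Return exactly `target` frames.

    Long clips are truncated; short clips repeat their last frame
    (an assumed padding mode, not confirmed by the trainer).
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    if not frames:
        raise ValueError("clip has no frames")
    clipped = frames[:target]
    return clipped + [clipped[-1]] * (target - len(clipped))


print(len(pad_frames(["img"], 16)))          # 16
print(len(pad_frames(list(range(40)), 16)))  # 16
```

Because every sample then has the same frame count, the default collate step can stack a batch without shape errors.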

✔ Corrected Optimizer / Backward Placement

optimizer.step() and backward() are now inside the batch loop

  • Eliminates silent gradient accumulation bugs
  • Ensures proper training flow
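The intended ordering looks roughly like the loop below: a generic PyTorch sketch with a toy model and random data, not the trainer's own loop. Clearing gradients at the top of each iteration is what keeps one batch's gradients from leaking into the next.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # toy stand-in for the LoRA parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

steps = 0
for inputs, targets in batches:
    optimizer.zero_grad(set_to_none=True)  # drop gradients from the previous batch
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()                        # gradients for this batch only
    optimizer.step()                       # one parameter update per batch
    steps += 1

print(steps)  # 3
```

If `backward()` or `step()` sat outside the loop, only the last batch would contribute to an update, or gradients would pile up across batches without any error being raised.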

✔ NaN / Inf Safety System

All layer losses are checked and skipped if invalid. Prevents silent corruption during long training runs.
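A minimal sketch of the check, using plain floats for clarity. The function name and the dictionary of per-layer losses are hypothetical; with tensors, `torch.isfinite` plays the same role as `math.isfinite` here.

```python
import math
from typing import Dict, List, Tuple


def sum_valid_losses(layer_losses: Dict[str, float]) -> Tuple[float, List[str]]:
    """Add up finite per-layer losses and report which layers were skipped."""
    total = 0.0
    skipped: List[str] = []
    for name, value in layer_losses.items():
        if math.isfinite(value):
            total += value
        else:
            skipped.append(name)  # NaN or +/-Inf: leave it out of the total
    return total, skipped


total, skipped = sum_valid_losses(
    {"attn.q": 0.5, "attn.k": float("nan"), "ffn.0": float("inf"), "ffn.2": 0.25}
)
print(total, skipped)  # 0.75 ['attn.k', 'ffn.0']
```

Skipping the invalid term matters because a single NaN added to the total would make every later gradient NaN as well.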

✔ FP32 Drift Accumulator

Loss accumulation is performed in FP32 to avoid FP16 underflow. Improves long‑step stability and convergence.
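The effect can be reproduced without a GPU by emulating the two precisions with the standard `struct` module. This is an illustrative experiment, not the trainer's accumulator: the per-step value `1e-4` and the step count are arbitrary, and the mechanism shown is small addends being rounded away once the running total grows, which is one way FP16 accumulation loses information.

```python
import struct


def as_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def as_fp32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


step_loss = 1e-4  # arbitrary small per-step value
acc16 = 0.0
acc32 = 0.0
for _ in range(10_000):
    acc16 = as_fp16(acc16 + as_fp16(step_loss))  # accumulate in emulated FP16
    acc32 = as_fp32(acc32 + step_loss)           # accumulate in emulated FP32

print(f"fp16 total: {acc16:.4f}")
print(f"fp32 total: {acc32:.4f}")
```

The exact total is 1.0. The FP32 accumulator lands very close to it, while the FP16 one stops growing around 0.25, where each new addend is smaller than half the spacing between representable values.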

📌 Recommended

Please use v5.7.7 TrueFinal for all training.
This version provides the most stable, predictable, and correct Twin‑Engine behavior.

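🚀 v5.7.7 TrueFinal: Stable Release

This is the final stable release of the Twin‑Engine Trainer v5.7 series. The 5.7 series was updated frequently to stabilize Twin‑Engine behavior, and v5.7.7 TrueFinal completes all bug fixes and stabilization work.

This version includes the following improvements:

✔ Rank-Based Alpha

  • Attention = 32
  • FFN = 8
  • TE = 16

The scale best suited to each LoRA's role is applied automatically.

✔ Adaptive 1D Pooling Flow

  • The old interpolate call is removed
  • Safe dimension alignment via adaptive_avg_pool1d
  • Minimizes damage to the information geometry

✔ Temporal Collapse Fix

  • The frame dimension of video inputs is preserved with mean pooling
  • Motion learning (FFN) works correctly

✔ Dataset Padding Guard

  • Automatically corrects variation in video frame counts
  • Prevents collate crashes

✔ Optimizer / Backward Fix

  • The placement of backward/step is fully corrected
  • Resolves the silent gradient accumulation bug
  • Stabilized with zero_grad(set_to_none=True)

✔ NaN / Inf Guard

  • NaN/Inf values are skipped automatically in every layer
  • Training does not break down even in long runs

✔ FP32 Drift Accumulator

  • Prevents fp16 underflow
  • Loss stays stable over long step counts

📌 Recommended

If you use Twin‑Engine Trainer, please use v5.7.7 TrueFinal.
This version is the most stable and reproduces the expected behavior.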

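Note: this repository does not include any trained LoRA files.
(Large files are not uploaded because the author works over a tethered connection.)
If you need one, please train it yourself with this trainer.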

A high-efficiency, fully-integrated LoRA training framework for Wan2.1 / Wan2.2 (1.3B, 7B, 14B) video generation models, engineered to run on a single 12GB VRAM GPU such as the RTX 3060.

Developed by Akira and AI Collaborator.


🚀 Key Features (v5.7)

12GB VRAM Breakthrough

Full-model DiT + UMT5-XXL LoRA injection with peak VRAM under 8.7GB, which leaves headroom against OOM failures on a 12GB card.

Universal Multi-Scale Auto-Detection

Automatically detects 1.3B / 7B / 14B and applies correct .alpha metadata:

  • 8.0 → 1.3B
  • 32.0 → 14B

Twin-Engine Architecture

Simultaneous LoRA injection into:

  • DiT pipeline
  • UMT5 Text Encoder (193 layers)
    using reference-based structural scanning.

Cognitive Cross-Attention Loss

1D squeeze-alignment loss for stable trigger-word behavior without tensor broadcasting errors.

Native ComfyUI Compatibility

Output LoRA files load perfectly in:

  • EasyWan
  • ComfyUI standard LoRA loader
  • WanVideoWrapper (with optional TE patch)

NEW in v5.7 — Persistent Logging System

Each run automatically writes two separate, timestamped log files inside the output directory:

  • output_lora/logs/train_log_YYYYMMDD_HHMMSS.txt (progress, checkpoints, loss track)
  • output_lora/logs/error_log_YYYYMMDD_HHMMSS.txt (exceptions, shape mismatches, OOM logs)
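The naming scheme above can be sketched as follows. The helper `make_log_paths` is hypothetical; only the directory layout and the `YYYYMMDD_HHMMSS` pattern come from this README.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def make_log_paths(output_dir: str, now: Optional[datetime] = None) -> Tuple[Path, Path]:
    """Build the train/error log paths for one run and create the logs folder."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    log_dir = Path(output_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    train_log = log_dir / f"train_log_{stamp}.txt"
    error_log = log_dir / f"error_log_{stamp}.txt"
    return train_log, error_log


train_log, error_log = make_log_paths("output_lora", datetime(2025, 1, 2, 3, 4, 5))
print(train_log.name)  # train_log_20250102_030405.txt
print(error_log.name)  # error_log_20250102_030405.txt
```

Sharing one timestamp between both files makes it easy to match an error log to the training log of the same run.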

📦 Requirements

pip install torch torchvision accelerate safetensors opencv-python pillow bitsandbytes gradio

🔧 Recommended Settings for High‑Side (Image LoRA)

These values are optimized for the High‑Engine (static) branch of the Twin‑Engine system:

  1. LoRA Rank Dimension: 32 → High‑side handles detail, texture, and identity fidelity.

  2. Target Learning Rate: 1e-4 → Strong enough to capture static features without overshooting.

  3. Text Encoder Loss Scale: 1e-6 → Keeps TE stable while preserving trigger‑word consistency.

  4. Maximum Dataset Frames: 1 → High‑side is designed for static images only.

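🔧 Recommended Settings for the High Side (Still‑Image LoRA)

These values are optimized for the High‑Engine (still‑image side) of Twin‑Engine:

  1. LoRA Rank: 32 → The High side is responsible for detail, texture, and character fidelity.

  2. Learning Rate: 1e-4 → Strong enough to learn still‑image features thoroughly without running away.

  3. TE Loss Scale: 1e-6 → Keeps trigger words stable without letting the text encoder become unstable.

  4. Maximum Frames: 1 → The High side is for still images only, so 1 is optimal.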

🔧 Recommended Settings for Low‑Side (Video LoRA)

These values are tuned specifically for Wan2.1/2.2 Twin‑Engine I2V LoRA and provide the most stable motion‑focused learning:

  1. Target Learning Rate: 7e-5 → Softer than High‑side; captures motion without overfitting.

  2. LoRA Rank Dimension: 8 → Ideal for Low‑side when Twin‑Engine is used (High‑side handles detail).

  3. Text Encoder Loss Scale: 1e-6 → Keeps TE stable without interfering with DiT motion learning.

  4. Maximum Dataset Video Frames: 16 → Best stability zone for typical datasets; avoids motion collapse.

These settings produce a clean, motion‑focused Low‑side LoRA that pairs well with a High‑side Rank32 model.
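For scripted runs, the two recommended presets can be kept side by side in a small data structure. This is a sketch: `EnginePreset` and `to_cli_args` are hypothetical helpers, and only the flags that appear in the CLI example under Usage are emitted. The learning-rate flag name is not shown in this README, so the value is stored but not turned into an argument.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EnginePreset:
    lora_rank: int
    learning_rate: float
    te_loss_scale: float
    max_frames: int


# Values copied from the High-side and Low-side recommendations.
HIGH_SIDE = EnginePreset(lora_rank=32, learning_rate=1e-4, te_loss_scale=1e-6, max_frames=1)
LOW_SIDE = EnginePreset(lora_rank=8, learning_rate=7e-5, te_loss_scale=1e-6, max_frames=16)


def to_cli_args(preset: EnginePreset) -> List[str]:
    """Render the preset as arguments for flags shown in the CLI example."""
    return [
        "--lora_rank", str(preset.lora_rank),
        "--te_loss_scale", str(preset.te_loss_scale),
        "--max_frames", str(preset.max_frames),
    ]


print(to_cli_args(LOW_SIDE))
```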

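🔧 Recommended Settings for the Low Side (Video LoRA)

These values are optimized for Wan2.1/2.2 Twin‑Engine I2V LoRA and mark the stable zone for picking up motion cleanly and nothing else:

  1. Learning Rate: 7e-5 → Slightly milder than the High side, so that only motion is learned naturally.

  2. LoRA Rank: 8 → The best Low‑side value when Twin‑Engine is assumed (the High side handles detail).

  3. TE Loss Scale: 1e-6 → A balance that keeps the text encoder steady and does not interfere with DiT motion learning.

  4. Maximum Frames (Max Frames): 16 → The most stable zone for realistic dataset sizes.

Training with these settings and pairing the result with a Rank32 High‑side LoRA gives the ideal Twin‑Engine split: motion from the Low side, texture from the High side.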


🛠️ Usage

GUI (Recommended)

python train_gui.py

CLI

python train_wan_lora.py \
  --pretrained_model_name_or_path "path/to/wan_model.safetensors" \
  --reference_lora_path "path/to/reference_lora.safetensors" \
  --reference_te_path "path/to/text_encoder.safetensors" \
  --instance_data_dir "path/to/your/dataset" \
  --output_dir "path/to/output_lora" \
  --trigger_word "your_trigger" \
  --max_train_steps 3000 \
  --lora_rank 32 \
  --te_loss_scale 1e-6 \
  --max_frames 16 \
  --use_8bit_adam

🧩 Recommended Hardware (Reference Machine)

This trainer was developed and verified on the following hardware configuration:

  • GPU: NVIDIA GeForce RTX 3060 (12GB VRAM)
  • RAM: 64GB
  • CPU: AMD Ryzen 5 PRO 4650G (6C/12T)
  • Storage: SSD recommended (for fast dataset loading)

This specification is not required, but represents a stable and efficient environment for training both High‑Engine (static) and Low‑Engine (video) LoRA models.


🖥️ Verified Environment

  • Windows 10/11
  • Python 3.10
  • PyTorch 2.3.1 + CUDA 12.1
  • ComfyUI (2024/05+)
  • EasyWan22 (latest version)
  • xformers 0.0.27.post2
  • bitsandbytes 0.43.1
  • Gradio 4.x

📐 Mathematical Framework — 1D Squeeze Loss

trigger_out = lora_layers[up_name](layer(trigger_resized)).squeeze(0).view(-1)
loss += 0.1 * torch.nn.functional.mse_loss(trigger_out, pred_flat)

Ensures stable cross-attention alignment under low VRAM.
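The two lines above depend on variables defined elsewhere in the trainer. A self-contained version, with small stand-in modules and random tensors, might look like the sketch below. The layer sizes, the `nn.Linear` stand-ins, and the random `pred_flat` are all assumptions made so the snippet runs on its own.

```python
import torch
from torch import nn
import torch.nn.functional as F

hidden, rank = 64, 8
layer = nn.Linear(hidden, rank, bias=False)  # stand-in for the wrapped layer
lora_layers = nn.ModuleDict({"up": nn.Linear(rank, hidden, bias=False)})
up_name = "up"

trigger_resized = torch.randn(1, hidden)     # trigger embedding, already resized
pred_flat = torch.randn(hidden)              # flattened prediction (random stand-in)

loss = torch.zeros(())
# Flatten the trigger output to 1D so it has exactly the shape of pred_flat.
trigger_out = lora_layers[up_name](layer(trigger_resized)).squeeze(0).view(-1)
loss = loss + 0.1 * F.mse_loss(trigger_out, pred_flat)

print(trigger_out.shape, loss.ndim)          # torch.Size([64]) 0
```

Because both operands are 1D tensors of equal length, `mse_loss` compares them element by element and no broadcasting takes place.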


📊 Dataset Classification

ID      Type    Engine       Notes
aaa001  Video   Low Engine   Early video test
bbb001  Static  High Engine  1.3B static test
ccc001  Static  High Engine  Full-quality 14B
ddd001  Video   Low Engine   Glasses-off girl

🔓 Developer Patch — Awakening True Twin-Engine

Apply inside: ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/nodes.py

# Inject UMT5-XXL Text Encoder LoRA
# (Full execution patch block code included inside the repository tree files)

🔍 Tools Included

✔ Wan LoRA Analyzer

Inspects any Wan LoRA and prints rank, alpha, and layer mapping.
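The kind of report such an analyzer produces can be sketched on a plain dictionary of tensor shapes. The key names (`.lora_down.weight`, `.alpha`) follow a common LoRA file convention and are an assumption here, as is the helper `summarize_lora`; the bundled tool may read files differently.

```python
from typing import Dict, Optional, Tuple


def summarize_lora(
    shapes: Dict[str, Tuple[int, ...]], alphas: Dict[str, float]
) -> Dict[str, Tuple[int, Optional[float]]]:
    """Map each LoRA module to (rank, alpha).

    Rank is read from the first axis of the down-projection weight,
    assuming a (rank, in_features) layout.
    """
    suffix = ".lora_down.weight"
    report = {}
    for key, shape in shapes.items():
        if key.endswith(suffix):
            module = key[: -len(suffix)]
            report[module] = (shape[0], alphas.get(module + ".alpha"))
    return report


shapes = {
    "blocks.0.attn.q.lora_down.weight": (32, 1536),
    "blocks.0.attn.q.lora_up.weight": (1536, 32),
}
alphas = {"blocks.0.attn.q.alpha": 32.0}
print(summarize_lora(shapes, alphas))  # {'blocks.0.attn.q': (32, 32.0)}
```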

✔ Text Encoder Scanner (v5.6.2 TrueFinal)

Extracts all 193 UMT5-XXL layers from .pth or .safetensors.


📄 License

MIT-like open license.
Please credit Akira if you use or modify this project.


🙏 Acknowledgements / お礼

This trainer is shared as a personal thank-you to the open-source AI community.
If it helps your workflow or inspires new ideas, that alone makes me happy.

I am leaving this here quietly as thanks for all the fun
the world of generative AI has given me.
Anyone who needs it is free to play with it as much as they like.
