Controllable Vector Video Generation via Wan 2.1 Fine-Tuning
This repository presents a parameter-efficient framework for controllable vector-style video generation by fine-tuning Wan 2.1 (1.3B) video diffusion models using LoRA.
Our goal is to enable prompt-controllable, temporally consistent vector animations under limited GPU memory constraints, making stylized video generation feasible on consumer-grade hardware.
Overview
- Base Model: Wan 2.1 (1.3B) video diffusion model
- Framework: DiffSynth-Studio
- Fine-Tuning: Low-Rank Adaptation (LoRA)
- Tasks: Text-to-Image (T2I), Text-to-Video (T2V), Video-to-Video (V2V)
- Focus: Vector-style animation, shape consistency, controllable motion
Key Contributions
- Fine-tuned Wan 2.1 video diffusion models for high-quality vector-stylized animation generation.
- Enabled text-prompt–controllable vector animations, supporting flexible semantic and stylistic control.
- Introduced a layer-wise vector animation strategy to decouple shape, color, and motion.
- Improved temporal consistency by ~25% compared to baseline Wan 2.1 generation.
- Demonstrated effective LoRA fine-tuning under 8–16 GB VRAM constraints.
Method Summary
We build upon DiffSynth-Studio and apply LoRA-based fine-tuning to Wan 2.1 by adapting only attention and feed-forward submodules:
- Target modules:
q, k, v, o, ffn.0, ffn.2 - Mixed precision training with
bfloat16 - Gradient checkpointing enabled
To stabilize stylized video generation, we introduce layer-wise vector supervision, which improves shape preservation and reduces temporal drift across frames.
Training Setup
Dataset
- 16 high-quality vector-style tiger images
- Transparent background with dense semantic tagging
- Resolution: 512 × 512
LoRA Configuration
- Rank: 16
- Alpha: 16
- Learning rate: 5e-5
- Steps per epoch: 500
- Epochs: 5
Experimental Insights
- Wan 2.1 (1.3B) provides the best trade-off between quality and memory efficiency.
- LoRA rank = 4 leads to insufficient style learning.
- LoRA rank = 16 significantly improves character identity and vector-style consistency.
- Wan 2.2 (5B) exceeds the feasible memory limit on 8 GB GPUs and results in OOM during inference.
Tiger Vector LoRA Model
The Tiger Vector LoRA Model is a LoRA adapter fine-tuned on Wan 2.1, designed for generating stylized tiger characters with clean vector outlines and stable motion.
- Model Type: LoRA Adapter
- Base Model: Wan 2.1 – 1.3B
- Optimized For: vector-style animation
- Release Date: 2026-01-18
Hugging Face:
(https://huggingface.co/jye224/Wan2.1_T2V_Tiger)
Example Prompt
We demonstrate the effectiveness of our method on T2V Tiger, a stylized fat tiger character designed to evaluate shape preservation, motion stability, and prompt controllability in vector-style text-to-video generation.
Prompt
tigersticker_unique001, adorable tiger cub, fluffy fur, playful and dynamic pose, smiling, soft lighting, smooth animation style, cute cartoon style, centered composition, simple white background, solo
|
Input / Reference |
→ |
Generated Clip A |
Generated Clip B |
Generated Clip C |
Observation.
The generated videos preserve clean vector-style outlines and consistent character identity across frames, while exhibiting diverse motion patterns driven solely by the text prompt. This highlights the effectiveness of LoRA-based fine-tuning for controllable vector-style video generation under limited computational budgets.