---
license: apache-2.0
base_model: Wan-AI/Wan2.1-VACE-14B
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
- wan
- vace
---

# ViTeX-14B

**Vi**deo **Tex**t editing model based on Wan2.1-VACE-14B. It replaces the text
inside a user-provided mask region while preserving the original visual style
(font, color, stroke, shadow, perspective) and the surrounding scene.

| | |
|---|---|
| Base model | [Wan-AI/Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B) |
| Trainable parameters | **4.02 B** (VACE blocks + new modules) |
| New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
| Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
| Resolution | 720 × 1280 |
| Frames | 121 (≈ 5 s @ 24 fps) |
| Training data | 230 video samples × 10 `dataset_repeat` |
| Training | 2 epochs (576 optimizer steps), DeepSpeed ZeRO-3 + CPU offload |
| Hardware | 8 × NVIDIA H100 80 GB |

## Inputs

For each video to edit, the model needs four things:

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing the text to replace |
| `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (use any font; white glyphs on a black background are fine, see [data prep](#data-preparation)) |
| `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |

The model outputs a video in which the masked region is replaced by the target
text, rendered in the original style.

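As a quick pre-flight check before running the pipeline, the sketch below
verifies that the three video inputs agree in length and resolution. The
frame-list convention and the 0/255 mask encoding are assumptions for
illustration, not part of the released API:

```python
# Hypothetical sanity check for the three video inputs; assumes each video is
# a list of H x W x 3 (or H x W) uint8 numpy frames, mask binary as 0/255.
import numpy as np

def check_inputs(vace_video, vace_video_mask, glyph_video,
                 num_frames=121, hw=(720, 1280)):
    for name, vid in [("vace_video", vace_video),
                      ("vace_video_mask", vace_video_mask),
                      ("glyph_video", glyph_video)]:
        assert len(vid) == num_frames, f"{name}: expected {num_frames} frames"
        assert vid[0].shape[:2] == hw, f"{name}: expected {hw}, got {vid[0].shape[:2]}"
    values = set(np.unique(np.asarray(vace_video_mask)).tolist())
    assert values <= {0, 255}, "mask must be binary (0 = preserve, 255 = replace)"
```
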
## Architecture

Built on top of frozen Wan2.1-VACE-14B (40-layer DiT + 8 VACE blocks).
Two new components are added (both trained from scratch):

```
target text → render → glyph_video
        ↓
Wan VAE Encoder          ← shared with the main video latent
        ↓
GlyphEncoder             ← Conv3D patch embed + cross-attn pool to 64 tokens
        ↓
glyph tokens (64 × 5120)
        ↓
┌──────────────────────────────┐
│ for each VACE block (×8):    │
│   Self-Attn (frozen-init,    │
│              fine-tuned)     │
│        ↓                     │
│   Text Cross-Attn (T5)       │
│        ↓                     │
│   FFN                        │
│        ↓                     │
│   ┌────────────────────────┐ │
│   │ ConditionCrossAttn     │ ← K/V from glyph tokens (zero-init at start)
│   └────────────────────────┘ │
│        ↓  + residual         │
│   after_proj → c_skip        │
└──────────────────────────────┘
```

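To make the two new modules concrete, here is a minimal PyTorch sketch
consistent with the diagram above (64 glyph tokens, hidden size 5120,
zero-initialized output). Class names, head count, patch size, and the
16-channel latent input are illustrative assumptions, not the released
implementation:

```python
import torch
import torch.nn as nn

class GlyphEncoderSketch(nn.Module):
    """Illustrative only: Conv3D patch embedding over the glyph video's VAE
    latent, then cross-attention pooling into 64 learned query tokens."""
    def __init__(self, latent_channels=16, dim=5120, num_tokens=64, num_heads=40):
        super().__init__()
        # 3D patch embed: (B, C, T, H, W) latent -> sequence of patch tokens
        self.patch_embed = nn.Conv3d(latent_channels, dim,
                                     kernel_size=(1, 2, 2), stride=(1, 2, 2))
        # 64 learned queries pool the variable-length patch sequence
        self.queries = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, glyph_latent):          # (B, C, T, H, W) from the Wan VAE
        x = self.patch_embed(glyph_latent)    # (B, dim, T, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)      # (B, N_patches, dim)
        q = self.queries.expand(x.shape[0], -1, -1)
        tokens, _ = self.pool(q, x, x)        # (B, 64, dim)
        return self.norm(tokens)

class ConditionCrossAttentionSketch(nn.Module):
    """Per-VACE-block cross-attention: queries from the block's hidden states,
    K/V from the glyph tokens. The output projection is zero-initialized so
    the module is a no-op at the start of training."""
    def __init__(self, dim=5120, num_heads=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, hidden, glyph_tokens):  # (B, L, dim), (B, 64, dim)
        attended, _ = self.attn(hidden, glyph_tokens, glyph_tokens)
        return hidden + self.out(attended)    # residual; identity at init
```
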
The VACE conditioning input (VCU) preserves the **original masked region's
pixels** in the `reactive` channel:

```
inactive = VAE(video × (1 − mask))            # context outside the mask
reactive = VAE(video × mask)                  # original glyphs inside the mask (style cue)
mask     = downsample(mask)
VCU      = concat(inactive, reactive, mask)   # 96 channels
```

This lets the model see the original text's color/font/stroke and learn to
re-render the new content in the same style.

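A runnable version of that pseudocode, assuming a `vae_encode` callable that
maps pixel video to a 16-channel latent; the exact mask rearrangement that
brings the concat to the stated 96 channels is implementation-specific, so it
is only noted in a comment:

```python
import torch
import torch.nn.functional as F

def build_vcu(video, mask, vae_encode):
    """Sketch of the VCU assembly above. Interfaces are assumptions:
    video (B, 3, T, H, W) in [-1, 1]; mask (B, 1, T, H, W) with 1 = replace."""
    inactive = vae_encode(video * (1 - mask))  # context latent outside the mask
    reactive = vae_encode(video * mask)        # original glyph latent (style cue)
    # Downsample the mask to the latent grid. The released code additionally
    # folds spatial blocks of the mask into channels to reach 96 channels total.
    m = F.interpolate(mask, size=inactive.shape[-3:], mode="nearest")
    return torch.cat([inactive, reactive, m], dim=1)
```
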
## Installation

The model requires a modified DiffSynth-Studio repo that introduces the
GlyphEncoder and ConditionCrossAttention modules.

```bash
git clone https://github.com/<your-org>/DiffSynth-Studio-TextVACE
cd DiffSynth-Studio-TextVACE
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -e .
pip install accelerate==1.13.0
```

Requirements: `torch>=2.7.0+cu128` and an NVIDIA GPU with ≥ 80 GB VRAM
(H100 / A100 80 GB). Inference uses ~70 GB VRAM at 720 × 1280 × 121 frames.

## Usage

```python
from huggingface_hub import snapshot_download
import torch
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.core import load_state_dict
import glob, os

# 1. Download the base model and this model
base_dir = snapshot_download("Wan-AI/Wan2.1-VACE-14B")
vitex_dir = snapshot_download("ViTeX-Bench/ViTeX-14B")
ckpt_path = os.path.join(vitex_dir, "vitex_14b.safetensors")

# 2. Build the pipeline
diffusion_shards = sorted(glob.glob(os.path.join(base_dir, "diffusion_pytorch_model-*.safetensors")))
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda:0",
    model_configs=[
        ModelConfig(path=diffusion_shards),
        ModelConfig(path=os.path.join(base_dir, "models_t5_umt5-xxl-enc-bf16.pth")),
        ModelConfig(path=os.path.join(base_dir, "Wan2.1_VAE.pth")),
    ],
    tokenizer_config=ModelConfig(path=os.path.join(base_dir, "google/umt5-xxl")),
    redirect_common_files=False,
)

# 3. Load the ViTeX trained weights on top of the base VACE module
pipe.vace.load_state_dict(load_state_dict(ckpt_path), strict=False)

# 4. Prepare inputs (see inference_example.py for the video loading helper)
from inference_example import load_video_frames, save_video
vace_video = load_video_frames("input.mp4", target_frames=121, resize=(720, 1280))
vace_mask = load_video_frames("input_mask.mp4", target_frames=121, resize=(720, 1280))
glyph = load_video_frames("glyph.mp4", target_frames=121, resize=(720, 1280))

# 5. Run
out_frames = pipe(
    prompt="Change the sign to read 'HILTON'",
    negative_prompt="",
    vace_video=vace_video,
    vace_video_mask=vace_mask,
    glyph_video=glyph,
    seed=42, height=720, width=1280, num_frames=121,
    cfg_scale=5.0, num_inference_steps=50, tiled=True,
)
save_video(out_frames, "output.mp4")
```

A complete runnable script is provided as `inference_example.py` in this repo.

## Data preparation

To produce `glyph_video` from a target text string:

1. Track the text-region bounding box per frame (we use TrackAnything / ROMP).
2. Render the target string with `cv2.putText` or PIL inside the box on a black background.
3. Save as MP4 with the same frame count and resolution as the source video.

`vace_video_mask` is a binary per-frame mask of the text region (1 = replace).
You can produce it from the same tracking output by slightly dilating the
tight bounding box.

The repo's `scripts/render_glyph_tracked.py` and `scripts/prepare_textvace_data.py`
provide reference implementations; a simplified sketch follows below.

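For orientation, a minimal PIL-based per-frame sketch of steps 2 and 3. The
box format, font path, dilation amount, and the 80 % fill heuristic are
illustrative assumptions; the repo scripts are the reference implementation:

```python
# Hypothetical per-frame renderers for glyph_video and vace_video_mask.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_frame(text, box, size=(1280, 720), font_path="DejaVuSans.ttf"):
    """White target text on black, placed inside the tracked box (x0, y0, x1, y1)."""
    frame = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(frame)
    x0, y0, x1, y1 = box
    # crude fit: scale the font height to ~80% of the box height
    font = ImageFont.truetype(font_path, int((y1 - y0) * 0.8))
    draw.text((x0, y0), text, fill="white", font=font)
    return np.asarray(frame)

def render_mask_frame(box, size=(1280, 720), dilate=4):
    """Binary mask: 255 inside the slightly dilated box, 0 elsewhere."""
    mask = Image.new("L", size, 0)
    x0, y0, x1, y1 = box
    ImageDraw.Draw(mask).rectangle((x0 - dilate, y0 - dilate,
                                    x1 + dilate, y1 + dilate), fill=255)
    return np.asarray(mask)

# Render one frame per tracked box, stack, and encode to MP4 at the source fps.
```
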
## Training details

- Stage 1 (49 frames @ 720P, 5 epochs, ~22 h): bootstrap on shorter clips
- Stage 2 (121 frames @ 720P, 2 epochs, ~30 h): fine-tune at full length
- Optimizer: AdamW, lr=1e-5, weight_decay=1e-2, no LR schedule
- Gradient accumulation: 8; effective batch size = 8 GPUs × 8 steps = 64
- DeepSpeed ZeRO-3 with both parameter and optimizer-state CPU offload (config sketched below)
- Manual activation offload + `--use_gradient_checkpointing_offload`
- VACE module fully trained; main DiT, T5, and VAE frozen

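A hypothetical DeepSpeed configuration matching those bullets (the actual
launch files are not published; the micro-batch size and bf16 flag are
assumptions):

```python
# Illustrative DeepSpeed ZeRO-3 config; micro-batch size and bf16 are
# assumptions, not published values.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},      # parameter CPU offload
        "offload_optimizer": {"device": "cpu"},  # optimizer-state CPU offload
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "weight_decay": 1e-2},
    },
    "gradient_accumulation_steps": 8,
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
}
```
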
## Limitations

- Trained on only 230 samples, so coverage of artistic fonts, complex
  backgrounds, and non-Latin scripts is limited.
- Best on planar text (signs, posters); fast-moving or highly distorted text
  may degrade.
- Inference requires the full 14 B base model; no quantized variants are
  released.
- Tested only on single-node 8 × H100 80 GB hardware; no multi-node sharding
  scripts are included.

## Citation

```bibtex
@misc{vitex2026,
  title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
  author = {ViTeX Team},
  year   = {2026},
  url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
}
```

## Acknowledgements

Built on top of [Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B)
by the Wan-Video team, and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).