ViTeX-Bench committed on
Commit
9a431ad
·
verified ·
1 Parent(s): 64c8573

Update README to be fully self-contained (v2)

  1. README.md +111 -83
README.md CHANGED
@@ -1,34 +1,56 @@
1
  ---
2
  license: apache-2.0
3
- base_model: Wan-AI/Wan2.1-VACE-14B
4
  pipeline_tag: text-to-video
5
  tags:
6
  - video-editing
7
  - text-editing
8
  - text-replacement
9
  - diffusion
10
- - wan
11
- - vace
12
  ---
13
 
14
  # ViTeX-14B
15
 
16
- **Vi**deo **Tex**t editing model based on Wan2.1-VACE-14B. Replaces text content
17
- inside a user-provided mask region while preserving the original visual style
18
- (font, color, stroke, shadow, perspective) and the surrounding scene.
 
 
 
 
19
 
20
  | | |
21
  |---|---|
22
- | Base model | [Wan-AI/Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B) |
23
  | Trainable parameters | **4.02 B** (VACE blocks + new modules) |
24
  | New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
25
  | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
26
  | Resolution | 720 × 1280 |
27
  | Frames | 121 (≈ 5 s @ 24 fps) |
28
  | Training data | 230 video samples × 10 dataset_repeat |
29
- | Training | 2 epochs (576 optimizer steps), DeepSpeed ZeRO-3 + CPU offload |

30
  | Hardware | 8 × NVIDIA H100 80 GB |
31
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
  ## Inputs
33
 
34
  For each video to edit, the model needs four things:
@@ -37,7 +59,7 @@ For each video to edit, the model needs four things:
37
  |---|---|---|
38
  | `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing text to replace |
39
  | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
40
- | `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (use any font; black bg + white glyphs is fine; see [data prep](#data-preparation)) |
41
  | `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |
42
 
43
  The model outputs a video with the masked region replaced by the target text,
@@ -45,8 +67,9 @@ matching the original style.
45
 
46
  ## Architecture
47
 
48
- Built on top of frozen Wan2.1-VACE-14B (40-layer DiT + 8 VACE blocks).
49
- Two new components are added (both trained from scratch):
 
50
 
51
  ```
52
  target text → render → glyph_video
@@ -59,8 +82,7 @@ target text → render → glyph_video
59
  ↓
60
  ┌─────────────────────────────┐
61
  │ for each VACE block (×8): │
62
- │ Self-Attn (frozen-init, │
63
- │ fine-tuned) │
64
  │ ↓ │
65
  │ Text Cross-Attn (T5) │
66
  │ ↓ │
@@ -75,129 +97,135 @@ target text → render → glyph_video
75
  ```
76
 
77
  The VACE conditioning input (VCU) preserves the **original masked region's
78
- pixels** in the `reactive` channel:
 
 
79
  ```
80
- inactive = VAE(video × (1 − mask)) # context outside mask
81
  reactive = VAE(video × mask) # original glyphs inside mask (style cue)
82
  mask = downsample(mask)
83
- VCU = concat(inactive, reactive, mask) # 96 channels
84
  ```
85
- This lets the model see the original text's color/font/stroke and learn to
86
- re-render the new content in the same style.
87
 
88
- ## Installation
 
 
 
89
 
90
- The model uses the modified DiffSynth-Studio repo that introduces the GlyphEncoder
91
- and ConditionCrossAttention modules.
92
 
93
  ```bash
94
- git clone https://github.com/<your-org>/DiffSynth-Studio-TextVACE
95
- cd DiffSynth-Studio-TextVACE
96
- conda create -n vitex python=3.12 -y && conda activate vitex
97
- pip install -e .
98
- pip install accelerate==1.13.0
 
 
 
 
99
  ```
100
 
101
- Required: `torch>=2.7.0+cu128`, NVIDIA GPU with ≥ 80 GB VRAM (H100 / A100 80GB).
102
- Inference uses ~ 70 GB VRAM at 720 × 1280 × 121 frames.
 
 
 
103
 
104
  ## Usage
105
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
106
  ```python
107
- from huggingface_hub import snapshot_download
108
- import torch
 
109
  from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
110
  from diffsynth.core import load_state_dict
111
- import glob, os
112
 
113
- # 1. Download base + this model
114
- base_dir = snapshot_download("Wan-AI/Wan2.1-VACE-14B")
115
- vitex_dir = snapshot_download("ViTeX-Bench/ViTeX-14B")
116
- ckpt_path = os.path.join(vitex_dir, "vitex_14b.safetensors")
117
 
118
- # 2. Build pipeline
119
- diffusion_shards = sorted(glob.glob(os.path.join(base_dir, "diffusion_pytorch_model-*.safetensors")))
120
  pipe = WanVideoPipeline.from_pretrained(
121
  torch_dtype=torch.bfloat16,
122
  device="cuda:0",
123
  model_configs=[
124
  ModelConfig(path=diffusion_shards),
125
- ModelConfig(path=os.path.join(base_dir, "models_t5_umt5-xxl-enc-bf16.pth")),
126
- ModelConfig(path=os.path.join(base_dir, "Wan2.1_VAE.pth")),
127
  ],
128
- tokenizer_config=ModelConfig(path=os.path.join(base_dir, "google/umt5-xxl")),
129
  redirect_common_files=False,
130
  )
 
131
 
132
- # 3. Load ViTeX trained weights on top of base VACE
133
- pipe.vace.load_state_dict(load_state_dict(ckpt_path), strict=False)
134
-
135
- # 4. Prepare inputs (see inference_example.py for video loading helper)
136
- from inference_example import load_video_frames, save_video
137
- vace_video = load_video_frames("input.mp4", target_frames=121, resize=(720, 1280))
138
- vace_mask = load_video_frames("input_mask.mp4", target_frames=121, resize=(720, 1280))
139
- glyph = load_video_frames("glyph.mp4", target_frames=121, resize=(720, 1280))
140
-
141
- # 5. Run
142
- out_frames = pipe(
143
- prompt="Change the sign to read 'HILTON'",
144
- negative_prompt="",
145
- vace_video=vace_video,
146
- vace_video_mask=vace_mask,
147
- glyph_video=glyph,
148
- seed=42, height=720, width=1280, num_frames=121,
149
- cfg_scale=5.0, num_inference_steps=50, tiled=True,
150
- )
151
- save_video(out_frames, "output.mp4")
152
  ```
153
 
154
- A complete runnable script is provided as `inference_example.py` in this repo.
 
155
 
156
  ## Data preparation
157
 
158
  To produce `glyph_video` from a target text string:
159
 
160
- 1. Track text-region bounding box per frame (we use TrackAnything / ROMP).
161
- 2. Render the target string with `cv2.putText` or PIL inside the box on a black background.
162
- 3. Save as MP4 with the same frame count and resolution as the source video.
 
163
 
164
- `vace_video_mask` is a binary per-frame mask of the text region (1 = replace).
165
- You can produce it from the same tracking + a tight bounding box dilation.
166
 
167
- The repo's `scripts/render_glyph_tracked.py` and `scripts/prepare_textvace_data.py`
168
- provide reference implementations.
169
 
170
- ## Training details
 
 
 
171
 
172
- - Stage 1 (49 frames @ 720P, 5 epochs, ~22 h): bootstrap on shorter clips
173
- - Stage 2 (121 frames @ 720P, 2 epochs, ~30 h): fine-tune at full length
174
- - Optimizer: AdamW, lr=1e-5, weight_decay=1e-2, no LR schedule
175
- - Grad accumulation: 8, effective batch = 8 GPUs Γ— 8 = 64 micro-batches
176
- - DeepSpeed ZeRO-3 with both parameter and optimizer state CPU offload
177
- - Manual activation offload + `--use_gradient_checkpointing_offload`
178
- - VACE module fully trained; DiT main + T5 + VAE frozen
179
 
180
  ## Limitations
181
 
182
- - Trained on 230 samples; coverage of artistic fonts, complex backgrounds,
183
  and non-Latin scripts is limited.
184
  - Best on planar text (signs, posters); fast-moving or highly distorted text
185
  may degrade.
186
- - Inference requires the full 14 B base model; no quantized variants released.
187
- - Single 8 × H100 80 GB inference; no multi-node sharding scripts included.
188
 
189
  ## Citation
190
 
191
  ```bibtex
192
  @misc{vitex2026,
193
  title = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
194
- author = {ViTeX Team},
195
  year = {2026},
196
  url = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
197
  }
198
  ```
199
 
200
- ## Acknowledgements
201
 
202
- Built on top of [Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B)
203
- by the Wan-Video team, and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).
 
1
  ---
2
  license: apache-2.0
 
3
  pipeline_tag: text-to-video
4
  tags:
5
  - video-editing
6
  - text-editing
7
  - text-replacement
8
  - diffusion
 
 
9
  ---
10
 
11
  # ViTeX-14B
12
 
13
+ **Vi**deo **Tex**t editing model. Replaces text content inside a user-provided
14
+ mask region of a video while preserving the original visual style (font, color,
15
+ stroke, shadow, perspective) and the surrounding scene.
16
+
17
+ This repository is **fully self-contained**: it bundles the trained weights,
18
+ the full base model required for inference, and all custom code needed to run
19
+ it. No external code repositories or third-party model downloads are required.
20
 
21
  | | |
22
  |---|---|
 
23
  | Trainable parameters | **4.02 B** (VACE blocks + new modules) |
24
  | New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
25
  | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
26
  | Resolution | 720 × 1280 |
27
  | Frames | 121 (≈ 5 s @ 24 fps) |
28
  | Training data | 230 video samples × 10 dataset_repeat |
29
+ | Training | **Stage 1**: 5 epochs @ 49 frames (~22 h) → **Stage 2**: 2 epochs @ 121 frames (~30 h) |
30
+ | Optimizer | AdamW lr=1e-5, ZeRO-3 + CPU offload, grad-accum 8 |
31
  | Hardware | 8 × NVIDIA H100 80 GB |
32
 
33
+ ## Repository contents
34
+
35
+ ```
36
+ .
37
+ ├── README.md (this file)
38
+ ├── requirements.txt (pip dependencies)
39
+ ├── inference_example.py (runnable end-to-end inference)
40
+ ├── vitex_14b.safetensors (8 GB, trained adapter weights)
41
+ ├── diffsynth/ (3 MB, bundled inference library)
42
+ │   ├── pipelines/
43
+ │   ├── models/
44
+ │   ├── core/
45
+ │   └── ...
46
+ └── base_model/ (70 GB, the underlying frozen base model)
47
+     ├── config.json
48
+     ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
49
+     ├── models_t5_umt5-xxl-enc-bf16.pth
50
+     ├── Wan2.1_VAE.pth
51
+     └── google/umt5-xxl/... (T5 tokenizer)
52
+ ```
53
+
54
  ## Inputs
55
 
56
  For each video to edit, the model needs four things:
 
59
  |---|---|---|
60
  | `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing text to replace |
61
  | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
62
+ | `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (any font; black bg + white glyphs is fine) |
63
  | `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |
64
 
65
  The model outputs a video with the masked region replaced by the target text,
 
67
 
68
  ## Architecture
69
 
70
+ Built on top of a frozen 40-layer DiT video diffusion backbone (the `base_model/`)
71
+ with 8 attached VACE blocks (at layers 0, 5, 10, 15, 20, 25, 30, 35).
72
+ Two new components are introduced and trained from scratch:
73
 
74
  ```
75
  target text → render → glyph_video
 
82
  ↓
83
  ┌─────────────────────────────┐
84
  │ for each VACE block (×8): │
85
+ │ Self-Attn (fine-tuned) │

86
  │ ↓ │
87
  │ Text Cross-Attn (T5) │
88
  │ ↓ │
 
97
  ```
98
 
99
  The VACE conditioning input (VCU) preserves the **original masked region's
100
+ pixels** in the `reactive` channel so the model can perceive the original
101
+ text style:
102
+
103
  ```
104
+ inactive = VAE(video × (1 − mask)) # context outside mask (other text, scene)
105
  reactive = VAE(video × mask) # original glyphs inside mask (style cue)
106
  mask = downsample(mask)
107
+ VCU = concat(inactive, reactive, mask) # 96 channels → VACE blocks
108
  ```
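
As a rough check on the `96 channels` figure, the sketch below tallies the VCU shape for one 121-frame 720 × 1280 clip. It assumes a 16-channel VAE latent with 8× spatial and 4× temporal compression, and a mask whose 8 × 8 pixel patches are folded into 64 channels. None of these factors is stated in this README, and `vcu_shape` is a hypothetical helper, not part of the bundled `diffsynth` library.

```python
def vcu_shape(frames=121, height=720, width=1280,
              latent_channels=16, spatial_stride=8, temporal_stride=4):
    """Return (channels, latent_frames, latent_h, latent_w) for the VCU tensor.

    All compression factors are assumptions, not values read from the checkpoint.
    """
    # First frame is kept on its own; the rest are grouped by the temporal stride.
    latent_frames = (frames - 1) // temporal_stride + 1
    latent_h = height // spatial_stride
    latent_w = width // spatial_stride
    # Each stride x stride pixel patch of the mask is folded into the channel axis.
    mask_channels = spatial_stride * spatial_stride
    # inactive latent + reactive latent + folded mask
    channels = latent_channels * 2 + mask_channels
    return channels, latent_frames, latent_h, latent_w


print(vcu_shape())  # (96, 31, 90, 160)
```

Under these assumptions the channel count works out to 16 + 16 + 64 = 96, which agrees with the comment in the pseudocode above.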
 
 
109
 
110
+ `ConditionCrossAttention.o` and `GlyphEncoder.out_proj` are both
111
+ **zero-initialized**, so training starts from the pretrained behaviour and
112
+ gradually learns to incorporate the glyph signal, analogous to the zero-conv
113
+ trick in ControlNet.
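
The effect of a zero-initialized output projection can be illustrated with a toy residual branch. This is a minimal sketch in plain Python, assuming the new modules add their output to the hidden state residually (the README does not spell this out); `matvec` and `residual_branch` are illustrative names, not functions from the bundled code.

```python
def matvec(weight, vec):
    """Multiply a (rows x cols) nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def residual_branch(hidden, glyph_feat, out_proj):
    """Add the projected glyph feature to the hidden state (assumed residual form)."""
    update = matvec(out_proj, glyph_feat)
    return [h + u for h, u in zip(hidden, update)]


hidden = [0.5, -1.0, 2.0]
glyph_feat = [3.0, 4.0]

# Zero-initialized projection: the branch contributes nothing at step 0,
# so the output equals the pretrained hidden state exactly.
zero_proj = [[0.0, 0.0] for _ in hidden]
print(residual_branch(hidden, glyph_feat, zero_proj) == hidden)  # True
```

Once training moves the projection weights away from zero, the glyph feature starts to influence the hidden state, while the gradient with respect to the projection is non-zero from the first step because `glyph_feat` itself is non-zero.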
114
 
115
+ ## Installation
 
116
 
117
  ```bash
118
+ # 1. Download this whole repository (~78 GB; needs git-lfs)
119
+ git lfs install
120
+ git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
121
+ cd ViTeX-14B
122
+
123
+ # 2. Set up a fresh Python env and install the standard PyPI deps
124
+ conda create -n vitex python=3.12 -y
125
+ conda activate vitex
126
+ pip install -r requirements.txt
127
  ```
128
 
129
+ Hardware requirements:
130
+ - 1 × NVIDIA GPU with **≥ 80 GB VRAM** (H100 / A100 80 GB)
131
+ - ~ 70 GB peak VRAM at 720 × 1280 × 121 frames
132
+ - ~ 250 GB CPU RAM recommended (DiT weights + offloads during loading)
133
+ - ~ 90 GB free disk for repo + workspace
134
 
135
  ## Usage
136
 
137
+ End-to-end inference with the provided script:
138
+
139
+ ```bash
140
+ python inference_example.py \
141
+ --vace_video path/to/source.mp4 \
142
+ --vace_mask path/to/mask.mp4 \
143
+ --glyph_video path/to/target_glyph.mp4 \
144
+ --prompt "Change the sign to read 'HILTON'" \
145
+ --output out.mp4
146
+ ```
147
+
148
+ The script automatically uses the bundled `base_model/` directory and the
149
+ `vitex_14b.safetensors` weights; no further downloads are needed.
150
+
151
+ Programmatic use:
152
+
153
  ```python
154
+ import sys, os
155
+ sys.path.insert(0, ".") # so `import diffsynth` resolves to bundled lib
156
+ import torch, glob
157
  from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
158
  from diffsynth.core import load_state_dict
 
159
 
160
+ base_dir = "./base_model"
161
+ diffusion_shards = sorted(glob.glob(f"{base_dir}/diffusion_pytorch_model-*.safetensors"))
 
 
162
 
 
 
163
  pipe = WanVideoPipeline.from_pretrained(
164
  torch_dtype=torch.bfloat16,
165
  device="cuda:0",
166
  model_configs=[
167
  ModelConfig(path=diffusion_shards),
168
+ ModelConfig(path=f"{base_dir}/models_t5_umt5-xxl-enc-bf16.pth"),
169
+ ModelConfig(path=f"{base_dir}/Wan2.1_VAE.pth"),
170
  ],
171
+ tokenizer_config=ModelConfig(path=f"{base_dir}/google/umt5-xxl"),
172
  redirect_common_files=False,
173
  )
174
+ pipe.vace.load_state_dict(load_state_dict("./vitex_14b.safetensors"), strict=False)
175
 
176
+ # ... feed in vace_video / vace_video_mask / glyph_video / prompt ...
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
177
  ```
178
 
179
+ See `inference_example.py` for a complete reference, including video loading
180
+ and saving helpers.
181
 
182
  ## Data preparation
183
 
184
  To produce `glyph_video` from a target text string:
185
 
186
+ 1. Detect / track the text-region bounding box per frame.
187
+ 2. Render the target string with `cv2.putText` or PIL inside the box on a
188
+ black background; export as MP4 with the same frame count and resolution
189
+ as the source.
190
 
191
+ `vace_video_mask` is a binary per-frame mask of the text region (1 = replace);
192
+ typically a tight, slightly dilated box around the tracked region.
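
The mask construction described above can be sketched without any imaging library. The helper below is hypothetical (it is not one of the repository's scripts) and assumes boxes are given as `(x0, y0, x1, y1)` in pixels, end-exclusive; writing the resulting frames to an MP4 is left to whatever video I/O you already use.

```python
def box_mask(height, width, box, dilate=0):
    """Build a binary mask (nested lists) from a box (x0, y0, x1, y1), end-exclusive.

    `dilate` grows the box by that many pixels on every side, clamped to the frame.
    """
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0 - dilate), max(0, y0 - dilate)
    x1, y1 = min(width, x1 + dilate), min(height, y1 + dilate)
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]


# One mask per frame, from hypothetical per-frame tracked boxes on a tiny 6 x 8 frame.
tracked_boxes = [(2, 1, 5, 3), (3, 1, 6, 3)]
masks = [box_mask(6, 8, b, dilate=1) for b in tracked_boxes]
print(sum(map(sum, masks[0])))  # 20
```

For real inputs the same function would be called with `height=720, width=1280` and one box per frame, giving 121 masks that line up with the source video.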
193
 
194
+ ## Training summary
 
195
 
196
+ | Stage | Frames | Resolution | Epochs | Wall time | Notes |
197
+ |---|---|---|---|---|---|
198
+ | 1 | 49 | 720 × 1280 | 5 | ~22 h | bootstrap on shorter clips |
199
+ | 2 | 121 | 720 × 1280 | 2 | ~30 h | fine-tune at full length, init from Stage 1 epoch-4 |
200

201
+ - 230 video samples, `dataset_repeat=10` → 288 optimizer steps per epoch
202
+ - AdamW, lr 1e-5, weight_decay 1e-2, no LR schedule
203
+ - Gradient accumulation 8, effective batch 64 micro-batches
204
+ - DeepSpeed ZeRO-3 with parameter + optimizer state CPU offload
205
+ - `--use_gradient_checkpointing_offload` (manual activation offload)
206
+ - VACE module fully trained (4.02 B params); base DiT, T5, Wan VAE all frozen
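
The step counts above can be reproduced with a few lines of arithmetic. This is only one reading of the numbers: the 288 figure comes out when the 2,300 repeated samples are divided by a single factor of 8, and the README does not say whether that factor is the GPU count or the accumulation factor, so `divisor` below is an assumption.

```python
import math

samples, dataset_repeat = 230, 10
gpus, grad_accum = 8, 8
divisor = 8  # assumption: one factor of 8; the README does not say which one

items_per_epoch = samples * dataset_repeat               # 2300
steps_per_epoch = math.ceil(items_per_epoch / divisor)   # 288, as listed above
stage2_steps = 2 * steps_per_epoch                       # 576 for the 2-epoch stage
micro_batches_per_update = gpus * grad_accum             # 64

print(items_per_epoch, steps_per_epoch, stage2_steps, micro_batches_per_update)
# 2300 288 576 64
```

The 576 total for Stage 2 matches the "576 optimizer steps" quoted in the previous version of this README.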
 
207
 
208
  ## Limitations
209
 
210
+ - Trained on 230 samples; coverage of artistic fonts, complex backgrounds,
211
  and non-Latin scripts is limited.
212
  - Best on planar text (signs, posters); fast-moving or highly distorted text
213
  may degrade.
214
+ - Inference requires the full 14 B base; no quantized variant released.
215
+ - Single-GPU 80 GB inference assumed; multi-node sharding scripts not bundled.
216
 
217
  ## Citation
218
 
219
  ```bibtex
220
  @misc{vitex2026,
221
  title = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
222
+ author = {Anonymous},
223
  year = {2026},
224
  url = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
225
  }
226
  ```
227
 
228
+ ## License
229
 
230
+ Apache-2.0. See `LICENSE.txt` in `base_model/` for the upstream base model
231
+ license; the same license applies to the trained weights and bundled code.