ViTeX-Bench committed
Commit 0f50cbb · verified · 1 Parent(s): 847e937

Simplify README: drop arch diagram, single-word prompt, plain text

Files changed (1): README.md +24 -156
README.md CHANGED
@@ -10,209 +10,78 @@ tags:
 
  # ViTeX-14B
 
- **Vi**deo **Tex**t editing model. Replaces text content inside a user-provided
- mask region of a video while preserving the original visual style (font, color,
- stroke, shadow, perspective) and the surrounding scene.
 
- This repository is **fully self-contained** — it bundles the trained weights,
- the full base model required for inference, and all custom code needed to run
- it. No external code repositories or third-party model downloads are required.
 
  | | |
  |---|---|
- | Trainable parameters | **4.02 B** (VACE blocks + new modules) |
- | New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
  | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
  | Resolution | 720 × 1280 |
- | Frames | 121 (≈ 5 s @ 24 fps) |
- | Training data | 230 video samples × 10 dataset_repeat |
- | Training | **Stage 1**: 5 epochs @ 49 frames (~22 h) → **Stage 2**: 2 epochs @ 121 frames (~30 h) |
- | Optimizer | AdamW lr=1e-5, ZeRO-3 + CPU offload, grad-accum 8 |
  | Hardware | 8 × NVIDIA H100 80 GB |
 
  ## Repository contents
 
  ```
  .
- ├── README.md (this file)
- ├── requirements.txt (pip dependencies)
- ├── inference_example.py (runnable end-to-end inference)
  ├── vitex_14b.safetensors (8 GB — trained adapter weights)
- ├── diffsynth/ (3 MB — bundled inference library)
- │   ├── pipelines/
- │   ├── models/
- │   ├── core/
- │   └── ...
- └── base_model/ (70 GB — the underlying frozen base model)
-     ├── config.json
      ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
      ├── models_t5_umt5-xxl-enc-bf16.pth
      ├── Wan2.1_VAE.pth
-     └── google/umt5-xxl/... (T5 tokenizer)
  ```
 
  ## Inputs
 
- For each video to edit, the model needs four things:
-
  | Input | Format | Description |
  |---|---|---|
- | `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing text to replace |
- | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
- | `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (any font; black bg + white glyphs is fine) |
- | `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |
-
- The model outputs a video with the masked region replaced by the target text,
- matching the original style.
-
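For illustration, a minimal sketch of loading the three video inputs as frame arrays with OpenCV (not part of the repository's code; `load_frames` is an assumed helper name, and `inference_example.py` contains the actual loading and saving helpers):

```python
import cv2

def load_frames(path, grayscale=False, expected_frames=121):
    """Read an MP4 into a list of uint8 frames (RGB, or single-channel if grayscale)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()          # OpenCV returns frames in BGR order
        if not ok:
            break
        code = cv2.COLOR_BGR2GRAY if grayscale else cv2.COLOR_BGR2RGB
        frames.append(cv2.cvtColor(frame, code))
    cap.release()
    assert len(frames) == expected_frames, f"expected {expected_frames} frames, got {len(frames)}"
    return frames

vace_video      = load_frames("source.mp4")                # 121 frames, 720 x 1280 x 3
vace_video_mask = load_frames("mask.mp4", grayscale=True)  # 121 frames, 720 x 1280
glyph_video     = load_frames("target_glyph.mp4")
```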
- ## Architecture
-
- Built on top of a frozen 40-layer DiT video diffusion backbone (the `base_model/`)
- with 8 attached VACE blocks (at layers 0, 5, 10, 15, 20, 25, 30, 35).
- Two new components are introduced and trained from scratch:
-
- ```
- target text → render → glyph_video
-                   ↓
-           Wan VAE Encoder     ← shared with main video latent
-                   ↓
-           GlyphEncoder        ← Conv3D patch embed + cross-attn pool to 64 tokens
-                   ↓
-           glyph tokens (64 × 5120)
-                   ↓
- ┌────────────────────────────┐
- │ for each VACE block (×8):  │
- │   Self-Attn (fine-tuned)   │
- │        ↓                   │
- │   Text Cross-Attn (T5)     │
- │        ↓                   │
- │   FFN                      │
- │        ↓                   │
- │  ┌──────────────────────┐  │
- │  │ ConditionCrossAttn   │ ← K/V from glyph tokens (zero-init at start)
- │  └──────────────────────┘  │
- │        ↓ + residual        │
- │   after_proj → c_skip      │
- └────────────────────────────┘
- ```
-
- The VACE conditioning input (VCU) preserves the **original masked region's
- pixels** in the `reactive` channel so the model can perceive the original
- text style:
-
- ```
- inactive = VAE(video × (1 − mask))            # context outside mask (other text, scene)
- reactive = VAE(video × mask)                  # original glyphs inside mask (style cue)
- mask     = downsample(mask)
- VCU      = concat(inactive, reactive, mask)   # 96 channels → VACE blocks
- ```
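To make the VCU pseudocode concrete, here is a minimal PyTorch sketch (an editorial illustration, not the repository's implementation; `vae_encode` stands in for the Wan VAE encoder, and the real code packs the mask into more channels to reach the 96-channel input):

```python
import torch
import torch.nn.functional as F

def build_vcu(video, mask, vae_encode):
    """video: (B, 3, T, H, W) in [-1, 1]; mask: (B, 1, T, H, W) with 1 = region to replace."""
    inactive = vae_encode(video * (1 - mask))  # scene context outside the mask
    reactive = vae_encode(video * mask)        # original glyphs inside the mask (style cue)
    # Resize the mask to the latent grid so everything can be concatenated channel-wise.
    mask_lat = F.interpolate(mask, size=inactive.shape[-3:], mode="nearest")
    return torch.cat([inactive, reactive, mask_lat], dim=1)  # conditioning fed to the VACE blocks
```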
-
- `ConditionCrossAttention.o` and `GlyphEncoder.out_proj` are both
- **zero-initialized**, so training starts from the pretrained behaviour and
- gradually learns to incorporate the glyph signal — analogous to the zero-conv
- trick in ControlNet.
 
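A short sketch of that zero-init pattern, using assumed layer shapes and a stand-in attention module (the repository's actual `ConditionCrossAttention` may differ):

```python
import torch.nn as nn

class ConditionCrossAttention(nn.Module):
    """Cross-attention whose output projection starts at zero, so the block
    initially contributes nothing (the ControlNet-style zero-init trick)."""
    def __init__(self, dim=5120, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.o = nn.Linear(dim, dim)
        nn.init.zeros_(self.o.weight)  # zero-initialized output projection
        nn.init.zeros_(self.o.bias)

    def forward(self, x, glyph_tokens):
        out, _ = self.attn(x, glyph_tokens, glyph_tokens)
        return x + self.o(out)         # residual keeps the pretrained behaviour at step 0
```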
  ## Installation
 
  ```bash
- # 1. Download this whole repository (~78 GB; needs git-lfs)
  git lfs install
  git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
  cd ViTeX-14B
-
- # 2. Set up a fresh Python env and install the standard PyPI deps
  conda create -n vitex python=3.12 -y
  conda activate vitex
  pip install -r requirements.txt
  ```
 
- Hardware requirements:
- - 1 × NVIDIA GPU with **≥ 80 GB VRAM** (H100 / A100 80 GB)
- - ~70 GB peak VRAM at 720 × 1280 × 121 frames
- - ~250 GB CPU RAM recommended (DiT weights + offloads during loading)
- - ~90 GB free disk for repo + workspace
 
  ## Usage
 
- End-to-end inference with the provided script:
-
  ```bash
  python inference_example.py \
      --vace_video path/to/source.mp4 \
      --vace_mask path/to/mask.mp4 \
      --glyph_video path/to/target_glyph.mp4 \
-     --prompt "Change the sign to read 'HILTON'" \
      --output out.mp4
  ```
 
- The script automatically uses the bundled `base_model/` directory and the
- `vitex_14b.safetensors` weights — no further downloads needed.
-
- Programmatic use:
-
- ```python
- import sys, os
- sys.path.insert(0, ".")  # so `import diffsynth` resolves to bundled lib
- import torch, glob
- from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
- from diffsynth.core import load_state_dict
-
- base_dir = "./base_model"
- diffusion_shards = sorted(glob.glob(f"{base_dir}/diffusion_pytorch_model-*.safetensors"))
-
- pipe = WanVideoPipeline.from_pretrained(
-     torch_dtype=torch.bfloat16,
-     device="cuda:0",
-     model_configs=[
-         ModelConfig(path=diffusion_shards),
-         ModelConfig(path=f"{base_dir}/models_t5_umt5-xxl-enc-bf16.pth"),
-         ModelConfig(path=f"{base_dir}/Wan2.1_VAE.pth"),
-     ],
-     tokenizer_config=ModelConfig(path=f"{base_dir}/google/umt5-xxl"),
-     redirect_common_files=False,
- )
- pipe.vace.load_state_dict(load_state_dict("./vitex_14b.safetensors"), strict=False)
-
- # ... feed in vace_video / vace_video_mask / glyph_video / prompt ...
- ```
-
- See `inference_example.py` for a complete reference, including video loading
- and saving helpers.
-
- ## Data preparation
-
- To produce `glyph_video` from a target text string:
-
- 1. Detect / track the text-region bounding box per frame.
- 2. Render the target string with `cv2.putText` or PIL inside the box on a
-    black background; export as MP4 with the same frame count and resolution
-    as the source (see the sketch after this section).
-
- `vace_video_mask` is a binary per-frame mask of the text region (1 = replace);
- typically a tight, slightly dilated box around the tracked region.
-
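A minimal sketch of steps 1-2 plus the mask, assuming per-frame boxes have already been tracked; the helper name and font path are illustrative, and the resulting frame lists still need to be written out as MP4s at the source clip's frame rate:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_and_mask(boxes, text, size=(1280, 720), font_path="DejaVuSans.ttf"):
    """boxes: one (x0, y0, x1, y1) per frame. Returns per-frame glyph images
    (white text on black) and binary masks (255 = region to replace)."""
    glyph_frames, mask_frames = [], []
    for (x0, y0, x1, y1) in boxes:
        glyph = Image.new("RGB", size, "black")
        draw = ImageDraw.Draw(glyph)
        font = ImageFont.truetype(font_path, size=max(8, int((y1 - y0) * 0.8)))
        draw.text((x0, y0), text, fill="white", font=font)
        mask = np.zeros((size[1], size[0]), dtype=np.uint8)
        mask[y0:y1, x0:x1] = 255
        glyph_frames.append(np.array(glyph))
        mask_frames.append(mask)
    return glyph_frames, mask_frames
```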
- ## Training summary
-
- | Stage | Frames | Resolution | Epochs | Wall time | Notes |
- |---|---|---|---|---|---|
- | 1 | 49 | 720 × 1280 | 5 | ~22 h | bootstrap on shorter clips |
- | 2 | 121 | 720 × 1280 | 2 | ~30 h | fine-tune at full length, init from Stage 1 epoch-4 |
-
- - 230 video samples, `dataset_repeat=10` → 288 optimizer steps per epoch
- - AdamW, lr 1e-5, weight_decay 1e-2, no LR schedule
- - Gradient accumulation 8, effective batch of 64 micro-batches (8 GPUs × grad-accum 8)
- - DeepSpeed ZeRO-3 with parameter + optimizer state CPU offload
- - `--use_gradient_checkpointing_offload` (manual activation offload)
- - VACE module fully trained (4.02 B params); base DiT, T5, Wan VAE all frozen
 
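For orientation only, a sketch of a DeepSpeed configuration that would match these bullets (ZeRO-3, parameter and optimizer CPU offload, gradient accumulation 8, AdamW); the micro-batch size is assumed and this is not the repository's actual training config:

```python
# Illustrative DeepSpeed config mirroring the training bullets above.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,       # assumed; not stated in the README
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "weight_decay": 1e-2},
    },
}
```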
  ## Limitations
 
- - Trained on 230 samples — coverage of artistic fonts, complex backgrounds
-   and non-Latin scripts is limited.
- - Best on planar text (signs, posters); fast-moving or highly distorted text
-   may degrade.
  - Inference requires the full 14 B base; no quantized variant released.
- - Single-GPU 80 GB inference assumed; multi-node sharding scripts not bundled.
 
  ## Citation
 
@@ -227,5 +96,4 @@ typically a tight, slightly dilated box around the tracked region.
 
  ## License
 
- Apache-2.0. See `LICENSE.txt` in `base_model/` for the upstream base model
- license; the same license applies to the trained weights and bundled code.
 
 
  # ViTeX-14B
 
+ ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.
 
+ This repository is fully self-contained — it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.
+
+ ## Specs
 
  | | |
  |---|---|
+ | Trainable parameters | 4.02 B (VACE blocks + new modules) |
+ | New modules added | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
  | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
  | Resolution | 720 × 1280 |
+ | Frames | 121 (about 5 s at 24 fps) |
+ | Training | Stage 1: 5 epochs at 49 frames (22 h); Stage 2: 2 epochs at 121 frames (30 h) |
  | Hardware | 8 × NVIDIA H100 80 GB |
 
  ## Repository contents
 
  ```
  .
+ ├── README.md
+ ├── requirements.txt
+ ├── inference_example.py
  ├── vitex_14b.safetensors (8 GB — trained adapter weights)
+ ├── diffsynth/ (bundled inference library)
+ └── base_model/ (70 GB — frozen base model files)
      ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
      ├── models_t5_umt5-xxl-enc-bf16.pth
      ├── Wan2.1_VAE.pth
+     └── google/umt5-xxl/ (T5 tokenizer)
  ```
 
  ## Inputs
 
  | Input | Format | Description |
  |---|---|---|
+ | `vace_video` | RGB video, 121 frames at 720 × 1280 | Original video containing text to replace |
+ | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: 1 = text region to replace, 0 = preserve |
+ | `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the target text placed where the mask is |
+ | `prompt` | text string | The target text itself, e.g. `HILTON` |
 
  ## Installation
 
  ```bash
  git lfs install
  git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
  cd ViTeX-14B
  conda create -n vitex python=3.12 -y
  conda activate vitex
  pip install -r requirements.txt
  ```
 
+ Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB VRAM at 720 × 1280 × 121 frames.
 
  ## Usage
 
  ```bash
  python inference_example.py \
      --vace_video path/to/source.mp4 \
      --vace_mask path/to/mask.mp4 \
      --glyph_video path/to/target_glyph.mp4 \
+     --prompt "HILTON" \
      --output out.mp4
  ```
 
+ The script automatically uses the bundled `base_model/` and `vitex_14b.safetensors` — no extra downloads.
 
  ## Limitations
 
+ - Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
+ - Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
  - Inference requires the full 14 B base; no quantized variant released.
 
  ## Citation
 
  ## License
 
+ Apache-2.0. See `base_model/LICENSE.txt` for the upstream base model license.