---
license: apache-2.0
base_model: Wan-AI/Wan2.1-VACE-14B
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
- wan
- vace
---

# ViTeX-14B

**Vi**deo **Tex**t editing model based on Wan2.1-VACE-14B. It replaces the text
inside a user-provided mask region while preserving the original visual style
(font, color, stroke, shadow, perspective) and the surrounding scene.

| | |
|---|---|
| Base model | [Wan-AI/Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B) |
| Trainable parameters | **4.02 B** (VACE blocks + new modules) |
| New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
| Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
| Resolution | 720 × 1280 |
| Frames | 121 (≈ 5 s @ 24 fps) |
| Training data | 230 video samples × 10 `dataset_repeat` |
| Training | 2 epochs (576 optimizer steps), DeepSpeed ZeRO-3 + CPU offload |
| Hardware | 8 × NVIDIA H100 80 GB |

## Inputs

For each video to edit, the model needs four things:

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing the text to replace |
| `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (use any font; white glyphs on a black background are fine, see [data prep](#data-preparation)) |
| `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |

The model outputs a video in which the masked region is replaced by the target
text, rendered in the original style.

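As a quick pre-flight check before running the pipeline, the sketch below
verifies that the three video inputs agree in length and resolution. The
frame-list convention and the 0/255 mask encoding are assumptions for
illustration, not part of the released API:

```python
# Hypothetical sanity check for the three video inputs; assumes each video is
# a list of H x W x 3 (or H x W) uint8 numpy frames, mask binary as 0/255.
import numpy as np

def check_inputs(vace_video, vace_video_mask, glyph_video,
                 num_frames=121, hw=(720, 1280)):
    for name, vid in [("vace_video", vace_video),
                      ("vace_video_mask", vace_video_mask),
                      ("glyph_video", glyph_video)]:
        assert len(vid) == num_frames, f"{name}: expected {num_frames} frames"
        assert vid[0].shape[:2] == hw, f"{name}: expected {hw}, got {vid[0].shape[:2]}"
    values = set(np.unique(np.asarray(vace_video_mask)).tolist())
    assert values <= {0, 255}, "mask must be binary (0 = preserve, 255 = replace)"
```
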
## Architecture

Built on top of frozen Wan2.1-VACE-14B (40-layer DiT + 8 VACE blocks).
Two new components are added (both trained from scratch):

```
target text → render → glyph_video
        ↓
Wan VAE Encoder          ← shared with the main video latent
        ↓
GlyphEncoder             ← Conv3D patch embed + cross-attn pool to 64 tokens
        ↓
glyph tokens (64 × 5120)
        ↓
┌──────────────────────────────┐
│ for each VACE block (×8):    │
│   Self-Attn (frozen-init,    │
│              fine-tuned)     │
│        ↓                     │
│   Text Cross-Attn (T5)       │
│        ↓                     │
│   FFN                        │
│        ↓                     │
│   ┌────────────────────────┐ │
│   │ ConditionCrossAttn     │ ← K/V from glyph tokens (zero-init at start)
│   └────────────────────────┘ │
│        ↓  + residual         │
│   after_proj → c_skip        │
└──────────────────────────────┘
```

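To make the two new modules concrete, here is a minimal PyTorch sketch
consistent with the diagram above (64 glyph tokens, hidden size 5120,
zero-initialized output). Class names, head count, patch size, and the
16-channel latent input are illustrative assumptions, not the released
implementation:

```python
import torch
import torch.nn as nn

class GlyphEncoderSketch(nn.Module):
    """Illustrative only: Conv3D patch embedding over the glyph video's VAE
    latent, then cross-attention pooling into 64 learned query tokens."""
    def __init__(self, latent_channels=16, dim=5120, num_tokens=64, num_heads=40):
        super().__init__()
        # 3D patch embed: (B, C, T, H, W) latent -> sequence of patch tokens
        self.patch_embed = nn.Conv3d(latent_channels, dim,
                                     kernel_size=(1, 2, 2), stride=(1, 2, 2))
        # 64 learned queries pool the variable-length patch sequence
        self.queries = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, glyph_latent):          # (B, C, T, H, W) from the Wan VAE
        x = self.patch_embed(glyph_latent)    # (B, dim, T, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)      # (B, N_patches, dim)
        q = self.queries.expand(x.shape[0], -1, -1)
        tokens, _ = self.pool(q, x, x)        # (B, 64, dim)
        return self.norm(tokens)

class ConditionCrossAttentionSketch(nn.Module):
    """Per-VACE-block cross-attention: queries from the block's hidden states,
    K/V from the glyph tokens. The output projection is zero-initialized so
    the module is a no-op at the start of training."""
    def __init__(self, dim=5120, num_heads=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, hidden, glyph_tokens):  # (B, L, dim), (B, 64, dim)
        attended, _ = self.attn(hidden, glyph_tokens, glyph_tokens)
        return hidden + self.out(attended)    # residual; identity at init
```
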
The VACE conditioning input (VCU) preserves the **original masked region's
pixels** in the `reactive` channel:

```
inactive = VAE(video × (1 − mask))            # context outside the mask
reactive = VAE(video × mask)                  # original glyphs inside the mask (style cue)
mask     = downsample(mask)
VCU      = concat(inactive, reactive, mask)   # 96 channels
```

This lets the model see the original text's color/font/stroke and learn to
re-render the new content in the same style.

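A runnable version of that pseudocode, assuming a `vae_encode` callable that
maps pixel video to a 16-channel latent; the exact mask rearrangement that
brings the concat to the stated 96 channels is implementation-specific, so it
is only noted in a comment:

```python
import torch
import torch.nn.functional as F

def build_vcu(video, mask, vae_encode):
    """Sketch of the VCU assembly above. Interfaces are assumptions:
    video (B, 3, T, H, W) in [-1, 1]; mask (B, 1, T, H, W) with 1 = replace."""
    inactive = vae_encode(video * (1 - mask))  # context latent outside the mask
    reactive = vae_encode(video * mask)        # original glyph latent (style cue)
    # Downsample the mask to the latent grid. The released code additionally
    # folds spatial blocks of the mask into channels to reach 96 channels total.
    m = F.interpolate(mask, size=inactive.shape[-3:], mode="nearest")
    return torch.cat([inactive, reactive, m], dim=1)
```
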
## Installation

The model requires a modified DiffSynth-Studio repo that introduces the
GlyphEncoder and ConditionCrossAttention modules.

```bash
git clone https://github.com/<your-org>/DiffSynth-Studio-TextVACE
cd DiffSynth-Studio-TextVACE
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -e .
pip install accelerate==1.13.0
```

Requirements: `torch>=2.7.0+cu128` and an NVIDIA GPU with ≥ 80 GB VRAM
(H100 / A100 80 GB). Inference uses ~70 GB VRAM at 720 × 1280 × 121 frames.

## Usage

```python
from huggingface_hub import snapshot_download
import torch
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.core import load_state_dict
import glob, os

# 1. Download the base model and this model
base_dir = snapshot_download("Wan-AI/Wan2.1-VACE-14B")
vitex_dir = snapshot_download("ViTeX-Bench/ViTeX-14B")
ckpt_path = os.path.join(vitex_dir, "vitex_14b.safetensors")

# 2. Build the pipeline
diffusion_shards = sorted(glob.glob(os.path.join(base_dir, "diffusion_pytorch_model-*.safetensors")))
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda:0",
    model_configs=[
        ModelConfig(path=diffusion_shards),
        ModelConfig(path=os.path.join(base_dir, "models_t5_umt5-xxl-enc-bf16.pth")),
        ModelConfig(path=os.path.join(base_dir, "Wan2.1_VAE.pth")),
    ],
    tokenizer_config=ModelConfig(path=os.path.join(base_dir, "google/umt5-xxl")),
    redirect_common_files=False,
)

# 3. Load the ViTeX trained weights on top of the base VACE module
pipe.vace.load_state_dict(load_state_dict(ckpt_path), strict=False)

# 4. Prepare inputs (see inference_example.py for the video loading helper)
from inference_example import load_video_frames, save_video
vace_video = load_video_frames("input.mp4", target_frames=121, resize=(720, 1280))
vace_mask = load_video_frames("input_mask.mp4", target_frames=121, resize=(720, 1280))
glyph = load_video_frames("glyph.mp4", target_frames=121, resize=(720, 1280))

# 5. Run
out_frames = pipe(
    prompt="Change the sign to read 'HILTON'",
    negative_prompt="",
    vace_video=vace_video,
    vace_video_mask=vace_mask,
    glyph_video=glyph,
    seed=42, height=720, width=1280, num_frames=121,
    cfg_scale=5.0, num_inference_steps=50, tiled=True,
)
save_video(out_frames, "output.mp4")
```

A complete runnable script is provided as `inference_example.py` in this repo.

## Data preparation

To produce `glyph_video` from a target text string:

1. Track the text-region bounding box per frame (we use TrackAnything / ROMP).
2. Render the target string with `cv2.putText` or PIL inside the box on a black background.
3. Save as MP4 with the same frame count and resolution as the source video.

`vace_video_mask` is a binary per-frame mask of the text region (1 = replace).
You can produce it from the same tracking output by slightly dilating the
tight bounding box.

The repo's `scripts/render_glyph_tracked.py` and `scripts/prepare_textvace_data.py`
provide reference implementations; a simplified sketch follows below.

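For orientation, a minimal PIL-based per-frame sketch of steps 2 and 3. The
box format, font path, dilation amount, and the 80 % fill heuristic are
illustrative assumptions; the repo scripts are the reference implementation:

```python
# Hypothetical per-frame renderers for glyph_video and vace_video_mask.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_frame(text, box, size=(1280, 720), font_path="DejaVuSans.ttf"):
    """White target text on black, placed inside the tracked box (x0, y0, x1, y1)."""
    frame = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(frame)
    x0, y0, x1, y1 = box
    # crude fit: scale the font height to ~80% of the box height
    font = ImageFont.truetype(font_path, int((y1 - y0) * 0.8))
    draw.text((x0, y0), text, fill="white", font=font)
    return np.asarray(frame)

def render_mask_frame(box, size=(1280, 720), dilate=4):
    """Binary mask: 255 inside the slightly dilated box, 0 elsewhere."""
    mask = Image.new("L", size, 0)
    x0, y0, x1, y1 = box
    ImageDraw.Draw(mask).rectangle((x0 - dilate, y0 - dilate,
                                    x1 + dilate, y1 + dilate), fill=255)
    return np.asarray(mask)

# Render one frame per tracked box, stack, and encode to MP4 at the source fps.
```
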
## Training details

- Stage 1 (49 frames @ 720P, 5 epochs, ~22 h): bootstrap on shorter clips
- Stage 2 (121 frames @ 720P, 2 epochs, ~30 h): fine-tune at full length
- Optimizer: AdamW, lr=1e-5, weight_decay=1e-2, no LR schedule
- Gradient accumulation: 8; effective batch size = 8 GPUs × 8 steps = 64
- DeepSpeed ZeRO-3 with both parameter and optimizer-state CPU offload (config sketched below)
- Manual activation offload + `--use_gradient_checkpointing_offload`
- VACE module fully trained; main DiT, T5, and VAE frozen

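A hypothetical DeepSpeed configuration matching those bullets (the actual
launch files are not published; the micro-batch size and bf16 flag are
assumptions):

```python
# Illustrative DeepSpeed ZeRO-3 config; micro-batch size and bf16 are
# assumptions, not published values.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},      # parameter CPU offload
        "offload_optimizer": {"device": "cpu"},  # optimizer-state CPU offload
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "weight_decay": 1e-2},
    },
    "gradient_accumulation_steps": 8,
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
}
```
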
## Limitations

- Trained on only 230 samples, so coverage of artistic fonts, complex
  backgrounds, and non-Latin scripts is limited.
- Best on planar text (signs, posters); fast-moving or highly distorted text
  may degrade.
- Inference requires the full 14 B base model; no quantized variants are
  released.
- Tested only on single-node 8 × H100 80 GB hardware; no multi-node sharding
  scripts are included.

## Citation

```bibtex
@misc{vitex2026,
  title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
  author = {ViTeX Team},
  year   = {2026},
  url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
}
```

## Acknowledgements

Built on top of [Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B)
by the Wan-Video team, and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).