ViTeX-Bench / ViTeX-Edit-14B

@@ -1,34 +1,56 @@
 ---
 license: apache-2.0
-base_model: Wan-AI/Wan2.1-VACE-14B
 pipeline_tag: text-to-video
 tags:
   - video-editing
   - text-editing
   - text-replacement
   - diffusion
-  - wan
-  - vace
 ---
 # ViTeX-14B
-**Vi**deo **Tex**t editing model based on Wan2.1-VACE-14B. Replaces text content
-inside a user-provided mask region while preserving the original visual style
-(font, color, stroke, shadow, perspective) and the surrounding scene.
 |  |  |
 |---|---|
-| Base model | [Wan-AI/Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B) |
 | Trainable parameters | **4.02 B** (VACE blocks + new modules) |
 | New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
 | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
 | Resolution | 720 × 1280 |
 | Frames | 121 (≈ 5 s @ 24 fps) |
 | Training data | 230 video samples × 10 dataset_repeat |
-| Training | 2 epochs (576 optimizer steps), DeepSpeed ZeRO-3 + CPU offload |
 | Hardware | 8 × NVIDIA H100 80 GB |
 ## Inputs
 For each video to edit, the model needs four things:
@@ -37,7 +59,7 @@ For each video to edit, the model needs four things:
 |---|---|---|
 | `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing text to replace |
 | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
-| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (use any font; black bg + white glyphs is fine — see [data prep](#data-preparation)) |
 | `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |
 The model outputs a video with the masked region replaced by the target text,
@@ -45,8 +67,9 @@ matching the original style.
 ## Architecture
-Built on top of frozen Wan2.1-VACE-14B (40-layer DiT + 8 VACE blocks).
-Two new components are added (both trained from scratch):
 ```
 target text → render → glyph_video
@@ -59,8 +82,7 @@ target text → render → glyph_video
                           ↓
             ┌─────────────────────────────┐
             │  for each VACE block (×8):  │
-            │    Self-Attn (frozen-init,  │
-            │             fine-tuned)     │
             │       ↓                     │
             │    Text Cross-Attn (T5)     │
             │       ↓                     │
@@ -75,129 +97,135 @@ target text → render → glyph_video
 ```
 The VACE conditioning input (VCU) preserves the **original masked region's
-pixels** in the `reactive` channel:
 ```
-inactive = VAE(video × (1 − mask))   # context outside mask
 reactive = VAE(video × mask)         # original glyphs inside mask (style cue)
 mask     = downsample(mask)
-VCU      = concat(inactive, reactive, mask)   # 96 channels
 ```
-This lets the model see the original text's color/font/stroke and learn to
-re-render the new content in the same style.
-## Installation
-The model uses the modified DiffSynth-Studio repo that introduces the GlyphEncoder
-and ConditionCrossAttention modules.
 ```bash
-git clone https://github.com/<your-org>/DiffSynth-Studio-TextVACE
-cd DiffSynth-Studio-TextVACE
-conda create -n vitex python=3.12 -y && conda activate vitex
-pip install -e .
-pip install accelerate==1.13.0
 ```
-Required: `torch>=2.7.0+cu128`, NVIDIA GPU with ≥ 80 GB VRAM (H100 / A100 80GB).
-Inference uses ~ 70 GB VRAM at 720 × 1280 × 121 frames.
 ## Usage
 ```python
-from huggingface_hub import snapshot_download
-import torch
 from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
 from diffsynth.core import load_state_dict
-import glob, os
-# 1. Download base + this model
-base_dir   = snapshot_download("Wan-AI/Wan2.1-VACE-14B")
-vitex_dir  = snapshot_download("ViTeX-Bench/ViTeX-14B")
-ckpt_path  = os.path.join(vitex_dir, "vitex_14b.safetensors")
-# 2. Build pipeline
-diffusion_shards = sorted(glob.glob(os.path.join(base_dir, "diffusion_pytorch_model-*.safetensors")))
 pipe = WanVideoPipeline.from_pretrained(
     torch_dtype=torch.bfloat16,
     device="cuda:0",
     model_configs=[
         ModelConfig(path=diffusion_shards),
-        ModelConfig(path=os.path.join(base_dir, "models_t5_umt5-xxl-enc-bf16.pth")),
-        ModelConfig(path=os.path.join(base_dir, "Wan2.1_VAE.pth")),
     ],
-    tokenizer_config=ModelConfig(path=os.path.join(base_dir, "google/umt5-xxl")),
     redirect_common_files=False,
 )
-# 3. Load ViTeX trained weights on top of base VACE
-pipe.vace.load_state_dict(load_state_dict(ckpt_path), strict=False)
-# 4. Prepare inputs (see inference_example.py for video loading helper)
-from inference_example import load_video_frames, save_video
-vace_video = load_video_frames("input.mp4",         target_frames=121, resize=(720, 1280))
-vace_mask  = load_video_frames("input_mask.mp4",    target_frames=121, resize=(720, 1280))
-glyph      = load_video_frames("glyph.mp4",         target_frames=121, resize=(720, 1280))
-# 5. Run
-out_frames = pipe(
-    prompt="Change the sign to read 'HILTON'",
-    negative_prompt="",
-    vace_video=vace_video,
-    vace_video_mask=vace_mask,
-    glyph_video=glyph,
-    seed=42, height=720, width=1280, num_frames=121,
-    cfg_scale=5.0, num_inference_steps=50, tiled=True,
-)
-save_video(out_frames, "output.mp4")
 ```
-A complete runnable script is provided as `inference_example.py` in this repo.
 ## Data preparation
 To produce `glyph_video` from a target text string:
-1. Track text-region bounding box per frame (we use TrackAnything / ROMP).
-2. Render the target string with `cv2.putText` or PIL inside the box on a black background.
-3. Save as MP4 with the same frame count and resolution as the source video.
-`vace_video_mask` is a binary per-frame mask of the text region (1 = replace).
-You can produce it from the same tracking + a tight bounding box dilation.
-The repo's `scripts/render_glyph_tracked.py` and `scripts/prepare_textvace_data.py`
-provide reference implementations.
-## Training details
-- Stage 1 (49 frames @ 720P, 5 epochs, ~22 h): bootstrap on shorter clips
-- Stage 2 (121 frames @ 720P, 2 epochs, ~30 h): fine-tune at full length
-- Optimizer: AdamW, lr=1e-5, weight_decay=1e-2, no LR schedule
-- Grad accumulation: 8, effective batch = 8 GPUs × 8 = 64 micro-batches
-- DeepSpeed ZeRO-3 with both parameter and optimizer state CPU offload
-- Manual activation offload + `--use_gradient_checkpointing_offload`
-- VACE module fully trained; DiT main + T5 + VAE frozen
 ## Limitations
-- Trained on 230 samples — coverage of artistic fonts, complex backgrounds,
   and non-Latin scripts is limited.
 - Best on planar text (signs, posters); fast-moving or highly distorted text
   may degrade.
-- Inference requires the full 14 B base model — no quantized variants released.
-- Single 8 × H100 80 GB inference; no multi-node sharding scripts included.
 ## Citation
 ```bibtex
 @misc{vitex2026,
   title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
-  author = {ViTeX Team},
   year   = {2026},
   url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
 }
 ```
-## Acknowledgements
-Built on top of [Wan2.1-VACE-14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B)
-by the Wan-Video team, and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).

 ---
 license: apache-2.0
 pipeline_tag: text-to-video
 tags:
   - video-editing
   - text-editing
   - text-replacement
   - diffusion
 ---
 # ViTeX-14B
+**Vi**deo **Tex**t editing model. Replaces text content inside a user-provided
+mask region of a video while preserving the original visual style (font, color,
+stroke, shadow, perspective) and the surrounding scene.
+This repository is **fully self-contained** — it bundles the trained weights,
+the full base model required for inference, and all custom code needed to run
+it. No external code repositories or third-party model downloads are required.
 |  |  |
 |---|---|
 | Trainable parameters | **4.02 B** (VACE blocks + new modules) |
 | New modules added | **971 M** (GlyphEncoder + 8 × ConditionCrossAttention) |
 | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
 | Resolution | 720 × 1280 |
 | Frames | 121 (≈ 5 s @ 24 fps) |
 | Training data | 230 video samples × 10 dataset_repeat |
+| Training | **Stage 1**: 5 epochs @ 49 frames (~22 h) → **Stage 2**: 2 epochs @ 121 frames (~30 h) |
+| Optimizer | AdamW lr=1e-5, ZeRO-3 + CPU offload, grad-accum 8 |
 | Hardware | 8 × NVIDIA H100 80 GB |
+## Repository contents
+```
+.
+├── README.md                       (this file)
+├── requirements.txt                (pip dependencies)
+├── inference_example.py            (runnable end-to-end inference)
+├── vitex_14b.safetensors           (8 GB — trained adapter weights)
+├── diffsynth/                      (3 MB — bundled inference library)
+│   ├── pipelines/
+│   ├── models/
+│   ├── core/
+│   └── ...
+└── base_model/                     (70 GB — the underlying frozen base model)
+    ├── config.json
+    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
+    ├── models_t5_umt5-xxl-enc-bf16.pth
+    ├── Wan2.1_VAE.pth
+    └── google/umt5-xxl/...         (T5 tokenizer)
+```
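After a ~78 GB clone, it can be worth confirming that every large file arrived before loading anything. The following is a minimal sketch that checks for the files named in the tree above. The helper name `missing_files` is hypothetical and is not part of the bundled code, and the tokenizer directory is left out because its contents are elided in the tree.

```python
from pathlib import Path

# Expected layout, copied from the repository tree (7 diffusion shards).
EXPECTED = (
    ["requirements.txt", "inference_example.py", "vitex_14b.safetensors"]
    + [
        f"base_model/diffusion_pytorch_model-{i:05d}-of-00007.safetensors"
        for i in range(1, 8)
    ]
    + [
        "base_model/config.json",
        "base_model/models_t5_umt5-xxl-enc-bf16.pth",
        "base_model/Wan2.1_VAE.pth",
    ]
)


def missing_files(repo_root="."):
    """Return the expected relative paths that are not files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("all files present" if not missing else f"missing: {missing}")
```

A file that exists only as a git-lfs pointer would still pass this check, so comparing file sizes against the tree is a sensible second step.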
 ## Inputs
 For each video to edit, the model needs four things:
 |---|---|---|
 | `vace_video` | RGB video, 121 frames @ 720 × 1280 | The original video containing text to replace |
 | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: `1` = text region to replace, `0` = preserve |
+| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the **target text** placed where the mask is (any font; black bg + white glyphs is fine) |
 | `prompt` | text string | Optional natural-language description (e.g. "Change the storefront sign to read 'Hilton'") |
 The model outputs a video with the masked region replaced by the target text,
 matching the original style.
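A pre-flight check along the following lines can catch shape mismatches between the three clips before a long inference run. This is only a sketch: it assumes the clips have already been decoded into NumPy arrays in `(frames, height, width, channels)` layout, which the card does not specify, and `check_inputs` is a hypothetical helper rather than part of the bundled library.

```python
import numpy as np


def check_inputs(vace_video, vace_video_mask, glyph_video,
                 frames=121, height=720, width=1280):
    """Raise ValueError if the clips disagree with the documented shapes.

    Assumed layouts (not stated in the card):
      vace_video, glyph_video: (frames, height, width, 3)
      vace_video_mask:         (frames, height, width), values 0 or 1
    """
    rgb_shape = (frames, height, width, 3)
    if vace_video.shape != rgb_shape:
        raise ValueError(f"vace_video has shape {vace_video.shape}, expected {rgb_shape}")
    if glyph_video.shape != rgb_shape:
        raise ValueError(f"glyph_video has shape {glyph_video.shape}, expected {rgb_shape}")
    if vace_video_mask.shape != rgb_shape[:3]:
        raise ValueError(f"mask has shape {vace_video_mask.shape}, expected {rgb_shape[:3]}")
    if not np.isin(vace_video_mask, (0, 1)).all():
        raise ValueError("mask must be binary: 1 = replace, 0 = preserve")
```

The default sizes match the table above; smaller values can be passed when testing on short clips.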
 ## Architecture
+Built on top of a frozen 40-layer DiT video diffusion backbone (stored in
+`base_model/`) with 8 attached VACE blocks (at layers 0, 5, 10, 15, 20, 25, 30, 35).
+Two new components are introduced and trained from scratch:
 ```
 target text → render → glyph_video
                           ↓
             ┌─────────────────────────────┐
             │  for each VACE block (×8):  │
+            │    Self-Attn  (fine-tuned)  │
             │       ↓                     │
             │    Text Cross-Attn (T5)     │
             │       ↓                     │
 ```
 The VACE conditioning input (VCU) preserves the **original masked region's
+pixels** in the `reactive` channel so the model can perceive the original
+text style:
 ```
+inactive = VAE(video × (1 − mask))   # context outside mask (other text, scene)
 reactive = VAE(video × mask)         # original glyphs inside mask (style cue)
 mask     = downsample(mask)
+VCU      = concat(inactive, reactive, mask)   # 96 channels → VACE blocks
 ```
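One way to arrive at the 96-channel figure is sketched below. It assumes a 16-channel VAE latent with 4× temporal and 8× spatial compression, and a mask whose 8 × 8 pixel patches are folded into 64 channels. None of these numbers appear in this card; they are assumptions chosen because they reproduce the stated total, and `vcu_latent_shape` is a hypothetical helper.

```python
def vcu_latent_shape(num_frames, height, width,
                     latent_channels=16, t_stride=4, s_stride=8):
    """Return (channels, frames, height, width) of the VCU tensor.

    Assumptions (not from the card): the first frame is encoded on its own and
    every following group of `t_stride` frames becomes one latent frame; the
    mask is folded so each s_stride x s_stride patch becomes channels.
    """
    t = (num_frames - 1) // t_stride + 1
    h, w = height // s_stride, width // s_stride
    mask_channels = s_stride * s_stride               # 8 * 8 = 64
    channels = 2 * latent_channels + mask_channels    # inactive + reactive + mask
    return channels, t, h, w


print(vcu_latent_shape(121, 720, 1280))  # (96, 31, 90, 160)
```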
+`ConditionCrossAttention.o` and `GlyphEncoder.out_proj` are both
+**zero-initialized**, so at the start of training the model reproduces the
+pretrained behaviour and only gradually learns to incorporate the glyph
+signal, analogous to the zero-convolution trick in ControlNet.
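The effect of a zero-initialized output projection can be illustrated with a toy residual branch. The sketch assumes the glyph branch is added to the hidden state residually, which the card implies but does not spell out. The function names and the tiny vectors are illustrative only and do not mirror the real module code.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def residual_branch(hidden, glyph_feature, out_weight):
    """Return hidden + out_weight @ glyph_feature (branch added residually)."""
    update = matvec(out_weight, glyph_feature)
    return [h + u for h, u in zip(hidden, update)]


hidden = [0.5, -1.0, 2.0]
glyph_feature = [3.0, 4.0]

# Zero-initialized projection: the branch contributes nothing at step 0,
# so the block's output equals the pretrained hidden state.
zero_out = [[0.0, 0.0] for _ in hidden]
assert residual_branch(hidden, glyph_feature, zero_out) == hidden

# Once training moves the weights away from zero, the glyph signal leaks in.
trained_out = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.25]]
print(residual_branch(hidden, glyph_feature, trained_out))  # [2.0, -1.0, 3.0]
```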
+## Installation
 ```bash
+# 1. Download this whole repository (~78 GB; needs git-lfs)
+git lfs install
+git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
+cd ViTeX-14B
+# 2. Set up a fresh Python env and install the standard PyPI deps
+conda create -n vitex python=3.12 -y
+conda activate vitex
+pip install -r requirements.txt
 ```
+Hardware requirements:
+- 1 × NVIDIA GPU with **≥ 80 GB VRAM** (H100 / A100 80 GB)
+- ~ 70 GB peak VRAM at 720 × 1280 × 121 frames
+- ~ 250 GB CPU RAM recommended (DiT weights + offloads during loading)
+- ~ 90 GB free disk for repo + workspace
 ## Usage
+End-to-end inference with the provided script:
+```bash
+python inference_example.py \
+    --vace_video   path/to/source.mp4 \
+    --vace_mask    path/to/mask.mp4 \
+    --glyph_video  path/to/target_glyph.mp4 \
+    --prompt       "Change the sign to read 'HILTON'" \
+    --output       out.mp4
+```
+The script automatically uses the bundled `base_model/` directory and the
+`vitex_14b.safetensors` weights — no further downloads needed.
+Programmatic use:
 ```python
+import glob
+import sys
+sys.path.insert(0, ".")          # run from the repo root so `import diffsynth` resolves to the bundled library
+import torch
 from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
 from diffsynth.core import load_state_dict
+base_dir = "./base_model"
+diffusion_shards = sorted(glob.glob(f"{base_dir}/diffusion_pytorch_model-*.safetensors"))
 pipe = WanVideoPipeline.from_pretrained(
     torch_dtype=torch.bfloat16,
     device="cuda:0",
     model_configs=[
         ModelConfig(path=diffusion_shards),
+        ModelConfig(path=f"{base_dir}/models_t5_umt5-xxl-enc-bf16.pth"),
+        ModelConfig(path=f"{base_dir}/Wan2.1_VAE.pth"),
     ],
+    tokenizer_config=ModelConfig(path=f"{base_dir}/google/umt5-xxl"),
     redirect_common_files=False,
 )
+pipe.vace.load_state_dict(load_state_dict("./vitex_14b.safetensors"), strict=False)
+# ... feed in vace_video / vace_video_mask / glyph_video / prompt ...
 ```
+See `inference_example.py` for a complete reference, including video loading
+and saving helpers.
 ## Data preparation
 To produce `glyph_video` from a target text string:
+1. Detect / track the text-region bounding box per frame.
+2. Render the target string with `cv2.putText` or PIL inside the box on a
+   black background; export as MP4 with the same frame count and resolution
+   as the source.
+`vace_video_mask` is a binary per-frame mask of the text region (1 = replace);
+typically a tight, slightly dilated box around the tracked region.
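The mask described above can be sketched with NumPy once per-frame boxes are available. The box format `(x0, y0, x1, y1)` with exclusive upper bounds, the default dilation of 4 pixels, and the helper name `boxes_to_mask` are all assumptions made for illustration. The tracker that produces the boxes is out of scope here.

```python
import numpy as np


def boxes_to_mask(boxes, height, width, dilate=4):
    """Build a (T, H, W) uint8 mask: 1 inside each dilated box, 0 elsewhere.

    boxes: one (x0, y0, x1, y1) tuple per frame, in pixels.
    """
    mask = np.zeros((len(boxes), height, width), dtype=np.uint8)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        # Grow the box by `dilate` pixels on every side, clamped to the frame.
        xa, ya = max(x0 - dilate, 0), max(y0 - dilate, 0)
        xb, yb = min(x1 + dilate, width), min(y1 + dilate, height)
        mask[t, ya:yb, xa:xb] = 1
    return mask


# Two frames of a box drifting right by 2 pixels.
mask = boxes_to_mask([(10, 10, 20, 15), (12, 10, 22, 15)], height=40, width=60, dilate=2)
print(mask.shape, int(mask[0].sum()))  # (2, 40, 60) 126
```

Multiplying the mask by 255 before encoding gives a grayscale clip in which the replace region is white.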
+## Training summary
+| Stage | Frames | Resolution | Epochs | Wall time | Notes |
+|---|---|---|---|---|---|
+| 1 | 49 | 720 × 1280 | 5 | ~22 h | bootstrap on shorter clips |
+| 2 | 121 | 720 × 1280 | 2 | ~30 h | fine-tune at full length, init from Stage 1 epoch-4 |
+- 230 video samples, `dataset_repeat=10` → 288 optimizer steps per epoch
+- AdamW, lr 1e-5, weight_decay 1e-2, no LR schedule
+- Gradient accumulation 8, effective batch 64 micro-batches
+- DeepSpeed ZeRO-3 with parameter + optimizer state CPU offload
+- `--use_gradient_checkpointing_offload` (manual activation offload)
+- VACE module fully trained (4.02 B params); base DiT, T5, Wan VAE all frozen
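The step count in the list above can be reproduced with a few lines of arithmetic. The sketch assumes a per-GPU micro-batch of 1, which the card does not state. Under that assumption 288 is the number of iterations each GPU runs per epoch. If a step were counted only once per 8 accumulated micro-batches, the figure would be 36, and the card does not say which convention its logs use.

```python
import math


def steps_per_epoch(samples=230, dataset_repeat=10, gpus=8,
                    grad_accum=8, micro_batch_per_gpu=1):
    """Return (items per epoch, iterations per GPU, accumulated updates).

    micro_batch_per_gpu=1 is an assumption; the card does not give it.
    """
    items = samples * dataset_repeat
    iters_per_gpu = math.ceil(items / (gpus * micro_batch_per_gpu))
    accumulated_updates = math.ceil(iters_per_gpu / grad_accum)
    return items, iters_per_gpu, accumulated_updates


print(steps_per_epoch())  # (2300, 288, 36)
```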
 ## Limitations
+- Trained on 230 samples — coverage of artistic fonts, complex backgrounds
   and non-Latin scripts is limited.
 - Best on planar text (signs, posters); fast-moving or highly distorted text
   may degrade.
+- Inference requires the full 14 B base; no quantized variant released.
+- Single-GPU 80 GB inference assumed; multi-node sharding scripts not bundled.
 ## Citation
 ```bibtex
 @misc{vitex2026,
   title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
+  author = {Anonymous},
   year   = {2026},
   url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
 }
 ```
+## License
+Apache-2.0. See `LICENSE.txt` in `base_model/` for the upstream base model
+license; the same license applies to the trained weights and bundled code.