Simplify README: drop arch diagram, single-word prompt, plain text
README.md (CHANGED)

@@ -10,209 +10,78 @@ tags:

# ViTeX-14B

ViTeX is a video text editing model. It replaces text content inside a user-provided
mask region of a video while preserving the original visual style (font, color,
stroke, shadow, perspective) and the surrounding scene.

This repository is fully self-contained: it bundles the trained weights, the full
base model required for inference, and all custom code. No external code repositories
or third-party model downloads are required.

| | |
|---|---|
| Trainable parameters | 4.02 B (VACE blocks + new modules) |
| New modules added | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
| Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
| Resolution | 720 × 1280 |
| Frames | 121 (~5 s at 24 fps) |
| Training | **Stage 1**: 5 epochs @ 49 frames (~22 h) → **Stage 2**: 2 epochs @ 121 frames (~30 h) |
| Optimizer | AdamW lr=1e-5, ZeRO-3 + CPU offload, grad-accum 8 |
| Hardware | 8 × NVIDIA H100 80 GB |

## Repository contents

```
.
├── README.md
├── requirements.txt
├── inference_example.py
├── vitex_14b.safetensors      (8 GB – trained adapter weights)
├── diffsynth/                 (bundled inference library)
│   ├── models/
│   ├── core/
│   └── ...
└── base_model/                (70 GB – the underlying frozen base model)
    ├── config.json
    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    ├── Wan2.1_VAE.pth
    └── google/umt5-xxl/
```

## Inputs

For each video to edit, the model needs four things:

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB video, 121 frames at 720 × 1280 | Original video containing the text to replace |
| `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: 1 = text region to replace, 0 = preserve |
| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the target text placed where the mask is |
| `prompt` | text string | The target text to render |

The model outputs a video with the masked region replaced by the target text,
matching the original style.
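
As a rough illustration of the expected input layout (not part of the repository; `imageio` and the `[T, H, W, C]` float layout are assumptions for this sketch), the four inputs might be assembled like this:

```python
# Minimal sketch: read the three clips as float tensors in [0, 1] and check that
# they share frame count and resolution. Any reader that yields [T, H, W, C]
# uint8 frames works the same way; imageio is only used for illustration.
import imageio.v3 as iio
import numpy as np
import torch

def load_video(path: str) -> torch.Tensor:
    frames = np.asarray(iio.imread(path))          # [T, H, W, C], uint8
    return torch.from_numpy(frames).float() / 255.0

vace_video  = load_video("source.mp4")             # [121, 720, 1280, 3]
mask_video  = load_video("mask.mp4")[..., :1]      # [121, 720, 1280, 1], ~0 or ~1
glyph_video = load_video("target_glyph.mp4")       # [121, 720, 1280, 3]
prompt      = "HILTON"                             # the target text

assert vace_video.shape[:3] == mask_video.shape[:3] == glyph_video.shape[:3]
```
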
## Architecture

Built on top of a frozen 40-layer DiT video diffusion backbone (the `base_model/`)
with 8 attached VACE blocks (at layers 0, 5, 10, 15, 20, 25, 30, 35).
Two new components are introduced and trained from scratch:

```
target text → render → glyph_video
        │
   Wan VAE Encoder    ← shared with main video latent
        │
   GlyphEncoder       ← Conv3D patch embed + cross-attn pool to 64 tokens
        │
   glyph tokens (64 × 5120)
        │
┌────────────────────────────────┐
│ for each VACE block (×8):      │
│   Self-Attn (fine-tuned)       │
│        ↓                       │
│   Text Cross-Attn (T5)         │
│        ↓                       │
│   FFN                          │
│        ↓                       │
│   ┌────────────────────────┐   │
│   │ ConditionCrossAttn    │    │ ← K/V from glyph tokens (zero-init at start)
│   └────────────────────────┘   │
│        ↓ + residual            │
│   after_proj → c_skip          │
└────────────────────────────────┘
```
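
For intuition only, here is a hedged sketch of what a "Conv3D patch embed + cross-attention pool to 64 tokens" module could look like. The 64-token pool, the 5120 width, and the overall structure come from the diagram above; the class name, latent channel count, and head count are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class GlyphEncoderSketch(nn.Module):
    """Illustrative only: Conv3D patch embedding over the glyph latent, then
    cross-attention pooling of the token grid down to 64 learned queries."""

    def __init__(self, in_ch: int = 16, dim: int = 5120, num_queries: int = 64, heads: int = 8):
        super().__init__()
        self.patch_embed = nn.Conv3d(in_ch, dim, kernel_size=(1, 2, 2), stride=(1, 2, 2))
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.pool_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)   # zero-init, as described below
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, glyph_latent: torch.Tensor) -> torch.Tensor:
        # glyph_latent: [B, C, T, H, W] from the shared Wan VAE encoder
        x = self.patch_embed(glyph_latent)              # [B, dim, T, H/2, W/2]
        x = x.flatten(2).transpose(1, 2)                # [B, N, dim] token grid
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        pooled, _ = self.pool_attn(q, x, x)             # pool N tokens down to 64
        return self.out_proj(pooled)                    # glyph tokens [B, 64, dim]
```
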

The VACE conditioning input (VCU) preserves the **original masked region's
pixels** in the `reactive` channel so the model can perceive the original
text style:

```
inactive = VAE(video × (1 − mask))       # context outside mask (other text, scene)
reactive = VAE(video × mask)             # original glyphs inside mask (style cue)
mask     = downsample(mask)
VCU      = concat(inactive, reactive, mask)   # 96 channels → VACE blocks
```
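
In tensor terms, the VCU assembly above might look roughly like the following. `vae_encode` is a placeholder for the Wan VAE encoder call, and the exact packing of the downsampled mask (which brings the total to 96 channels) is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def build_vcu(video: torch.Tensor, mask: torch.Tensor, vae_encode) -> torch.Tensor:
    """Sketch of the VCU construction. video: [B, 3, T, H, W] in [0, 1];
    mask: [B, 1, T, H, W] with values in {0, 1}; vae_encode is hypothetical."""
    inactive = vae_encode(video * (1 - mask))          # scene + untouched text
    reactive = vae_encode(video * mask)                # original glyphs (style cue)
    mask_lat = F.interpolate(mask, size=inactive.shape[-3:], mode="nearest")
    return torch.cat([inactive, reactive, mask_lat], dim=1)   # channel concat
```
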

`ConditionCrossAttention.o` and `GlyphEncoder.out_proj` are both
**zero-initialized**, so training starts from the pretrained behaviour and
gradually learns to incorporate the glyph signal, analogous to the zero-conv
trick in ControlNet.

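
A matching sketch of the per-block glyph conditioning path (again illustrative: everything except the `o` projection name and the 64 glyph tokens is an assumption) shows why zero-initialising the output projection makes the block a no-op at the start of training:

```python
import torch
import torch.nn as nn

class ConditionCrossAttnSketch(nn.Module):
    """Hidden states attend to the 64 glyph tokens; the output projection `o`
    starts at zero, so the residual branch contributes nothing at step 0."""

    def __init__(self, dim: int = 5120, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.o = nn.Linear(dim, dim)
        nn.init.zeros_(self.o.weight)
        nn.init.zeros_(self.o.bias)

    def forward(self, hidden: torch.Tensor, glyph_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: [B, N, dim] VACE block activations; glyph_tokens: [B, 64, dim]
        ctx, _ = self.attn(hidden, glyph_tokens, glyph_tokens)   # K/V from glyphs
        return hidden + self.o(ctx)                              # zero contribution at init
```
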
## Installation

```bash
# 1. Download this whole repository (~78 GB; needs git-lfs)
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
cd ViTeX-14B

# 2. Set up a fresh Python env and install the standard PyPI deps
conda create -n vitex python=3.12 -y
conda activate vitex
pip install -r requirements.txt
```

Hardware:

- 1 × NVIDIA GPU with **≥ 80 GB VRAM** (H100 / A100 80 GB)
- ~70 GB peak VRAM at 720 × 1280 × 121 frames
- ~250 GB CPU RAM recommended (DiT weights + offloads during loading)
- ~90 GB free disk for repo + workspace

## Usage

End-to-end inference with the provided script:

```bash
python inference_example.py \
  --vace_video path/to/source.mp4 \
  --vace_mask path/to/mask.mp4 \
  --glyph_video path/to/target_glyph.mp4 \
  --prompt "..." \
  --output out.mp4
```

The script automatically uses the bundled `base_model/` and the
`vitex_14b.safetensors` weights – no further downloads needed.

Programmatic use:

```python
import sys, os
sys.path.insert(0, ".")  # so `import diffsynth` resolves to bundled lib
import torch, glob
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.core import load_state_dict

base_dir = "./base_model"
diffusion_shards = sorted(glob.glob(f"{base_dir}/diffusion_pytorch_model-*.safetensors"))

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda:0",
    model_configs=[
        ModelConfig(path=diffusion_shards),
        ModelConfig(path=f"{base_dir}/models_t5_umt5-xxl-enc-bf16.pth"),
        ModelConfig(path=f"{base_dir}/Wan2.1_VAE.pth"),
    ],
    tokenizer_config=ModelConfig(path=f"{base_dir}/google/umt5-xxl"),
    redirect_common_files=False,
)
pipe.vace.load_state_dict(load_state_dict("./vitex_14b.safetensors"), strict=False)

# ... feed in vace_video / vace_video_mask / glyph_video / prompt ...
```

See `inference_example.py` for a complete reference, including video loading
and saving helpers.

## Data preparation

To produce `glyph_video` from a target text string:

1. Detect / track the text-region bounding box per frame.
2. Render the target string with `cv2.putText` or PIL inside the box on a
   black background; export as MP4 with the same frame count and resolution
   as the source (a sketch of this step follows below).

`vace_video_mask` is a binary per-frame mask of the text region (1 = replace);
typically a tight, slightly dilated box around the tracked region.

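
As a rough sketch of the rendering step and the mask (assumptions: OpenCV's `mp4v` writer, a Hershey font, and per-frame boxes `(x, y, w, h)` already produced by a tracker):

```python
import cv2
import numpy as np

def render_glyph_and_mask(boxes, text, size=(1280, 720), fps=24):
    """Write target_glyph.mp4 (white text on black) and mask.mp4 (filled box)."""
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    glyph_out = cv2.VideoWriter("target_glyph.mp4", fourcc, fps, size)
    mask_out = cv2.VideoWriter("mask.mp4", fourcc, fps, size)
    for x, y, w, h in boxes:                      # one box per source frame
        glyph = np.zeros((size[1], size[0], 3), np.uint8)
        scale = h / 40.0                          # crude font scaling to box height
        cv2.putText(glyph, text, (x, y + h), cv2.FONT_HERSHEY_SIMPLEX,
                    scale, (255, 255, 255), 2, cv2.LINE_AA)
        mask = np.zeros_like(glyph)
        pad = 4                                   # slight dilation around the box
        cv2.rectangle(mask, (x - pad, y - pad), (x + w + pad, y + h + pad),
                      (255, 255, 255), -1)
        glyph_out.write(glyph)
        mask_out.write(mask)
    glyph_out.release()
    mask_out.release()
```
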
## Training summary

| Stage | Frames | Resolution | Epochs | Wall time | Notes |
|---|---|---|---|---|---|
| 1 | 49 | 720 × 1280 | 5 | ~22 h | bootstrap on shorter clips |
| 2 | 121 | 720 × 1280 | 2 | ~30 h | fine-tune at full length, init from Stage 1 epoch-4 |

- 230 video samples, `dataset_repeat=10` → 288 optimizer steps per epoch
- AdamW, lr 1e-5, weight_decay 1e-2, no LR schedule
- Gradient accumulation 8, effective batch 64 micro-batches
- DeepSpeed ZeRO-3 with parameter + optimizer state CPU offload
- `--use_gradient_checkpointing_offload` (manual activation offload)
- VACE module fully trained (4.02 B params); base DiT, T5, Wan VAE all frozen

## Limitations

- Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and
  non-Latin scripts is limited.
- Best on planar text (signs, posters); fast-moving or highly distorted text
  may degrade.
- Inference requires the full 14 B base; no quantized variant released.
- Single-GPU 80 GB inference assumed; multi-node sharding scripts not bundled.

## Citation

@@ -227,5 +96,4 @@ typically a tight, slightly dilated box around the tracked region.

## License

Apache-2.0. See `LICENSE.txt` for the base model
license; the same license applies to the trained weights and bundled code.

---

The updated README after this commit:

# ViTeX-14B

ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.

This repository is fully self-contained: it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.

## Specs

| | |
|---|---|
| Trainable parameters | 4.02 B (VACE blocks + new modules) |
| New modules added | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
| Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
| Resolution | 720 × 1280 |
| Frames | 121 (about 5 s at 24 fps) |
| Training | Stage 1: 5 epochs at 49 frames (22 h); Stage 2: 2 epochs at 121 frames (30 h) |
| Hardware | 8 × NVIDIA H100 80 GB |

## Repository contents

```
.
├── README.md
├── requirements.txt
├── inference_example.py
├── vitex_14b.safetensors      (8 GB – trained adapter weights)
├── diffsynth/                 (bundled inference library)
└── base_model/                (70 GB – frozen base model files)
    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    ├── Wan2.1_VAE.pth
    └── google/umt5-xxl/       (T5 tokenizer)
```

## Inputs

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB video, 121 frames at 720 × 1280 | Original video containing text to replace |
| `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: 1 = text region to replace, 0 = preserve |
| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the target text placed where the mask is |
| `prompt` | text string | The target text itself, e.g. `HILTON` |

## Installation

```bash
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
cd ViTeX-14B

conda create -n vitex python=3.12 -y
conda activate vitex
pip install -r requirements.txt
```

Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB VRAM at 720 × 1280 × 121 frames.

## Usage

```bash
python inference_example.py \
  --vace_video path/to/source.mp4 \
  --vace_mask path/to/mask.mp4 \
  --glyph_video path/to/target_glyph.mp4 \
  --prompt "HILTON" \
  --output out.mp4
```

The script automatically uses the bundled `base_model/` and `vitex_14b.safetensors` – no extra downloads.

## Limitations

- Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
- Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
- Inference requires the full 14 B base; no quantized variant released.

## Citation

## License

Apache-2.0. See `base_model/LICENSE.txt` for the upstream base model license.