# ViTeX-14B

Project page: [vitex-bench.github.io](https://vitex-bench.github.io) - qualitative results, leaderboard, and full project overview.
ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.
This repository is fully self-contained: it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.
## Specs

| | |
|---|---|
| Trainable parameters | 4.02 B (VACE blocks + new modules) |
| New modules added | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
| Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
| Resolution | 720 × 1280 |
| Frames | 121 (about 5 s at 24 fps) |
| Training | Stage 1: 5 epochs at 49 frames (22 h); Stage 2: 2 epochs at 121 frames (30 h) |
| Hardware | 8 × NVIDIA H100 80 GB |
## Repository contents

```
.
├── README.md
├── requirements.txt
├── inference_example.py      # run ViTeX-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py     # build the ViTeX-14B (Composite) variant from raw predictions
├── vitex_14b.safetensors     # 8 GB, trained adapter weights
├── diffsynth/                # bundled inference library
└── base_model/               # 70 GB, frozen base model files
    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    ├── Wan2.1_VAE.pth
    └── google/umt5-xxl/      # T5 tokenizer
```
## Inputs

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB video, 121 frames at 720 × 1280 | Original video containing text to replace |
| `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: 1 = text region to replace, 0 = preserve |
| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the target text placed where the mask is |
| `prompt` | text string | The target text itself, e.g. `HILTON` |
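To make the `glyph_video` input concrete, here is a minimal Pillow sketch of rendering one glyph frame: the target text drawn in white at the mask's bounding box on a black canvas. The helper name, default font, and top-left placement are illustrative assumptions, not this repo's glyph renderer:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_frame(mask: np.ndarray, text: str) -> np.ndarray:
    """mask: (H, W) binary array; returns an (H, W, 3) uint8 RGB glyph frame."""
    h, w = mask.shape
    canvas = Image.new("RGB", (w, h), "black")
    ys, xs = np.nonzero(mask)
    if xs.size:  # draw only if the mask is non-empty on this frame
        draw = ImageDraw.Draw(canvas)
        font = ImageFont.load_default()  # swap in a TTF sized to the mask region
        draw.text((int(xs.min()), int(ys.min())), text, fill="white", font=font)
    return np.asarray(canvas)

mask = np.zeros((720, 1280), dtype=np.uint8)
mask[300:420, 400:880] = 1
frame = render_glyph_frame(mask, "HILTON")
```

Repeating this per frame (tracking the mask as it moves) and encoding the frames yields a glyph video of the required shape.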
## Installation

```bash
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
cd ViTeX-14B
conda create -n vitex python=3.12 -y
conda activate vitex
pip install -r requirements.txt
```

Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB of VRAM at 720 × 1280 × 121 frames.
## Usage

```bash
python inference_example.py \
    --vace_video path/to/source.mp4 \
    --vace_mask path/to/mask.mp4 \
    --glyph_video path/to/target_glyph.mp4 \
    --prompt "HILTON" \
    --output out.mp4
```

The script automatically uses the bundled base_model/ and vitex_14b.safetensors; no extra downloads are needed.
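The checkpoint was trained at a fixed clip length, so all three video inputs must be exactly 121 frames. A minimal numpy sketch of conforming an arbitrary decoded clip; repeat-last-frame padding is an assumption here, check inference_example.py for the repo's actual handling:

```python
import numpy as np

def conform_frames(frames: np.ndarray, target_len: int = 121) -> np.ndarray:
    """Trim a (T, H, W, C) clip to target_len frames, or pad by repeating the last frame."""
    t = frames.shape[0]
    if t >= target_len:
        return frames[:target_len]
    pad = np.repeat(frames[-1:], target_len - t, axis=0)
    return np.concatenate([frames, pad], axis=0)
```

Apply the same function to the source, mask, and glyph clips so they stay frame-aligned.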
## Locality-preserving variant: ViTeX-14B (Composite)
make_corp_baseline.py is a deterministic, training-free post-processing wrapper that composites ViTeX-14B's predicted text region back onto the source video. It applies two per-frame operations:
- Reinhard mean-variance LAB color matching on a 20-px band just outside the mask, so the predicted glyphs match the source's local lighting.
- Signed-distance feathered alpha compositing (4-px feather centered on the mask boundary), so the seam is smooth.
Inside the mask the result is the (color-matched) predicted glyphs; outside the feather the result is byte-identical to the source. SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-14B (the predicted text region is unchanged), but PSNR / SSIM / LPIPS / DreamSim jump to near-identity because the unedited region no longer pays the VAE round-trip penalty.
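The two operations can be sketched in a few lines of numpy. This is an illustrative simplification, not the script itself: it matches per-channel statistics in RGB over whatever band mask you pass in (the real script works in LAB on a 20-px ring) and feathers with a cheap box blur rather than a signed-distance transform:

```python
import numpy as np

def reinhard_match(pred, src, band):
    """Shift pred's per-channel mean/std toward src's, with stats taken at `band` pixels."""
    out = pred.astype(np.float32)
    for c in range(3):
        p = pred[..., c][band].astype(np.float32)
        s = src[..., c][band].astype(np.float32)
        out[..., c] = (out[..., c] - p.mean()) * (s.std() / (p.std() + 1e-6)) + s.mean()
    return np.clip(out, 0, 255).astype(np.uint8)

def feather_alpha(mask, iters=4):
    """Soften a binary mask with repeated 5-point box averaging (cheap feather)."""
    a = mask.astype(np.float32)
    for _ in range(iters):
        a = (a + np.roll(a, 1, 0) + np.roll(a, -1, 0)
               + np.roll(a, 1, 1) + np.roll(a, -1, 1)) / 5.0
    return a[..., None]

def composite(pred, src, mask):
    """Feathered alpha blend: predicted pixels inside the mask, source pixels outside."""
    a = feather_alpha(mask)
    return np.round(a * pred + (1.0 - a) * src).astype(np.uint8)
```

Because the feather's alpha is exactly zero away from the mask, pixels outside the blend band come out bit-identical to the source, which is what restores the pixel-fidelity metrics.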
```bash
# Assumes you already have raw ViTeX-14B predictions in <pred_dir>/*.mp4
# and the eval split of ViTeX-Dataset under <data_root> (eval/original_videos/, eval/masks/).
python make_corp_baseline.py \
    --records <data_root>/parsed_records.json \
    --data_root <data_root> \
    --pred_dir <raw_vitex14b_predictions_dir> \
    --out_dir <output_dir_for_corp_baseline> \
    --workers 8
```
CPU-only; runs in about 5 minutes with 8 workers on the 157-clip ViTeX-Bench evaluation split. Requires ffmpeg on PATH.
Reference: appendix G of the ViTeX-Bench paper.
## Limitations
- Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
- Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
- Inference requires the full 14 B base; no quantized variant released.
## Citation

```bibtex
@misc{vitex2026,
  title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
  author = {Anonymous},
  year   = {2026},
  url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
}
```
## License
Apache-2.0. See base_model/LICENSE.txt for the upstream base model license.