ViTeX-14B

🌐 Project page: vitex-bench.github.io (qualitative results, leaderboard, and full project overview).

ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.

This repository is fully self-contained: it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.

Specs

Trainable parameters    4.02 B (VACE blocks + new modules)
New modules added       971 M (GlyphEncoder + 8 × ConditionCrossAttention)
Total inference params  ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B)
Resolution              720 × 1280
Frames                  121 (about 5 s at 24 fps)
Training                Stage 1: 5 epochs at 49 frames (22 h); Stage 2: 2 epochs at 121 frames (30 h)
Hardware                8 × NVIDIA H100 80 GB

Repository contents

.
├── README.md
├── requirements.txt
├── inference_example.py            run ViTeX-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py           build the ViTeX-14B (Composite) variant from raw predictions
├── vitex_14b.safetensors           (8 GB, trained adapter weights)
├── diffsynth/                      (bundled inference library)
└── base_model/                     (70 GB, frozen base model files)
    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    ├── Wan2.1_VAE.pth
    └── google/umt5-xxl/            (T5 tokenizer)

Inputs

Input            Format                                Description
vace_video       RGB video, 121 frames at 720 × 1280   Original video containing text to replace
vace_video_mask  grayscale video, same shape           Per-frame binary mask: 1 = text region to replace, 0 = preserve
glyph_video      RGB video, same shape                 Pre-rendered glyphs of the target text placed where the mask is
prompt           text string                           The target text itself, e.g. "HILTON"
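
The repository does not ship a glyph renderer, so the sketch below shows one plausible way to build glyph_video from the mask video using OpenCV and Pillow. render_glyph_video, the font path, and the bounding-box sizing heuristic are all assumptions for illustration, not part of this repo:

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_video(mask_path, out_path, text, font_path, fps=24):
    cap = cv2.VideoCapture(mask_path)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) > 127
        canvas = Image.new("RGB", (w, h), "black")
        if mask.any():
            ys, xs = np.nonzero(mask)
            # Rough heuristic (an assumption): size the glyphs to the
            # mask's per-frame bounding-box height.
            font = ImageFont.truetype(font_path, size=max(8, int(ys.max() - ys.min())))
            ImageDraw.Draw(canvas).text((int(xs.min()), int(ys.min())),
                                        text, font=font, fill="white")
        out.write(cv2.cvtColor(np.array(canvas), cv2.COLOR_RGB2BGR))
    cap.release()
    out.release()

render_glyph_video("path/to/mask.mp4", "path/to/target_glyph.mp4",
                   text="HILTON", font_path="DejaVuSans-Bold.ttf")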

Installation

git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
cd ViTeX-14B
conda create -n vitex python=3.12 -y
conda activate vitex
pip install -r requirements.txt

Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB of VRAM at 720 × 1280 × 121 frames.

Usage

python inference_example.py \
    --vace_video   path/to/source.mp4 \
    --vace_mask    path/to/mask.mp4 \
    --glyph_video  path/to/target_glyph.mp4 \
    --prompt       "HILTON" \
    --output       out.mp4

The script automatically uses the bundled base_model/ and vitex_14b.safetensors; no extra downloads are needed.
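
Since the model expects exactly 121 frames at 720 × 1280, a quick pre-flight check can save a failed run. A minimal sketch with OpenCV (check_inputs is illustrative; it assumes 720 × 1280 means height × width):

import cv2

def check_inputs(paths, frames=121, width=1280, height=720):
    for p in paths:
        cap = cv2.VideoCapture(p)
        n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        cap.release()
        assert (n, w, h) == (frames, width, height), \
            f"{p}: got {n} frames at {w}x{h}, expected {frames} at {width}x{height}"

check_inputs(["path/to/source.mp4", "path/to/mask.mp4", "path/to/target_glyph.mp4"])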

Locality-preserving variant: ViTeX-14B (Composite)

make_corp_baseline.py is a deterministic, training-free post-processing wrapper that composes ViTeX-14B's predicted text region back onto the source video. It applies two per-frame operations:

  1. Reinhard mean–variance LAB color matching on a 20-px band just outside the mask, so the predicted glyphs match the source's local lighting.
  2. Signed-distance feathered alpha compositing (4-px feather centered on the mask boundary), so the seam is smooth.

Inside the mask the result is the predicted glyphs (color-matched); outside the feather the result is byte-identical to the source. SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-14B (the predicted text region is unchanged), while PSNR / SSIM / LPIPS / DreamSim approach the Identity baseline, because the unedited region no longer pays the VAE round-trip penalty.
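
For readers who want the details, here is a minimal per-frame sketch of the two operations in NumPy + OpenCV; composite_frame is illustrative, not the script's actual internals:

import cv2
import numpy as np

def composite_frame(src, pred, mask, band=20, feather=4):
    # src, pred: HxWx3 uint8 BGR frames; mask: HxW uint8, nonzero = text region.
    # 1. Reinhard mean-variance LAB transfer, with statistics taken on a
    #    band of `band` px just outside the mask.
    kernel = np.ones((2 * band + 1, 2 * band + 1), np.uint8)
    ring = (cv2.dilate(mask, kernel) > 0) & (mask == 0)
    src_lab = cv2.cvtColor(src, cv2.COLOR_BGR2LAB).astype(np.float32)
    pred_lab = cv2.cvtColor(pred, cv2.COLOR_BGR2LAB).astype(np.float32)
    if ring.any():
        for c in range(3):
            s = src_lab[..., c][ring]
            p = pred_lab[..., c][ring]
            pred_lab[..., c] = (pred_lab[..., c] - p.mean()) * (s.std() / (p.std() + 1e-6)) + s.mean()
    matched = cv2.cvtColor(np.clip(pred_lab, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)

    # 2. Signed distance to the mask boundary (>0 inside, <0 outside),
    #    mapped to an alpha ramp of `feather` px centered on the boundary.
    inside = cv2.distanceTransform((mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    outside = cv2.distanceTransform((mask == 0).astype(np.uint8), cv2.DIST_L2, 5)
    alpha = np.clip((inside - outside) / feather + 0.5, 0, 1)[..., None]
    return (alpha * matched + (1 - alpha) * src).astype(np.uint8)

The actual script additionally handles video I/O (via ffmpeg) and multi-worker parallelism.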

# Assumes you already have raw ViTeX-14B predictions in <pred_dir>/*.mp4
# and the eval split of ViTeX-Dataset under <data_root> (eval/original_videos/, eval/masks/).
python make_corp_baseline.py \
    --records   <data_root>/parsed_records.json \
    --data_root <data_root> \
    --pred_dir  <raw_vitex14b_predictions_dir> \
    --out_dir   <output_dir_for_corp_baseline> \
    --workers   8

The script is CPU-only and runs in ~5 minutes with 8 workers on the 157-clip ViTeX-Bench evaluation split. It requires ffmpeg on PATH.

Reference: Appendix G of the ViTeX-Bench paper.

Limitations

  • Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
  • Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
  • Inference requires the full 14 B base model; no quantized variant has been released.

Citation

@misc{vitex2026,
  title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
  author = {Anonymous},
  year   = {2026},
  url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
}

License

Apache-2.0. See base_model/LICENSE.txt for the upstream base model license.
