ViTeX-14B

🌐 Project page: vitex-bench.github.io (qualitative results, leaderboard, and full project overview).

ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.

This repository is fully self-contained: it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.

Specs

Trainable parameters    4.02 B (VACE blocks + new modules)
New modules added       971 M (GlyphEncoder + 8 × ConditionCrossAttention)
Total inference params  ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B)
Resolution              720 × 1280
Frames                  121 (about 5 s at 24 fps)
Training                Stage 1: 5 epochs at 49 frames (22 h); Stage 2: 2 epochs at 121 frames (30 h)
Hardware                8 × NVIDIA H100 80 GB

Repository contents

.
├── README.md
├── requirements.txt
├── inference_example.py            run ViTeX-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py           build the ViTeX-14B (Composite) variant from raw predictions
├── vitex_14b.safetensors           (8 GB, trained adapter weights)
├── diffsynth/                      (bundled inference library)
└── base_model/                     (70 GB, frozen base model files)
    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    ├── Wan2.1_VAE.pth
    └── google/umt5-xxl/            (T5 tokenizer)

Inputs

Input            Format                                Description
vace_video       RGB video, 121 frames at 720 × 1280   Original video containing text to replace
vace_video_mask  grayscale video, same shape           Per-frame binary mask: 1 = text region to replace, 0 = preserve
glyph_video      RGB video, same shape                 Pre-rendered glyphs of the target text placed where the mask is
prompt           text string                           The target text itself, e.g. "HILTON"
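
The repository does not ship a glyph renderer, so the sketch below shows one plausible way to build glyph_video from the mask video using OpenCV and Pillow. render_glyph_video, the font path, and the bounding-box sizing heuristic are all assumptions for illustration, not part of this repo:

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_video(mask_path, out_path, text, font_path, fps=24):
    cap = cv2.VideoCapture(mask_path)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) > 127
        canvas = Image.new("RGB", (w, h), "black")
        if mask.any():
            ys, xs = np.nonzero(mask)
            # Rough heuristic (an assumption): size the glyphs to the
            # mask's per-frame bounding-box height.
            font = ImageFont.truetype(font_path, size=max(8, int(ys.max() - ys.min())))
            ImageDraw.Draw(canvas).text((int(xs.min()), int(ys.min())),
                                        text, font=font, fill="white")
        out.write(cv2.cvtColor(np.array(canvas), cv2.COLOR_RGB2BGR))
    cap.release()
    out.release()

render_glyph_video("path/to/mask.mp4", "path/to/target_glyph.mp4",
                   text="HILTON", font_path="DejaVuSans-Bold.ttf")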

Installation

git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
cd ViTeX-14B
conda create -n vitex python=3.12 -y
conda activate vitex
pip install -r requirements.txt

Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB of VRAM at 720 × 1280 × 121 frames.

Usage

python inference_example.py \
    --vace_video   path/to/source.mp4 \
    --vace_mask    path/to/mask.mp4 \
    --glyph_video  path/to/target_glyph.mp4 \
    --prompt       "HILTON" \
    --output       out.mp4

The script automatically uses the bundled base_model/ and vitex_14b.safetensors; no extra downloads are needed.
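
Since the model expects exactly 121 frames at 720 × 1280, a quick pre-flight check can save a failed run. A minimal sketch with OpenCV (check_inputs is illustrative; it assumes 720 × 1280 means height × width):

import cv2

def check_inputs(paths, frames=121, width=1280, height=720):
    for p in paths:
        cap = cv2.VideoCapture(p)
        n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        cap.release()
        assert (n, w, h) == (frames, width, height), \
            f"{p}: got {n} frames at {w}x{h}, expected {frames} at {width}x{height}"

check_inputs(["path/to/source.mp4", "path/to/mask.mp4", "path/to/target_glyph.mp4"])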

Locality-preserving variant: ViTeX-14B (Composite)

make_corp_baseline.py is a deterministic, training-free post-processing wrapper that composes ViTeX-14B's predicted text region back onto the source video. It applies two per-frame operations:

  1. Reinhard mean–variance LAB color matching on a 20-px band just outside the mask, so the predicted glyphs match the source's local lighting.
  2. Signed-distance feathered alpha compositing (4-px feather centered on the mask boundary), so the seam is smooth.

Inside the mask the result is the predicted glyphs (color-matched); outside the feather the result is byte-identical to the source. SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-14B (the predicted text region is unchanged), while PSNR / SSIM / LPIPS / DreamSim approach the Identity baseline, because the unedited region no longer pays the VAE round-trip penalty.
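
For readers who want the details, here is a minimal per-frame sketch of the two operations in NumPy + OpenCV; composite_frame is illustrative, not the script's actual internals:

import cv2
import numpy as np

def composite_frame(src, pred, mask, band=20, feather=4):
    # src, pred: HxWx3 uint8 BGR frames; mask: HxW uint8, nonzero = text region.
    # 1. Reinhard mean-variance LAB transfer, with statistics taken on a
    #    band of `band` px just outside the mask.
    kernel = np.ones((2 * band + 1, 2 * band + 1), np.uint8)
    ring = (cv2.dilate(mask, kernel) > 0) & (mask == 0)
    src_lab = cv2.cvtColor(src, cv2.COLOR_BGR2LAB).astype(np.float32)
    pred_lab = cv2.cvtColor(pred, cv2.COLOR_BGR2LAB).astype(np.float32)
    if ring.any():
        for c in range(3):
            s = src_lab[..., c][ring]
            p = pred_lab[..., c][ring]
            pred_lab[..., c] = (pred_lab[..., c] - p.mean()) * (s.std() / (p.std() + 1e-6)) + s.mean()
    matched = cv2.cvtColor(np.clip(pred_lab, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)

    # 2. Signed distance to the mask boundary (>0 inside, <0 outside),
    #    mapped to an alpha ramp of `feather` px centered on the boundary.
    inside = cv2.distanceTransform((mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    outside = cv2.distanceTransform((mask == 0).astype(np.uint8), cv2.DIST_L2, 5)
    alpha = np.clip((inside - outside) / feather + 0.5, 0, 1)[..., None]
    return (alpha * matched + (1 - alpha) * src).astype(np.uint8)

The actual script additionally handles video I/O (via ffmpeg) and multi-worker parallelism.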

# Assumes you already have raw ViTeX-14B predictions in <pred_dir>/*.mp4
# and the eval split of ViTeX-Dataset under <data_root> (eval/original_videos/, eval/masks/).
python make_corp_baseline.py \
    --records   <data_root>/parsed_records.json \
    --data_root <data_root> \
    --pred_dir  <raw_vitex14b_predictions_dir> \
    --out_dir   <output_dir_for_corp_baseline> \
    --workers   8

The script is CPU-only and runs in ~5 minutes with 8 workers on the 157-clip ViTeX-Bench evaluation split. It requires ffmpeg on PATH.

Reference: Appendix G of the ViTeX-Bench paper.

Limitations

  • Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
  • Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
  • Inference requires the full 14 B base model; no quantized variant has been released.

Citation

@misc{vitex2026,
  title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
  author = {Anonymous},
  year   = {2026},
  url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
}

License

Apache-2.0. See base_model/LICENSE.txt for the upstream base model license.
