# ViTeX-Edit-14B (Model & Inference code)
🌐 Project page · 📊 Dataset · 🧪 Benchmark code · 🤖 Model & Inference code · 🏆 Leaderboard
An open reference model for video scene text editing. It augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the ViTeX-Dataset. The model replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
Anonymous release under double-blind review at NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.
## Repository

```
.
├── inference_example.py     # run ViTeX-Edit-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py    # build the ViTeX-Edit-14B (Composite) variant
├── vitex_14b.safetensors    # 8 GB, trained adapter weights
├── diffsynth/               # bundled inference library
└── base_model/              # 70 GB, frozen DiT + T5-XXL + Wan VAE
```
Self-contained: no extra clones or downloads needed.
## Inputs

| Input | Format |
|---|---|
| `vace_video` | RGB, 720 × 1280, 121 frames; source video |
| `vace_mask` | grayscale, same shape; 1 = text region to replace |
| `glyph_video` | RGB, same shape; pre-rendered target-text glyphs warped along source motion |
| `prompt` | text string; the target text |
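Before launching inference it is worth checking that the three conditioning tensors agree in shape and that the mask is strictly binary. The helper below is a minimal sketch (not part of the released code): `validate_inputs` is a hypothetical name, and it derives the expected shapes from the source video rather than hardcoding the 121 × 720 × 1280 layout listed above.

```python
import numpy as np

def validate_inputs(vace_video: np.ndarray,
                    vace_mask: np.ndarray,
                    glyph_video: np.ndarray):
    """Check that a (video, mask, glyph) tuple is mutually consistent.

    Expected layout (frames, height, width[, channels]); for the released
    model this would be (121, 720, 1280, 3) for the two RGB streams and
    (121, 720, 1280) for the mask.
    """
    t, h, w, c = vace_video.shape
    assert c == 3, "vace_video must be RGB"
    assert glyph_video.shape == (t, h, w, 3), "glyph video must match source"
    assert vace_mask.shape == (t, h, w), "mask must match source spatially"
    # The mask must be binary: 1 marks exactly the text region to replace.
    assert set(np.unique(vace_mask)) <= {0, 1}, "mask must be 0/1"
    return t, h, w
```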
## Usage

```shell
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -r requirements.txt

python inference_example.py \
    --vace_video path/to/source.mp4 \
    --vace_mask path/to/mask.mp4 \
    --glyph_video path/to/target_glyph.mp4 \
    --prompt "HILTON" \
    --output out.mp4
```
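If you do not already have a mask video, one simple way to produce the `vace_mask` input is to rasterize a tracked text box per frame into a 0/1 volume and encode it as grayscale video. The sketch below is illustrative only (`boxes_to_mask` is not part of this repository); real text regions are usually tracked quadrilaterals, and an axis-aligned box is the simplest stand-in.

```python
import numpy as np

def boxes_to_mask(boxes, num_frames, height=720, width=1280):
    """Rasterize one axis-aligned text box per frame into a binary mask volume.

    boxes: list of (x0, y0, x1, y1) pixel coordinates, one tuple per frame,
    with the half-open convention x0 <= x < x1, y0 <= y < y1.
    Returns a (num_frames, height, width) uint8 array where 1 marks the
    text region to replace.
    """
    assert len(boxes) == num_frames, "need one box per frame"
    mask = np.zeros((num_frames, height, width), dtype=np.uint8)
    for f, (x0, y0, x1, y1) in enumerate(boxes):
        mask[f, y0:y1, x0:x1] = 1
    return mask
```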
## Locality-preserving variant: ViTeX-Edit-14B (Composite)
`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper. It applies two per-frame operations: (1) Reinhard mean–variance color matching in LAB space against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the color-matched predicted glyphs; outside the feather band it is byte-identical to the source. Locality metrics rise to near-Identity levels while SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-Edit-14B.
```shell
python make_corp_baseline.py \
    --records <data_root>/parsed_records.json \
    --data_root <data_root> \
    --pred_dir <raw_vitex14b_predictions_dir> \
    --out_dir <output_dir_for_composite_baseline> \
    --workers 8
```
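The two per-frame operations can be sketched in a few lines of numpy. This is a simplified stand-in, not the released script: it matches per-channel mean/variance in whatever color space the arrays are in (the real wrapper works in LAB against the local source lighting), and it feathers the mask edge with an iterated box blur rather than a true signed-distance field. Both function names are hypothetical.

```python
import numpy as np

def match_stats(pred, src, mask):
    """Reinhard-style mean/variance matching of pred toward src inside mask."""
    out = pred.astype(np.float64).copy()
    m = mask.astype(bool)
    for c in range(pred.shape[-1]):
        p = out[..., c][m]
        s = src[..., c][m].astype(np.float64)
        p_std = p.std() or 1.0  # guard against flat regions
        out[..., c][m] = (p - p.mean()) / p_std * s.std() + s.mean()
    return np.clip(out, 0, 255)

def feathered_composite(pred, src, mask, radius=2):
    """Alpha-composite color-matched pred onto src with a soft mask edge.

    mask is (frames, H, W) in {0, 1}; the blur runs over the spatial axes
    only, so pixels outside the feather band come out exactly equal to src.
    """
    alpha = mask.astype(np.float64)
    for _ in range(radius):  # crude spatial box blur as the feather
        alpha = (np.roll(alpha, 1, 1) + np.roll(alpha, -1, 1) +
                 np.roll(alpha, 1, 2) + np.roll(alpha, -1, 2) + alpha) / 5.0
    alpha = alpha[..., None]  # broadcast over the channel axis
    matched = match_stats(pred, src, mask)
    return alpha * matched + (1.0 - alpha) * src
```

Note the locality property the wrapper advertises falls out directly: wherever the feathered alpha is zero, the output pixel is the unmodified source pixel.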
## License
Apache-2.0 for this code and the adapter weights. See `base_model/LICENSE.txt` for the upstream base-model license.
## Citation

```bibtex
@misc{vitex2026,
  title  = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
  author = {Anonymous},
  year   = {2026},
  note   = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track.
            Author list and DOI updated after deanonymization.},
}
```