---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
---

# ViTeX-Edit-14B (Model & Inference code)

🌐 [Project page](https://vitex-bench.github.io/) · 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) · 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) · 🤖 Model & Inference code · 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)

Open reference model for **video scene text editing**. ViTeX-Edit-14B augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset). It replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.

> Anonymous release under double-blind review at the NeurIPS 2026 Datasets and Benchmarks Track. The author list and DOI will be updated after deanonymization.

## Repository

```
.
├── inference_example.py     run ViTeX-Edit-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py    build the ViTeX-Edit-14B (Composite) variant
├── vitex_14b.safetensors    (8 GB, trained adapter weights)
├── diffsynth/               bundled inference library
└── base_model/              (70 GB, frozen DiT + T5-XXL + Wan VAE)
```

The repository is self-contained: no extra clones or downloads are needed.

## Inputs

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB, 720 × 1280, 121 frames | source video |
| `vace_mask` | grayscale, same shape | `1` marks the text region to replace |
| `glyph_video` | RGB, same shape | pre-rendered target-text glyphs warped along the source motion |
| `prompt` | text string | the target text |
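Of the three video inputs, `glyph_video` is the least standard. The sketch below illustrates one way such a video can be produced: render the target text on a flat plate, then warp it into the scene frame by frame. It assumes per-frame 3×3 homographies (e.g. from tracking the source text region) are already available; every name in it is hypothetical, not part of the released tooling.

```python
# Illustrative sketch only: the released tooling prepares glyph videos;
# this reimplements the idea under the assumptions stated above.
import numpy as np
import cv2
from PIL import Image, ImageDraw, ImageFont

H, W = 720, 1280  # output frame size from the Inputs table

def render_glyph_plate(text: str, size=(256, 1024)) -> np.ndarray:
    """Draw the target text white-on-black on a flat plate."""
    plate = Image.new("RGB", (size[1], size[0]), "black")
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 180)  # any available font file
    ImageDraw.Draw(plate).text((40, 20), text, font=font, fill="white")
    return np.asarray(plate)

def make_glyph_video(text: str, homographies: list[np.ndarray]) -> np.ndarray:
    """Warp the flat plate into the scene for every frame -> (T, 720, 1280, 3)."""
    plate = render_glyph_plate(text)
    frames = [cv2.warpPerspective(plate, M, (W, H)) for M in homographies]
    return np.stack(frames)
```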
## Usage

```bash
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -r requirements.txt

python inference_example.py \
  --vace_video path/to/source.mp4 \
  --vace_mask path/to/mask.mp4 \
  --glyph_video path/to/target_glyph.mp4 \
  --prompt "HILTON" \
  --output out.mp4
```
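Before launching a run it can help to confirm the (video, mask, glyph) tuple matches the shapes in the Inputs table. A minimal check, assuming `imageio[pyav]` and numpy are installed (this snippet is not part of the released code):

```python
# Quick consistency check on one input tuple before running inference.
import imageio.v3 as iio
import numpy as np

video = iio.imread("path/to/source.mp4", plugin="pyav")
mask  = iio.imread("path/to/mask.mp4", plugin="pyav")
glyph = iio.imread("path/to/target_glyph.mp4", plugin="pyav")

assert video.shape == (121, 720, 1280, 3), video.shape
assert glyph.shape == video.shape
assert mask.shape[:3] == video.shape[:3]  # mask may decode as 1- or 3-channel

# The mask should be (near-)binary: 1 = text region to replace.
mask_bin = mask.reshape(*mask.shape[:3], -1).mean(-1) > 127
print(f"mask covers {100 * mask_bin.mean():.2f}% of all pixels")
```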
## Locality-preserving variant: ViTeX-Edit-14B (Composite)

`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that applies two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the predicted glyphs, color-matched; outside the feather band it is byte-identical to the source. Locality metrics rise to near-Identity levels while SeqAcc / CharAcc move within ~0.01 of raw ViTeX-Edit-14B.

```bash
python make_corp_baseline.py \
  --records <data_root>/parsed_records.json \
  --data_root <data_root> \
  --pred_dir <pred_dir> \
  --out_dir <out_dir> \
  --workers 8
```
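For reference, here is a minimal per-frame sketch of the two operations, assuming numpy, scipy, and scikit-image; `make_corp_baseline.py` is the actual implementation, and the ring width and feather width below are illustrative choices, not the released defaults.

```python
# Sketch of the Composite post-process for a single frame.
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_dilation
from skimage.color import rgb2lab, lab2rgb

def composite_frame(src, pred, mask, feather_px=4.0):
    """src, pred: (H, W, 3) float in [0, 1]; mask: (H, W) bool."""
    # (1) Reinhard mean-variance transfer in LAB: match the predicted
    # glyph statistics to the source's local lighting, estimated on a
    # ring of source pixels just outside the mask.
    ring = binary_dilation(mask, iterations=12) & ~mask
    src_lab, pred_lab = rgb2lab(src), rgb2lab(pred)
    for c in range(3):
        s, p = src_lab[..., c][ring], pred_lab[..., c][mask]
        gain = (s.std() + 1e-6) / (p.std() + 1e-6)
        pred_lab[..., c] = (pred_lab[..., c] - p.mean()) * gain + s.mean()
    matched = np.clip(lab2rgb(pred_lab), 0.0, 1.0)

    # (2) Signed distance to the mask boundary -> alpha that is 1 on
    # the mask and decays to 0 over `feather_px` outside it, so pixels
    # beyond the feather band are copied from the source unchanged.
    sdf = distance_transform_edt(mask) - distance_transform_edt(~mask)
    alpha = np.clip(sdf / feather_px + 1.0, 0.0, 1.0)[..., None]
    return alpha * matched + (1.0 - alpha) * src
```

Applying `composite_frame` to all 121 frames gives the locality guarantee by construction: alpha is exactly zero beyond the feather, so those pixels come straight from the source.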
## License

Apache-2.0 (this code and the adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.

## Citation

```bibtex
@misc{vitex2026,
  title  = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
  author = {Anonymous},
  year   = {2026},
  note   = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
}
```