---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
---

# ViTeX-Edit-14B (Model & Inference code)
|
|
| 🌐 [Project page](https://vitex-bench.github.io/) · |
| 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) · |
| 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) · |
| 🤖 Model & Inference code · |
| 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard) |
|
|
Open reference model for **video scene text editing**. It augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset). Given a mask, it replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
|
|
> Anonymous release under double-blind review at the NeurIPS 2026 Datasets and Benchmarks Track. The author list and DOI will be updated after deanonymization.
|
|
|
|
| ## Repository |
|
|
| ``` |
| . |
| ├── inference_example.py run ViTeX-Edit-14B on one (video, mask, glyph) tuple |
| ├── make_corp_baseline.py build the ViTeX-Edit-14B (Composite) variant |
| ├── vitex_14b.safetensors (8 GB, trained adapter weights) |
| ├── diffsynth/ bundled inference library |
| └── base_model/ (70 GB, frozen DiT + T5-XXL + Wan VAE) |
| ``` |
|
|
| Self-contained: no extra clones or downloads needed. |
|
|
| ## Inputs |
|
|
| | Input | Format | |
| |---|---| |
| | `vace_video` | RGB, 720 × 1280, 121 frames — source video | |
| | `vace_mask` | grayscale, same shape — `1 = text region to replace` | |
| | `glyph_video` | RGB, same shape — pre-rendered target-text glyphs warped along source motion | |
| | `prompt` | text string — the target text | |
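The three conditioning streams must be temporally and spatially aligned before inference. A minimal sketch of that alignment check (the function name and parameters here are illustrative, not part of the model's API; the shape defaults follow the spec above):

```python
import numpy as np

def validate_inputs(video, mask, glyph, frames=121, h=720, w=1280):
    """Check that source video, mask, and glyph video are aligned.

    video, glyph: (T, H, W, 3) uint8 RGB arrays
    mask:         (T, H, W) binary array, 1 = text region to replace
    """
    assert video.shape == (frames, h, w, 3), "source must be RGB frames at the expected size"
    assert glyph.shape == video.shape, "glyph video must match the source shape exactly"
    assert mask.shape == (frames, h, w), "mask is single-channel, same frame count and size"
    assert set(np.unique(mask)) <= {0, 1}, "mask must be binary: 1 = region to replace"
```

`inference_example.py` performs equivalent checks internally when decoding the `.mp4` inputs; this sketch is only to make the contract explicit.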
|
|
| ## Usage |
|
|
| ```bash |
| git lfs install |
| git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B |
| conda create -n vitex python=3.12 -y && conda activate vitex |
| pip install -r requirements.txt |
| |
| python inference_example.py \ |
| --vace_video path/to/source.mp4 \ |
| --vace_mask path/to/mask.mp4 \ |
| --glyph_video path/to/target_glyph.mp4 \ |
| --prompt "HILTON" \ |
| --output out.mp4 |
| ``` |
|
|
| ## Locality-preserving variant: ViTeX-Edit-14B (Composite) |
|
|
`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that applies two per-frame operations: (1) Reinhard mean–variance color matching in LAB space against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the color-matched predicted glyphs; outside the feather band it is byte-identical to the source. Locality metrics rise to near the Identity baseline while SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-Edit-14B.
|
|
| ```bash |
| python make_corp_baseline.py \ |
| --records <data_root>/parsed_records.json \ |
| --data_root <data_root> \ |
| --pred_dir <raw_vitex14b_predictions_dir> \ |
| --out_dir <output_dir_for_composite_baseline> \ |
| --workers 8 |
| ``` |
|
|
| ## License |
|
|
| Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{vitex2026, |
| title = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing}, |
| author = {Anonymous}, |
| year = {2026}, |
| note = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.}, |
| } |
| ``` |
|
|