---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
---
# ViTeX-Edit-14B (Model & Inference code)
🌐 [Project page](https://vitex-bench.github.io/) ·
📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) ·
🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) ·
🤖 Model & Inference code ·
🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)
Open reference model for **video scene text editing**. It augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset). The model replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
> Anonymous release under double-blind review at the NeurIPS 2026 Datasets and Benchmarks Track. The author list and DOI will be updated after deanonymization.
## Repository
```
.
├── inference_example.py     # run ViTeX-Edit-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py    # build the ViTeX-Edit-14B (Composite) variant
├── vitex_14b.safetensors    # trained adapter weights (8 GB)
├── diffsynth/               # bundled inference library
└── base_model/              # frozen DiT + T5-XXL + Wan VAE (70 GB)
```
Self-contained: no extra clones or downloads needed.
## Inputs
| Input | Format | Role |
|---|---|---|
| `vace_video` | RGB, 720 × 1280, 121 frames | source video |
| `vace_mask` | grayscale, same shape | `1` marks the text region to replace |
| `glyph_video` | RGB, same shape | pre-rendered target-text glyphs warped along the source motion |
| `prompt` | text string | the target text |
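The glyph video is supplied by the caller, not produced by the model. Below is a minimal sketch of one way to build it, assuming a per-frame homography is available from planar tracking of the text region; the font file, placement, and tracking source are placeholders, not part of this repository:

```python
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_frame(text, homography, size=(1280, 720)):
    """Illustrative only: rasterize the target text on a black canvas,
    then warp it into the scene with this frame's homography."""
    canvas = Image.new("RGB", size, (0, 0, 0))
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 96)  # placeholder font
    draw.text((100, 100), text, fill=(255, 255, 255), font=font)
    flat = np.asarray(canvas)
    # homography: 3x3 matrix mapping the flat canvas into the scene
    return cv2.warpPerspective(flat, homography, size)

# One homography per frame, e.g. from planar tracking of the sign:
# frames = [render_glyph_frame("HILTON", H_t) for H_t in homographies]
```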
## Usage
```bash
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -r requirements.txt
python inference_example.py \
--vace_video path/to/source.mp4 \
--vace_mask path/to/mask.mp4 \
--glyph_video path/to/target_glyph.mp4 \
--prompt "HILTON" \
--output out.mp4
```
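All three videos must agree in resolution and frame count (121 frames at 720 × 1280, per the table above). A quick sanity check with OpenCV before committing to a long generation run:

```python
import cv2

def probe(path):
    """Return (frame_count, width, height) for a video file."""
    cap = cv2.VideoCapture(path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    cap.release()
    return n, w, h

paths = ["path/to/source.mp4", "path/to/mask.mp4", "path/to/target_glyph.mp4"]
shapes = {p: probe(p) for p in paths}
assert len(set(shapes.values())) == 1, f"input videos disagree: {shapes}"
```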
## Locality-preserving variant: ViTeX-Edit-14B (Composite)
`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that applies two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the color-matched predicted glyphs; outside the feather it is byte-identical to the source. Locality metrics rise to near-Identity while SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-Edit-14B.
```bash
python make_corp_baseline.py \
--records <data_root>/parsed_records.json \
--data_root <data_root> \
--pred_dir <raw_vitex14b_predictions_dir> \
--out_dir <output_dir_for_composite_baseline> \
--workers 8
```
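For concreteness, here is a per-frame sketch of the two operations. It is illustrative, not the script itself; in particular, the ring used as the "local lighting" reference region is an assumption:

```python
import cv2
import numpy as np

def composite_frame(pred, src, mask, feather_px=6):
    """Sketch of the Composite post-process on one frame (uint8 BGR inputs;
    `mask` is nonzero on the text region to replace)."""
    m = mask > 0
    # Assumed reference region: a ring of source pixels around the mask.
    ring = (cv2.dilate(m.astype(np.uint8), np.ones((21, 21), np.uint8)) > 0) & ~m

    # (1) Reinhard mean-variance matching in LAB: shift and scale the
    #     prediction's channel statistics toward the source's ring statistics.
    pred_lab = cv2.cvtColor(pred, cv2.COLOR_BGR2LAB).astype(np.float32)
    src_lab = cv2.cvtColor(src, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        p_mu, p_sd = pred_lab[..., c][m].mean(), pred_lab[..., c][m].std() + 1e-6
        s_mu, s_sd = src_lab[..., c][ring].mean(), src_lab[..., c][ring].std() + 1e-6
        pred_lab[..., c] = (pred_lab[..., c] - p_mu) * (s_sd / p_sd) + s_mu
    matched = cv2.cvtColor(np.clip(pred_lab, 0, 255).astype(np.uint8),
                           cv2.COLOR_LAB2BGR)

    # (2) Signed-distance feathering: alpha is 1 inside the mask and ramps to 0
    #     over `feather_px` pixels outside it, so pixels beyond the feather
    #     keep the source bytes exactly.
    inside = cv2.distanceTransform(m.astype(np.uint8), cv2.DIST_L2, 5)
    outside = cv2.distanceTransform((~m).astype(np.uint8), cv2.DIST_L2, 5)
    alpha = np.clip((inside - outside) / feather_px + 1.0, 0.0, 1.0)[..., None]
    return (alpha * matched + (1.0 - alpha) * src).astype(np.uint8)
```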
## License
Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.
## Citation
```bibtex
@misc{vitex2026,
title = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
author = {Anonymous},
year = {2026},
note = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
}
```