---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
---

# ViTeX-Edit-14B (Model & Inference code)

🌐 [Project page](https://vitex-bench.github.io/) · 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) · 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) · 🤖 Model & Inference code · 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)

Open reference model for **video scene text editing**. ViTeX-Edit-14B augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset). It replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.

> Anonymous release under double-blind review at the NeurIPS 2026 Datasets and Benchmarks Track. The author list and DOI will be updated after deanonymization.

## Repository

```
.
├── inference_example.py     run ViTeX-Edit-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py    build the ViTeX-Edit-14B (Composite) variant
├── vitex_14b.safetensors    (8 GB, trained adapter weights)
├── diffsynth/               bundled inference library
└── base_model/              (70 GB, frozen DiT + T5-XXL + Wan VAE)
```

The repository is self-contained: no extra clones or downloads are needed.

## Inputs

| Input | Format | Description |
|---|---|---|
| `vace_video` | RGB, 720 × 1280, 121 frames | source video |
| `vace_mask` | grayscale, same shape | `1` marks the text region to replace |
| `glyph_video` | RGB, same shape | pre-rendered target-text glyphs warped along the source motion |
| `prompt` | text string | the target text |
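Of the three video inputs, `glyph_video` is the least standard. The sketch below illustrates one way such a video can be produced: render the target text on a flat plate, then warp it into the scene frame by frame. It assumes per-frame 3×3 homographies (e.g. from tracking the source text region) are already available; every name in it is hypothetical, not part of the released tooling.

```python
# Illustrative sketch only: the released tooling prepares glyph videos;
# this reimplements the idea under the assumptions stated above.
import numpy as np
import cv2
from PIL import Image, ImageDraw, ImageFont

H, W = 720, 1280  # output frame size from the Inputs table

def render_glyph_plate(text: str, size=(256, 1024)) -> np.ndarray:
    """Draw the target text white-on-black on a flat plate."""
    plate = Image.new("RGB", (size[1], size[0]), "black")
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 180)  # any available font file
    ImageDraw.Draw(plate).text((40, 20), text, font=font, fill="white")
    return np.asarray(plate)

def make_glyph_video(text: str, homographies: list[np.ndarray]) -> np.ndarray:
    """Warp the flat plate into the scene for every frame -> (T, 720, 1280, 3)."""
    plate = render_glyph_plate(text)
    frames = [cv2.warpPerspective(plate, M, (W, H)) for M in homographies]
    return np.stack(frames)
```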
## Usage

```bash
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -r requirements.txt

python inference_example.py \
  --vace_video path/to/source.mp4 \
  --vace_mask path/to/mask.mp4 \
  --glyph_video path/to/target_glyph.mp4 \
  --prompt "HILTON" \
  --output out.mp4
```
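Before launching a run it can help to confirm the (video, mask, glyph) tuple matches the shapes in the Inputs table. A minimal check, assuming `imageio[pyav]` and numpy are installed (this snippet is not part of the released code):

```python
# Quick consistency check on one input tuple before running inference.
import imageio.v3 as iio
import numpy as np

video = iio.imread("path/to/source.mp4", plugin="pyav")
mask  = iio.imread("path/to/mask.mp4", plugin="pyav")
glyph = iio.imread("path/to/target_glyph.mp4", plugin="pyav")

assert video.shape == (121, 720, 1280, 3), video.shape
assert glyph.shape == video.shape
assert mask.shape[:3] == video.shape[:3]  # mask may decode as 1- or 3-channel

# The mask should be (near-)binary: 1 = text region to replace.
mask_bin = mask.reshape(*mask.shape[:3], -1).mean(-1) > 127
print(f"mask covers {100 * mask_bin.mean():.2f}% of all pixels")
```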
## Locality-preserving variant: ViTeX-Edit-14B (Composite)

`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that applies two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the predicted glyphs, color-matched; outside the feather band it is byte-identical to the source. Locality metrics rise to near-Identity levels while SeqAcc / CharAcc move within ~0.01 of raw ViTeX-Edit-14B.

```bash
python make_corp_baseline.py \
  --records <data_root>/parsed_records.json \
  --data_root <data_root> \
  --pred_dir <pred_dir> \
  --out_dir <out_dir> \
  --workers 8
```
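For reference, here is a minimal per-frame sketch of the two operations, assuming numpy, scipy, and scikit-image; `make_corp_baseline.py` is the actual implementation, and the ring width and feather width below are illustrative choices, not the released defaults.

```python
# Sketch of the Composite post-process for a single frame.
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_dilation
from skimage.color import rgb2lab, lab2rgb

def composite_frame(src, pred, mask, feather_px=4.0):
    """src, pred: (H, W, 3) float in [0, 1]; mask: (H, W) bool."""
    # (1) Reinhard mean-variance transfer in LAB: match the predicted
    # glyph statistics to the source's local lighting, estimated on a
    # ring of source pixels just outside the mask.
    ring = binary_dilation(mask, iterations=12) & ~mask
    src_lab, pred_lab = rgb2lab(src), rgb2lab(pred)
    for c in range(3):
        s, p = src_lab[..., c][ring], pred_lab[..., c][mask]
        gain = (s.std() + 1e-6) / (p.std() + 1e-6)
        pred_lab[..., c] = (pred_lab[..., c] - p.mean()) * gain + s.mean()
    matched = np.clip(lab2rgb(pred_lab), 0.0, 1.0)

    # (2) Signed distance to the mask boundary -> alpha that is 1 on
    # the mask and decays to 0 over `feather_px` outside it, so pixels
    # beyond the feather band are copied from the source unchanged.
    sdf = distance_transform_edt(mask) - distance_transform_edt(~mask)
    alpha = np.clip(sdf / feather_px + 1.0, 0.0, 1.0)[..., None]
    return alpha * matched + (1.0 - alpha) * src
```

Applying `composite_frame` to all 121 frames gives the locality guarantee by construction: alpha is exactly zero beyond the feather, so those pixels come straight from the source.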
## License

Apache-2.0 (this code and the adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.

## Citation

```bibtex
@misc{vitex2026,
  title  = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
  author = {Anonymous},
  year   = {2026},
  note   = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
}
```