---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
---
# ViTeX-Edit-14B (Model & Inference code)
🌐 [Project page](https://vitex-bench.github.io/) ·
📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) ·
🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) ·
🤖 Model & Inference code ·
🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)
Open reference model for **video scene text editing**. It augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset). The model replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
> Anonymous release under double-blind review at the NeurIPS 2026 Datasets and Benchmarks Track. The author list and DOI will be updated after deanonymization.
## Repository
```
.
├── inference_example.py     # run ViTeX-Edit-14B on one (video, mask, glyph) tuple
├── make_corp_baseline.py    # build the ViTeX-Edit-14B (Composite) variant
├── vitex_14b.safetensors    # trained adapter weights (8 GB)
├── diffsynth/               # bundled inference library
└── base_model/              # frozen DiT + T5-XXL + Wan VAE (70 GB)
```
Self-contained: no extra clones or downloads needed.
## Inputs
| Input | Format | Role |
|---|---|---|
| `vace_video` | RGB, 720 × 1280, 121 frames | source video |
| `vace_mask` | grayscale, same shape | `1` marks the text region to replace |
| `glyph_video` | RGB, same shape | pre-rendered target-text glyphs warped along the source motion |
| `prompt` | text string | the target text |
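The glyph video is supplied by the caller, not produced by the model. Below is a minimal sketch of one way to build it, assuming a per-frame homography is available from planar tracking of the text region; the font file, placement, and tracking source are placeholders, not part of this repository:

```python
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_frame(text, homography, size=(1280, 720)):
    """Illustrative only: rasterize the target text on a black canvas,
    then warp it into the scene with this frame's homography."""
    canvas = Image.new("RGB", size, (0, 0, 0))
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 96)  # placeholder font
    draw.text((100, 100), text, fill=(255, 255, 255), font=font)
    flat = np.asarray(canvas)
    # homography: 3x3 matrix mapping the flat canvas into the scene
    return cv2.warpPerspective(flat, homography, size)

# One homography per frame, e.g. from planar tracking of the sign:
# frames = [render_glyph_frame("HILTON", H_t) for H_t in homographies]
```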
## Usage
```bash
git lfs install
git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B
conda create -n vitex python=3.12 -y && conda activate vitex
pip install -r requirements.txt
python inference_example.py \
--vace_video path/to/source.mp4 \
--vace_mask path/to/mask.mp4 \
--glyph_video path/to/target_glyph.mp4 \
--prompt "HILTON" \
--output out.mp4
```
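All three videos must agree in resolution and frame count (121 frames at 720 × 1280, per the table above). A quick sanity check with OpenCV before committing to a long generation run:

```python
import cv2

def probe(path):
    """Return (frame_count, width, height) for a video file."""
    cap = cv2.VideoCapture(path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    cap.release()
    return n, w, h

paths = ["path/to/source.mp4", "path/to/mask.mp4", "path/to/target_glyph.mp4"]
shapes = {p: probe(p) for p in paths}
assert len(set(shapes.values())) == 1, f"input videos disagree: {shapes}"
```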
## Locality-preserving variant: ViTeX-Edit-14B (Composite)
`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that applies two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the color-matched predicted glyphs; outside the feather it is byte-identical to the source. Locality metrics rise to near-Identity while SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-Edit-14B.
```bash
python make_corp_baseline.py \
--records <data_root>/parsed_records.json \
--data_root <data_root> \
--pred_dir <raw_vitex14b_predictions_dir> \
--out_dir <output_dir_for_composite_baseline> \
--workers 8
```
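For concreteness, here is a per-frame sketch of the two operations. It is illustrative, not the script itself; in particular, the ring used as the "local lighting" reference region is an assumption:

```python
import cv2
import numpy as np

def composite_frame(pred, src, mask, feather_px=6):
    """Sketch of the Composite post-process on one frame (uint8 BGR inputs;
    `mask` is nonzero on the text region to replace)."""
    m = mask > 0
    # Assumed reference region: a ring of source pixels around the mask.
    ring = (cv2.dilate(m.astype(np.uint8), np.ones((21, 21), np.uint8)) > 0) & ~m

    # (1) Reinhard mean-variance matching in LAB: shift and scale the
    #     prediction's channel statistics toward the source's ring statistics.
    pred_lab = cv2.cvtColor(pred, cv2.COLOR_BGR2LAB).astype(np.float32)
    src_lab = cv2.cvtColor(src, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        p_mu, p_sd = pred_lab[..., c][m].mean(), pred_lab[..., c][m].std() + 1e-6
        s_mu, s_sd = src_lab[..., c][ring].mean(), src_lab[..., c][ring].std() + 1e-6
        pred_lab[..., c] = (pred_lab[..., c] - p_mu) * (s_sd / p_sd) + s_mu
    matched = cv2.cvtColor(np.clip(pred_lab, 0, 255).astype(np.uint8),
                           cv2.COLOR_LAB2BGR)

    # (2) Signed-distance feathering: alpha is 1 inside the mask and ramps to 0
    #     over `feather_px` pixels outside it, so pixels beyond the feather
    #     keep the source bytes exactly.
    inside = cv2.distanceTransform(m.astype(np.uint8), cv2.DIST_L2, 5)
    outside = cv2.distanceTransform((~m).astype(np.uint8), cv2.DIST_L2, 5)
    alpha = np.clip((inside - outside) / feather_px + 1.0, 0.0, 1.0)[..., None]
    return (alpha * matched + (1.0 - alpha) * src).astype(np.uint8)
```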
## License
Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.
## Citation
```bibtex
@misc{vitex2026,
title = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
author = {Anonymous},
year = {2026},
note = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
}
```