---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-editing
- text-editing
- text-replacement
- diffusion
---

# ViTeX-Edit-14B (Model & Inference code)
|
|
| 🌐 [Project page](https://vitex-bench.github.io/) · |
| 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) · |
| 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) · |
| 🤖 Model & Inference code · |
| 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard) |
|
|
Open reference model for **video scene text editing**. It augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, and is fine-tuned on the 230-clip training split of the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset). Given a mask, it replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
|
|
> Anonymous release under double-blind review at the NeurIPS 2026 Datasets and Benchmarks Track. The author list and DOI will be updated after deanonymization.
|
|
|
|
| ## Repository |
|
|
| ``` |
| . |
| ├── inference_example.py run ViTeX-Edit-14B on one (video, mask, glyph) tuple |
| ├── make_corp_baseline.py build the ViTeX-Edit-14B (Composite) variant |
| ├── vitex_14b.safetensors (8 GB, trained adapter weights) |
| ├── diffsynth/ bundled inference library |
| └── base_model/ (70 GB, frozen DiT + T5-XXL + Wan VAE) |
| ``` |
|
|
| Self-contained: no extra clones or downloads needed. |
|
|
| ## Inputs |
|
|
| | Input | Format | |
| |---|---| |
| | `vace_video` | RGB, 720 × 1280, 121 frames — source video | |
| | `vace_mask` | grayscale, same shape — `1 = text region to replace` | |
| | `glyph_video` | RGB, same shape — pre-rendered target-text glyphs warped along source motion | |
| | `prompt` | text string — the target text | |
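The three conditioning streams must be temporally and spatially aligned before inference. A minimal sketch of that alignment check (the function name and parameters here are illustrative, not part of the model's API; the shape defaults follow the spec above):

```python
import numpy as np

def validate_inputs(video, mask, glyph, frames=121, h=720, w=1280):
    """Check that source video, mask, and glyph video are aligned.

    video, glyph: (T, H, W, 3) uint8 RGB arrays
    mask:         (T, H, W) binary array, 1 = text region to replace
    """
    assert video.shape == (frames, h, w, 3), "source must be RGB frames at the expected size"
    assert glyph.shape == video.shape, "glyph video must match the source shape exactly"
    assert mask.shape == (frames, h, w), "mask is single-channel, same frame count and size"
    assert set(np.unique(mask)) <= {0, 1}, "mask must be binary: 1 = region to replace"
```

`inference_example.py` performs equivalent checks internally when decoding the `.mp4` inputs; this sketch is only to make the contract explicit.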
|
|
| ## Usage |
|
|
| ```bash |
| git lfs install |
| git clone https://huggingface.co/ViTeX-Bench/ViTeX-Edit-14B && cd ViTeX-Edit-14B |
| conda create -n vitex python=3.12 -y && conda activate vitex |
| pip install -r requirements.txt |
| |
| python inference_example.py \ |
| --vace_video path/to/source.mp4 \ |
| --vace_mask path/to/mask.mp4 \ |
| --glyph_video path/to/target_glyph.mp4 \ |
| --prompt "HILTON" \ |
| --output out.mp4 |
| ``` |
|
|
| ## Locality-preserving variant: ViTeX-Edit-14B (Composite) |
|
|
`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that applies two per-frame operations: (1) Reinhard mean–variance color matching in LAB space against the source's local lighting, and (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the color-matched predicted glyphs; outside the feather band it is byte-identical to the source. Locality metrics rise to near the Identity baseline while SeqAcc / CharAcc stay within ~0.01 of raw ViTeX-Edit-14B.
|
|
| ```bash |
| python make_corp_baseline.py \ |
| --records <data_root>/parsed_records.json \ |
| --data_root <data_root> \ |
| --pred_dir <raw_vitex14b_predictions_dir> \ |
| --out_dir <output_dir_for_composite_baseline> \ |
| --workers 8 |
| ``` |
|
|
| ## License |
|
|
| Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{vitex2026, |
| title = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing}, |
| author = {Anonymous}, |
| year = {2026}, |
| note = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.}, |
| } |
| ``` |
|
|