Anonymous Authors committed
Commit 0ebe983 · 1 Parent(s): 8fc9a54
Slim model card to the unified suite template
Match the Dataset and Benchmark code READMEs: same five-link header
bar, anonymous-release blockquote, then Specs / Repository / Inputs /
Usage / Composite variant / Limitations / License / Citation. Drops
the redundant "self-contained" paragraph (already implied by the
repository tree) and inline training stage details (those live in the
paper); keeps every command needed to run inference end-to-end.
README.md
CHANGED
@@ -8,71 +8,58 @@ tags:
 - diffusion
 ---

-# ViTeX-14B

 🌐 [Project page](https://vitex-bench.github.io/) ·
 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) ·
 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) ·
 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)

-

-

 ## Specs

 | | |
 |---|---|
 | Trainable parameters | 4.02 B (VACE blocks + new modules) |
-| New modules
 | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
-| Resolution |
-|
-| Training | Stage 1: 5 epochs at 49 frames (22 h) ; Stage 2: 2 epochs at 121 frames (30 h) |
-| Hardware | 8 × NVIDIA H100 80 GB |

-## Repository

 ```
 .
-├── README.md
-├── requirements.txt
 ├── inference_example.py     run ViTeX-14B on one (video, mask, glyph) tuple
-├── make_corp_baseline.py    build the ViTeX-14B (Composite) variant
-├── vitex_14b.safetensors    (8 GB
-├── diffsynth/
-└── base_model/              (70 GB
-    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
-    ├── models_t5_umt5-xxl-enc-bf16.pth
-    ├── Wan2.1_VAE.pth
-    └── google/umt5-xxl/ (T5 tokenizer)
 ```

 ## Inputs

-| Input | Format |
-|---|---|
-| `vace_video`
-| `
-| `glyph_video` | RGB
-| `prompt`

-##

 ```bash
 git lfs install
-git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
-
-conda create -n vitex python=3.12 -y
-conda activate vitex
 pip install -r requirements.txt
-```
-
-Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB VRAM at 720 × 1280 × 121 frames.

-## Usage
-
-```bash
 python inference_example.py \
     --vace_video path/to/source.mp4 \
     --vace_mask path/to/mask.mp4 \
@@ -81,49 +68,36 @@ python inference_example.py \
     --output out.mp4
 ```

-The script automatically uses the bundled `base_model/` and `vitex_14b.safetensors` — no extra downloads.
-
 ## Locality-preserving variant: ViTeX-14B (Composite)

-`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper
-
-1. **Reinhard mean–variance LAB color matching** on a 20-px band just outside the mask, so the predicted glyphs match the source's local lighting.
-2. **Signed-distance feathered alpha compositing** (4-px feather centered on the mask boundary), so the seam is smooth.
-
-Inside the mask the result is the predicted glyphs (color-matched); outside the feather the result is byte-identical to the source. SeqAcc / CharAcc are within ~0.01 of raw ViTeX-14B (the predicted text region is unchanged), but PSNR / SSIM / LPIPS / DreamSim jump to near-Identity because the unedited region no longer pays the VAE round-trip penalty.

 ```bash
-# Assumes you already have raw ViTeX-14B predictions in <pred_dir>/*.mp4
-# and the eval split of ViTeX-Dataset under <data_root> (eval/original_videos/, eval/masks/).
 python make_corp_baseline.py \
     --records <data_root>/parsed_records.json \
     --data_root <data_root> \
     --pred_dir <raw_vitex14b_predictions_dir> \
-    --out_dir <
     --workers 8
 ```

-

-

-##

--
-- Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
-- Inference requires the full 14 B base; no quantized variant released.

 ## Citation

 ```bibtex
 @misc{vitex2026,
-  title = {ViTeX-
   author = {Anonymous},
   year = {2026},
-
 }
 ```
-
-## License
-
-Apache-2.0. See `base_model/LICENSE.txt` for the upstream base model license.

@@ -8,71 +8,58 @@ tags:
 - diffusion
 ---

+# ViTeX-14B (Model & Inference code)

 🌐 [Project page](https://vitex-bench.github.io/) ·
 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) ·
 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) ·
+🤖 Model & Inference code ·
 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)

+Open reference model for **video scene text editing**. Augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, fine-tuned on the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) 230-clip training split. Replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.

+> Anonymous release under double-blind review at NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.

 ## Specs

 | | |
 |---|---|
 | Trainable parameters | 4.02 B (VACE blocks + new modules) |
+| New modules | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
 | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
+| Resolution / frames / fps | 1280 × 720 / 121 / 24 |
+| Hardware | 1 × NVIDIA H100 / A100 80 GB (~70 GB VRAM) |

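The figures in the new Specs table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers listed in the table; the variable names are illustrative, and splitting the trainable parameters into "VACE blocks" and "new modules" by subtraction is an assumption based on the table's "VACE blocks + new modules" wording.

```python
# Component sizes in billions of parameters, copied from the Specs table.
components_b = {"DiT": 18.3, "T5-XXL": 5.7, "Wan VAE": 0.13}
total_b = sum(components_b.values())  # the table rounds this to ~24 B

trainable_b = 4.02     # "Trainable parameters" row
new_modules_b = 0.971  # "New modules" row (971 M)

# Assumption: trainable = VACE blocks + new modules, so the remainder is VACE.
vace_blocks_b = trainable_b - new_modules_b

# Clip duration implied by 121 frames at 24 fps.
duration_s = 121 / 24

print(f"total inference params: {total_b:.2f} B")
print(f"trainable VACE blocks:  {vace_blocks_b:.3f} B")
print(f"clip duration:          {duration_s:.2f} s")
```

Under these assumptions the three components sum to slightly more than the rounded "~24 B", and a full-length clip lasts about five seconds.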
+## Repository

 ```
 .
 ├── inference_example.py     run ViTeX-14B on one (video, mask, glyph) tuple
+├── make_corp_baseline.py    build the ViTeX-14B (Composite) variant
+├── vitex_14b.safetensors    (8 GB, trained adapter weights)
+├── diffsynth/               bundled inference library
+└── base_model/              (70 GB, frozen DiT + T5-XXL + Wan VAE)
 ```

+Self-contained: no extra clones or downloads needed.
+
 ## Inputs

+| Input | Format |
+|---|---|
+| `vace_video` | RGB, 720 × 1280, 121 frames — source video |
+| `vace_mask` | grayscale, same shape — `1 = text region to replace` |
+| `glyph_video` | RGB, same shape — pre-rendered target-text glyphs warped along source motion |
+| `prompt` | text string — the target text |

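The shape constraints in the Inputs table could be checked before launching a long inference run. The following is a minimal sketch, not the validation performed by `inference_example.py`: the helper names `check_inputs` and `binarize`, the `(frames, height, width[, channels])` axis order, and the threshold of 128 for 8-bit masks are all assumptions made for illustration.

```python
from typing import Tuple

Shape = Tuple[int, ...]

# Values taken from the Inputs table; the axis order is an assumption.
EXPECTED_FRAMES, EXPECTED_H, EXPECTED_W = 121, 720, 1280


def check_inputs(video: Shape, mask: Shape, glyph: Shape, prompt: str) -> None:
    """Raise ValueError if the shapes disagree with the Inputs table."""
    rgb = (EXPECTED_FRAMES, EXPECTED_H, EXPECTED_W, 3)
    gray = (EXPECTED_FRAMES, EXPECTED_H, EXPECTED_W)
    if video != rgb:
        raise ValueError(f"vace_video: expected {rgb}, got {video}")
    if mask != gray:
        raise ValueError(f"vace_mask: expected {gray}, got {mask}")
    if glyph != rgb:
        raise ValueError(f"glyph_video: expected {rgb}, got {glyph}")
    if not prompt.strip():
        raise ValueError("prompt: the target text must be non-empty")


def binarize(mask_row, threshold=128):
    """Map 8-bit grayscale values to the table's convention: 1 = replace."""
    return [1 if v >= threshold else 0 for v in mask_row]


if __name__ == "__main__":
    check_inputs((121, 720, 1280, 3), (121, 720, 1280), (121, 720, 1280, 3), "OPEN")
    print(binarize([0, 127, 128, 255]))
```

Passing plain shape tuples keeps the sketch independent of any particular video-reading library.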
+## Usage

 ```bash
 git lfs install
+git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B && cd ViTeX-14B
+conda create -n vitex python=3.12 -y && conda activate vitex
 pip install -r requirements.txt

 python inference_example.py \
     --vace_video path/to/source.mp4 \
     --vace_mask path/to/mask.mp4 \
@@ -81,49 +68,36 @@ python inference_example.py \
     --output out.mp4
 ```

 ## Locality-preserving variant: ViTeX-14B (Composite)

+`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper. Two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting; (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the predicted glyphs (color-matched); outside the feather it is byte-identical to the source. Locality metrics rise to near-Identity while SeqAcc / CharAcc move within ~0.01 of raw ViTeX-14B.

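The two per-frame operations described above can be illustrated on a single channel of a 1-D row of pixels. This is a simplified sketch, not the code in `make_corp_baseline.py`: the function names are hypothetical, the values are assumed to be already in one LAB channel, the signed distance is assumed positive inside the mask, and the 4-px default feather follows the figure quoted in the previous README version.

```python
from statistics import fmean, pstdev


def reinhard_match(pred, ref, eps=1e-6):
    """Shift and scale `pred` so its mean and std match those of `ref`."""
    mu_p, sd_p = fmean(pred), pstdev(pred)
    mu_r, sd_r = fmean(ref), pstdev(ref)
    scale = sd_r / max(sd_p, eps)  # guard against a constant prediction
    return [(v - mu_p) * scale + mu_r for v in pred]


def feather_alpha(signed_dist, feather=4.0):
    """Linear ramp of width `feather` centered on the mask boundary.

    Assumes signed_dist > 0 inside the mask and < 0 outside it.
    """
    return min(1.0, max(0.0, 0.5 + signed_dist / feather))


def composite(src, pred, signed_dist, feather=4.0):
    """Alpha-blend `pred` over `src`; alpha is 0 outside the feather band."""
    out = []
    for s, p, d in zip(src, pred, signed_dist):
        a = feather_alpha(d, feather)
        out.append(a * p + (1.0 - a) * s)
    return out


if __name__ == "__main__":
    matched = reinhard_match([0.0, 2.0, 4.0], [10.0, 11.0, 12.0])
    print(matched)
    print(composite([1.0, 1.0, 1.0], [9.0, 9.0, 9.0], [-5.0, 0.0, 5.0]))
```

Because alpha is exactly 0 outside the feather band, the blended value there reduces to the source value, which is consistent with the "byte-identical to the source" claim.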
 ```bash
 python make_corp_baseline.py \
     --records <data_root>/parsed_records.json \
     --data_root <data_root> \
     --pred_dir <raw_vitex14b_predictions_dir> \
+    --out_dir <output_dir_for_composite_baseline> \
     --workers 8
 ```

+## Limitations

+* Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
+* Best on planar text; fast-moving or highly distorted text may degrade.
+* Inference requires the full 14 B base; no quantized variant released.

+## License

+Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.

 ## Citation

 ```bibtex
 @misc{vitex2026,
+  title = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
   author = {Anonymous},
   year = {2026},
+  note = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
 }
 ```