Anonymous Authors committed on
Commit 0ebe983 · 1 Parent(s): 8fc9a54

Slim model card to the unified suite template

Match the Dataset and Benchmark code READMEs: same five-link header
bar, anonymous-release blockquote, then Specs / Repository / Inputs /
Usage / Composite variant / Limitations / License / Citation. Drops
the redundant "self-contained" paragraph (already implied by the
repository tree) and inline training stage details (those live in the
paper); keeps every command needed to run inference end-to-end.

Files changed (1)
  1. README.md +33 -59
README.md CHANGED
@@ -8,71 +8,58 @@ tags:
8
  - diffusion
9
  ---
10
 
11
- # ViTeX-14B
12
 
13
  🌐 [Project page](https://vitex-bench.github.io/)  · 
14
  📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset)  · 
15
  🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench)  · 
 
16
  🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)
17
 
18
- ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.
19
 
20
- This repository is fully self-contained it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.
21
 
22
  ## Specs
23
 
24
  | | |
25
  |---|---|
26
  | Trainable parameters | 4.02 B (VACE blocks + new modules) |
27
- | New modules added | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
28
  | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
29
- | Resolution | 720 × 1280 |
30
- | Frames | 121 (about 5 s at 24 fps) |
31
- | Training | Stage 1: 5 epochs at 49 frames (22 h) ; Stage 2: 2 epochs at 121 frames (30 h) |
32
- | Hardware | 8 × NVIDIA H100 80 GB |
33
 
34
- ## Repository contents
35
 
36
  ```
37
  .
38
- ├── README.md
39
- ├── requirements.txt
40
  ├── inference_example.py run ViTeX-14B on one (video, mask, glyph) tuple
41
- ├── make_corp_baseline.py build the ViTeX-14B (Composite) variant from raw predictions
42
- ├── vitex_14b.safetensors (8 GB trained adapter weights)
43
- ├── diffsynth/ (bundled inference library)
44
- └── base_model/ (70 GB frozen base model files)
45
- ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
46
- ├── models_t5_umt5-xxl-enc-bf16.pth
47
- ├── Wan2.1_VAE.pth
48
- └── google/umt5-xxl/ (T5 tokenizer)
49
  ```
50
 
 
 
51
  ## Inputs
52
 
53
- | Input | Format | Description |
54
- |---|---|---|
55
- | `vace_video` | RGB video, 121 frames at 720 × 1280 | Original video containing text to replace |
56
- | `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: 1 = text region to replace, 0 = preserve |
57
- | `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the target text placed where the mask is |
58
- | `prompt` | text string | The target text itself, e.g. `HILTON` |
59
 
60
- ## Installation
61
 
62
  ```bash
63
  git lfs install
64
- git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
65
- cd ViTeX-14B
66
- conda create -n vitex python=3.12 -y
67
- conda activate vitex
68
  pip install -r requirements.txt
69
- ```
70
-
71
- Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB VRAM at 720 × 1280 × 121 frames.
72
 
73
- ## Usage
74
-
75
- ```bash
76
  python inference_example.py \
77
  --vace_video path/to/source.mp4 \
78
  --vace_mask path/to/mask.mp4 \
@@ -81,49 +68,36 @@ python inference_example.py \
81
  --output out.mp4
82
  ```
83
 
84
- The script automatically uses the bundled `base_model/` and `vitex_14b.safetensors` — no extra downloads.
85
-
86
  ## Locality-preserving variant: ViTeX-14B (Composite)
87
 
88
- `make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that composes ViTeX-14B's predicted text region back onto the source video. Two per-frame operations:
89
-
90
- 1. **Reinhard mean–variance LAB color matching** on a 20-px band just outside the mask, so the predicted glyphs match the source's local lighting.
91
- 2. **Signed-distance feathered alpha compositing** (4-px feather centered on the mask boundary), so the seam is smooth.
92
-
93
- Inside the mask the result is the predicted glyphs (color-matched); outside the feather the result is byte-identical to the source. SeqAcc / CharAcc are within ~0.01 of raw ViTeX-14B (the predicted text region is unchanged), but PSNR / SSIM / LPIPS / DreamSim jump to near-Identity because the unedited region no longer pays the VAE round-trip penalty.
94
 
95
  ```bash
96
- # Assumes you already have raw ViTeX-14B predictions in <pred_dir>/*.mp4
97
- # and the eval split of ViTeX-Dataset under <data_root> (eval/original_videos/, eval/masks/).
98
  python make_corp_baseline.py \
99
  --records <data_root>/parsed_records.json \
100
  --data_root <data_root> \
101
  --pred_dir <raw_vitex14b_predictions_dir> \
102
- --out_dir <output_dir_for_corp_baseline> \
103
  --workers 8
104
  ```
105
 
106
- CPU-only, runs in ~5 minutes on 8 workers for the 157-clip ViTeX-Bench evaluation split. Requires `ffmpeg` on `PATH`.
107
 
108
- Reference: appendix G of the ViTeX-Bench paper.
 
 
109
 
110
- ## Limitations
111
 
112
- - Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
113
- - Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
114
- - Inference requires the full 14 B base; no quantized variant released.
115
 
116
  ## Citation
117
 
118
  ```bibtex
119
  @misc{vitex2026,
120
- title = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
121
  author = {Anonymous},
122
  year = {2026},
123
- url = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
124
  }
125
  ```
126
-
127
- ## License
128
-
129
- Apache-2.0. See `base_model/LICENSE.txt` for the upstream base model license.
 
8
  - diffusion
9
  ---
10
 
11
+ # ViTeX-14B (Model & Inference code)
12
 
13
  🌐 [Project page](https://vitex-bench.github.io/) &nbsp;·&nbsp;
14
  📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) &nbsp;·&nbsp;
15
  🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) &nbsp;·&nbsp;
16
+ 🤖 Model & Inference code &nbsp;·&nbsp;
17
  🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)
18
 
19
+ Open reference model for **video scene text editing**. Augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, fine-tuned on the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) 230-clip training split. Replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
20
 
21
+ > Anonymous release under double-blind review at NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.
22
 
23
  ## Specs
24
 
25
  | | |
26
  |---|---|
27
  | Trainable parameters | 4.02 B (VACE blocks + new modules) |
28
+ | New modules | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
29
  | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
30
+ | Resolution / frames / fps | 1280 × 720 / 121 / 24 |
31
+ | Hardware | 1 × NVIDIA H100 / A100 80 GB (~70 GB VRAM) |
 
 
32
 
33
+ ## Repository
34
 
35
  ```
36
  .
 
 
37
  ├── inference_example.py run ViTeX-14B on one (video, mask, glyph) tuple
38
+ ├── make_corp_baseline.py build the ViTeX-14B (Composite) variant
39
+ ├── vitex_14b.safetensors (8 GB, trained adapter weights)
40
+ ├── diffsynth/ bundled inference library
41
+ └── base_model/ (70 GB, frozen DiT + T5-XXL + Wan VAE)
 
 
 
 
42
  ```
43
 
44
+ Self-contained: no extra clones or downloads needed.
45
+
46
  ## Inputs
47
 
48
+ | Input | Format |
49
+ |---|---|
50
+ | `vace_video` | RGB, 720 × 1280, 121 frames; source video |
51
+ | `vace_mask` | grayscale, same shape; `1 = text region to replace` |
52
+ | `glyph_video` | RGB, same shape; pre-rendered target-text glyphs warped along source motion |
53
+ | `prompt` | text string; the target text |
54
 
55
+ ## Usage
56
 
57
  ```bash
58
  git lfs install
59
+ git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B && cd ViTeX-14B
60
+ conda create -n vitex python=3.12 -y && conda activate vitex
 
 
61
  pip install -r requirements.txt
 
 
 
62
 
 
 
 
63
  python inference_example.py \
64
  --vace_video path/to/source.mp4 \
65
  --vace_mask path/to/mask.mp4 \
 
68
  --output out.mp4
69
  ```
70
 
 
 
71
  ## Locality-preserving variant: ViTeX-14B (Composite)
72
 
73
+ `make_corp_baseline.py` is a deterministic, training-free post-processing wrapper. Two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting; (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the predicted glyphs (color-matched); outside the feather it is byte-identical to the source. Locality metrics rise to near-Identity while SeqAcc / CharAcc move within ~0.01 of raw ViTeX-14B.
 
 
 
 
 
74
 
75
  ```bash
 
 
76
  python make_corp_baseline.py \
77
  --records <data_root>/parsed_records.json \
78
  --data_root <data_root> \
79
  --pred_dir <raw_vitex14b_predictions_dir> \
80
+ --out_dir <output_dir_for_composite_baseline> \
81
  --workers 8
82
  ```
83
 
84
+ ## Limitations
85
 
86
+ * Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
87
+ * Best on planar text; fast-moving or highly distorted text may degrade.
88
+ * Inference requires the full 14 B base; no quantized variant released.
89
 
90
+ ## License
91
 
92
+ Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.
 
 
93
 
94
  ## Citation
95
 
96
  ```bibtex
97
  @misc{vitex2026,
98
+ title = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
99
  author = {Anonymous},
100
  year = {2026},
101
+ note = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
102
  }
103
  ```
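
The two per-frame operations that the README attributes to `make_corp_baseline.py` (Reinhard mean–variance color matching, then signed-distance feathered alpha compositing) can be illustrated with a small NumPy sketch. This is not the script's actual code. It assumes frames are already converted to LAB float arrays, that the color statistics are measured on a band just outside the mask (the pre-change README mentions a 20-px band and a 4-px feather centered on the mask boundary), and all function names here are hypothetical. The brute-force distance computation is only practical for tiny demo frames; a real pipeline would presumably use a proper distance transform.

```python
import numpy as np


def signed_distance(mask):
    """Brute-force signed distance in pixels: positive inside the mask, negative outside.

    O(N^2) in the pixel count, so only suitable for tiny demo frames.
    Assumes the mask contains both inside and outside pixels.
    """
    ys, xs = np.indices(mask.shape)
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    inside = mask.ravel().astype(bool)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    to_outside = dist[:, ~inside].min(axis=1)
    to_inside = dist[:, inside].min(axis=1)
    # Shift by half a pixel so the zero level sits between inside and outside pixels.
    sd = np.where(inside, to_outside - 0.5, 0.5 - to_inside)
    return sd.reshape(mask.shape)


def match_mean_std(pred_lab, src_lab, band):
    """Reinhard-style transfer: align per-channel mean/std of pred to src, measured on `band`."""
    out = np.empty_like(pred_lab, dtype=np.float64)
    for c in range(pred_lab.shape[-1]):
        p, s = pred_lab[..., c], src_lab[..., c]
        p_mu, p_sd = p[band].mean(), p[band].std()
        s_mu, s_sd = s[band].mean(), s[band].std()
        out[..., c] = (p - p_mu) * (s_sd / (p_sd + 1e-6)) + s_mu
    return out


def composite_frame(pred_lab, src_lab, mask, band_px=20, feather_px=4):
    """Color-match the prediction, then blend it onto the source with a feathered alpha."""
    sd = signed_distance(mask)
    band = (sd < 0) & (sd >= -band_px)          # ring just outside the mask
    matched = match_mean_std(pred_lab, src_lab, band)
    # Alpha ramps linearly from 0 to 1 across a feather centered on the mask boundary.
    alpha = np.clip(0.5 + sd / feather_px, 0.0, 1.0)[..., None]
    out = alpha * matched + (1.0 - alpha) * src_lab
    # Copy source pixels verbatim wherever alpha is exactly zero.
    return np.where(alpha == 0.0, src_lab, out)
```

With this construction, pixels outside the feather are copied from the source unchanged, which is the property the README credits for the near-Identity locality metrics.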