Slim model card to the unified suite template

Match the Dataset and Benchmark code READMEs: same five-link header
bar, anonymous-release blockquote, then Specs / Repository / Inputs /
Usage / Composite variant / Limitations / License / Citation. Drop
the redundant "self-contained" paragraph (already implied by the
repository tree) and the inline training-stage details (those live in
the paper); keep every command needed to run inference end-to-end.

Files changed (1)

README.md +33 -59

@@ -8,71 +8,58 @@ tags:
   - diffusion
 ---
-# ViTeX-14B
 🌐 [Project page](https://vitex-bench.github.io/) &nbsp;·&nbsp;
 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) &nbsp;·&nbsp;
 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) &nbsp;·&nbsp;
 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)
-ViTeX is a video text editing model. It replaces text content inside a user-provided mask region of a video while preserving the original visual style (font, color, stroke, shadow, perspective) and the surrounding scene.
-This repository is fully self-contained — it bundles the trained weights, the full base model required for inference, and all custom code. No external code repositories or third-party model downloads are required.
 ## Specs
 |  |  |
 |---|---|
 | Trainable parameters | 4.02 B (VACE blocks + new modules) |
-| New modules added | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
 | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
-| Resolution | 720 × 1280 |
-| Frames | 121 (about 5 s at 24 fps) |
-| Training | Stage 1: 5 epochs at 49 frames (22 h) ; Stage 2: 2 epochs at 121 frames (30 h) |
-| Hardware | 8 × NVIDIA H100 80 GB |
-## Repository contents
 ```
 .
-├── README.md
-├── requirements.txt
 ├── inference_example.py            run ViTeX-14B on one (video, mask, glyph) tuple
-├── make_corp_baseline.py           build the ViTeX-14B (Composite) variant from raw predictions
-├── vitex_14b.safetensors           (8 GB — trained adapter weights)
-├── diffsynth/                      (bundled inference library)
-└── base_model/                     (70 GB — frozen base model files)
-    ├── diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
-    ├── models_t5_umt5-xxl-enc-bf16.pth
-    ├── Wan2.1_VAE.pth
-    └── google/umt5-xxl/             (T5 tokenizer)
 ```
 ## Inputs
-| Input | Format | Description |
-|---|---|---|
-| `vace_video` | RGB video, 121 frames at 720 × 1280 | Original video containing text to replace |
-| `vace_video_mask` | grayscale video, same shape | Per-frame binary mask: 1 = text region to replace, 0 = preserve |
-| `glyph_video` | RGB video, same shape | Pre-rendered glyphs of the target text placed where the mask is |
-| `prompt` | text string | The target text itself, e.g. `HILTON` |
-## Installation
 ```bash
 git lfs install
-git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B
-cd ViTeX-14B
-conda create -n vitex python=3.12 -y
-conda activate vitex
 pip install -r requirements.txt
-```
-Hardware: 1 × NVIDIA GPU with 80 GB VRAM (H100 / A100 80 GB). Inference uses about 70 GB VRAM at 720 × 1280 × 121 frames.
-## Usage
-```bash
 python inference_example.py \
     --vace_video   path/to/source.mp4 \
     --vace_mask    path/to/mask.mp4 \
@@ -81,49 +68,36 @@ python inference_example.py \
     --output       out.mp4
 ```
-The script automatically uses the bundled `base_model/` and `vitex_14b.safetensors` — no extra downloads.
 ## Locality-preserving variant: ViTeX-14B (Composite)
-`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper that composes ViTeX-14B's predicted text region back onto the source video. Two per-frame operations:
-1. **Reinhard mean–variance LAB color matching** on a 20-px band just outside the mask, so the predicted glyphs match the source's local lighting.
-2. **Signed-distance feathered alpha compositing** (4-px feather centered on the mask boundary), so the seam is smooth.
-Inside the mask the result is the predicted glyphs (color-matched); outside the feather the result is byte-identical to the source. SeqAcc / CharAcc are within ~0.01 of raw ViTeX-14B (the predicted text region is unchanged), but PSNR / SSIM / LPIPS / DreamSim jump to near-Identity because the unedited region no longer pays the VAE round-trip penalty.
 ```bash
-# Assumes you already have raw ViTeX-14B predictions in <pred_dir>/*.mp4
-# and the eval split of ViTeX-Dataset under <data_root> (eval/original_videos/, eval/masks/).
 python make_corp_baseline.py \
     --records   <data_root>/parsed_records.json \
     --data_root <data_root> \
     --pred_dir  <raw_vitex14b_predictions_dir> \
-    --out_dir   <output_dir_for_corp_baseline> \
     --workers   8
 ```
-CPU-only, runs in ~5 minutes on 8 workers for the 157-clip ViTeX-Bench evaluation split. Requires `ffmpeg` on `PATH`.
-Reference: appendix G of the ViTeX-Bench paper.
-## Limitations
-- Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
-- Best on planar text (signs, posters); fast-moving or highly distorted text may degrade.
-- Inference requires the full 14 B base; no quantized variant released.
 ## Citation
 ```bibtex
 @misc{vitex2026,
-  title  = {ViTeX-14B: Visual Text Editing in Video via Style-Preserving Glyph Conditioning},
   author = {Anonymous},
   year   = {2026},
-  url    = {https://huggingface.co/ViTeX-Bench/ViTeX-14B},
 }
 ```
-## License
-Apache-2.0. See `base_model/LICENSE.txt` for the upstream base model license.

   - diffusion
 ---
+# ViTeX-14B (Model & Inference code)
 🌐 [Project page](https://vitex-bench.github.io/) &nbsp;·&nbsp;
 📊 [Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) &nbsp;·&nbsp;
 🧪 [Benchmark code](https://huggingface.co/ViTeX-Bench/ViTeX-Bench) &nbsp;·&nbsp;
+🤖 Model & Inference code &nbsp;·&nbsp;
 🏆 [Leaderboard](https://huggingface.co/spaces/ViTeX-Bench/ViTeX-Bench-Leaderboard)
+Open reference model for **video scene text editing**. Augments Wan2.1-VACE-14B with a glyph-video conditioning pathway that supplies temporally aligned target-text structure to the editing backbone, fine-tuned on the [ViTeX-Dataset](https://huggingface.co/datasets/ViTeX-Bench/ViTeX-Dataset) 230-clip training split. Replaces the masked scene text in a video while preserving font, color, stroke, shadow, perspective, and the surrounding scene.
+> Anonymous release under double-blind review at NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.
 ## Specs
 |  |  |
 |---|---|
 | Trainable parameters | 4.02 B (VACE blocks + new modules) |
+| New modules | 971 M (GlyphEncoder + 8 × ConditionCrossAttention) |
 | Total inference params | ~24 B (DiT 18.3 B + T5-XXL 5.7 B + Wan VAE 0.13 B) |
+| Resolution / frames / fps | 1280 × 720 / 121 / 24 |
+| Hardware | 1 × NVIDIA H100 / A100 80 GB (~70 GB VRAM) |
+## Repository
 ```
 .
 ├── inference_example.py            run ViTeX-14B on one (video, mask, glyph) tuple
+├── make_corp_baseline.py           build the ViTeX-14B (Composite) variant
+├── vitex_14b.safetensors           (8 GB, trained adapter weights)
+├── diffsynth/                      bundled inference library
+└── base_model/                     (70 GB, frozen DiT + T5-XXL + Wan VAE)
 ```
+Self-contained: no extra clones or downloads needed.
 ## Inputs
+| Input | Format |
+|---|---|
+| `vace_video`  | RGB, 720 × 1280, 121 frames — source video |
+| `vace_mask`   | grayscale, same shape — `1 = text region to replace` |
+| `glyph_video` | RGB, same shape — pre-rendered target-text glyphs warped along source motion |
+| `prompt`      | text string — the target text |
+## Usage
 ```bash
 git lfs install
+git clone https://huggingface.co/ViTeX-Bench/ViTeX-14B && cd ViTeX-14B
+conda create -n vitex python=3.12 -y && conda activate vitex
 pip install -r requirements.txt
 python inference_example.py \
     --vace_video   path/to/source.mp4 \
     --vace_mask    path/to/mask.mp4 \
     --output       out.mp4
 ```
 ## Locality-preserving variant: ViTeX-14B (Composite)
+`make_corp_baseline.py` is a deterministic, training-free post-processing wrapper. Two per-frame operations: (1) Reinhard mean–variance LAB color matching against the source's local lighting; (2) signed-distance feathered alpha compositing onto the source. Inside the mask the result is the predicted glyphs (color-matched); outside the feather it is byte-identical to the source. Locality metrics rise to near-Identity while SeqAcc / CharAcc move within ~0.01 of raw ViTeX-14B.
 ```bash
 python make_corp_baseline.py \
     --records   <data_root>/parsed_records.json \
     --data_root <data_root> \
     --pred_dir  <raw_vitex14b_predictions_dir> \
+    --out_dir   <output_dir_for_composite_baseline> \
     --workers   8
 ```
+## Limitations
+* Trained on 230 samples; coverage of artistic fonts, complex backgrounds, and non-Latin scripts is limited.
+* Best on planar text; fast-moving or highly distorted text may degrade.
+* Inference requires the full 14 B base; no quantized variant released.
+## License
+Apache-2.0 (this code and adapter weights). See `base_model/LICENSE.txt` for the upstream base-model license.
 ## Citation
 ```bibtex
 @misc{vitex2026,
+  title  = {ViTeX-Bench: Benchmarking High Fidelity Video Scene Text Editing},
   author = {Anonymous},
   year   = {2026},
+  note   = {Submitted to NeurIPS 2026 Datasets and Benchmarks Track. Author list and DOI updated after deanonymization.},
 }
 ```
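
The two per-frame operations the README attributes to `make_corp_baseline.py` (mean-variance color matching, then feathered alpha compositing driven by a signed distance) can be illustrated with a minimal one-dimensional sketch. This is not the script's code: the function names, the single-channel list inputs, the linear alpha ramp, and the convention that signed distance is negative inside the mask are all assumptions made for illustration. The real script works on video frames in LAB space and its exact formulas may differ.

```python
from statistics import fmean, pstdev


def match_mean_std(pred, pred_ref, src_ref, eps=1e-6):
    """Reinhard-style mean-variance transfer for one channel (hypothetical helper).

    Statistics come from reference samples (for example, a band just
    outside the mask) and the resulting shift and scale are applied to `pred`.
    """
    mu_p, sd_p = fmean(pred_ref), pstdev(pred_ref)
    mu_s, sd_s = fmean(src_ref), pstdev(src_ref)
    # Avoid dividing by ~0 when the prediction reference is flat.
    scale = sd_s / sd_p if sd_p > eps else 1.0
    return [(v - mu_p) * scale + mu_s for v in pred]


def feather_alpha(signed_dist, feather=4.0):
    """Linear ramp centered on the mask boundary (assumed shape).

    signed_dist < 0 inside the mask, > 0 outside. Alpha is 1 at
    -feather/2 and deeper inside, 0 at +feather/2 and farther outside.
    """
    return min(1.0, max(0.0, 0.5 - signed_dist / feather))


def composite(src, pred, signed_dist, feather=4.0):
    """Blend prediction over source using the feathered alpha."""
    out = []
    for s, p, d in zip(src, pred, signed_dist):
        a = feather_alpha(d, feather)
        out.append(a * p + (1.0 - a) * s)
    return out


if __name__ == "__main__":
    matched = match_mean_std([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
    print(matched)
    print(composite([10.0] * 3, [50.0] * 3, [-5.0, 0.0, 5.0]))
```

With alpha exactly 0 outside the feather, the blend returns the source value unchanged, which is consistent with the README's statement that pixels outside the feather stay identical to the source.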