## Reproducibility
This document specifies the exact environment, hardware, and runtime in which the headline numbers in `outputs/summary.tsv` were produced.
### Environments
ViTeX-Bench uses two conda environments because PaddleOCR and PyTorch + pyiqa do not co-install cleanly.
#### `paddleocr` (Stage 1 — OCR, CPU)

```bash
conda create -n paddleocr python=3.12 -y
conda activate paddleocr
pip install paddleocr opencv-python
```
PaddleOCR pulls in PaddlePaddle 3.x. The pipeline forces CPU execution (`CUDA_VISIBLE_DEVICES=""`) because PaddlePaddle 3.2 does not yet support Blackwell GPUs (RTX 5090, compute capability 12.0). On older NVIDIA generations (Ampere / Hopper) you can switch to GPU OCR by passing `--device gpu` to `benchmark/ocr_extract.py`; in that case, also set `--workers 1`.
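For reference, a minimal sketch of the CPU-forcing trick (the real handling lives in `benchmark/ocr_extract.py`; the file name and the exact OCR call are illustrative, and PaddleOCR method names vary slightly across releases). The key point is that PaddlePaddle fixes its visible-device list at import time, so the variable must be cleared before the first paddle import:

```python
# Minimal sketch: hide all GPUs from PaddlePaddle before it is imported,
# since the device list is frozen at import time.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must precede the paddle import

from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")            # one instance per language family
result = ocr.ocr("frame_0001.png")    # runs on CPU even with CUDA installed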
#### `vitex-bench` (Stage 2 — GPU metrics)

```bash
conda create -n vitex-bench python=3.12 -y
conda activate vitex-bench
pip install -r requirements.txt
```
A CUDA-enabled PyTorch wheel matching your local CUDA toolkit is required. `lpips` downloads its AlexNet checkpoint on first use; `pyiqa` downloads the MUSIQ KonIQ weights into `~/.cache/torch/hub/pyiqa/`.
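As a quick smoke test of the environment (assuming only the packages above; the input sizes are arbitrary), instantiating both metrics triggers the one-time weight downloads just described:

```python
# Environment smoke test: first run fetches the LPIPS AlexNet checkpoint
# and the MUSIQ (KonIQ) weights, then scores a pair of random frames.
import torch
import lpips
import pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
lpips_fn = lpips.LPIPS(net="alex").to(device)
musiq_fn = pyiqa.create_metric("musiq", device=device)

x = torch.rand(1, 3, 224, 224, device=device)  # dummy frames in [0, 1]
y = torch.rand(1, 3, 224, 224, device=device)
print("LPIPS:", float(lpips_fn(x * 2 - 1, y * 2 - 1)))  # LPIPS expects [-1, 1]
print("MUSIQ:", float(musiq_fn(x)))
```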
### Hardware
The headline numbers were produced on:
| component | spec |
|---|---|
| CPU | 32-core x86_64 (Linux) |
| GPU | NVIDIA RTX 5090, 32 GB |
| RAM | 32 GB DDR5 |
| swap | 32 GB (incidentally present from earlier workloads, not needed by this pipeline) |
The pipeline does not require an RTX 5090 specifically: any CUDA-12-capable GPU with $\ge 16$ GB of VRAM is sufficient (LPIPS + MUSIQ + locality compose peak at $\le 6$ GB on the largest batch, 120 frames at 1280×720). Even on a single H100 80 GB the GPU stage is bottlenecked on CPU-side video decoding rather than GPU compute, so a smaller card sees similar wall time.
The CPU stage scales roughly linearly with the number of `--workers`; we use 8 workers on a 32-core box. Each worker holds one PaddleOCR instance per language family in its task list; peak per-worker resident memory is $\sim 1.5$ GB (see the sketch below).
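Illustratively, the per-worker footprint follows from a caching pattern like this sketch (assumed, not the actual `benchmark/ocr_extract.py` code; `process_frame`, `run_ocr`, and the task tuple are hypothetical names):

```python
# Illustrative worker-pool pattern: each process lazily builds and caches one
# PaddleOCR instance per language family it encounters, which is where the
# ~1.5 GB of per-worker resident memory goes.
from concurrent.futures import ProcessPoolExecutor

_OCR_BY_LANG = {}  # per-process cache: language family -> PaddleOCR instance

def _get_ocr(lang):
    if lang not in _OCR_BY_LANG:
        from paddleocr import PaddleOCR
        _OCR_BY_LANG[lang] = PaddleOCR(lang=lang)
    return _OCR_BY_LANG[lang]

def process_frame(task):
    frame_path, lang = task  # hypothetical task tuple
    return frame_path, _get_ocr(lang).ocr(frame_path)

def run_ocr(tasks, workers=8):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_frame, tasks))
```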
### Runtime
End-to-end wall-clock time for the full nine-baseline grid plus the identity sanity check (10 baselines, 157 clips, with the source-video OCR cached and reused across baselines):
| stage | time |
|---|---|
| identity OCR + source cache build (157 clips × ($V$ + $\hat V$)) | ~170 min |
| identity GPU evaluate | ~28 min |
| each subsequent baseline OCR ($\hat V$ only, source cached) | ~85 min |
| each subsequent baseline GPU evaluate | ~28 min |
| total (with CPU↔GPU pipelining) | ~16 hours |
The CPU stage and the GPU stage overlap in the master script (`scripts/run_benchmark.sh`): while baseline N is on the GPU, baseline N+1 is on the CPU. The ~16-hour total is therefore $\approx 170 + 9 \times 85 + 28 = 963$ min: the source-OCR cost paid once, plus the per-baseline CPU OCR of every prediction, plus the final GPU evaluate (every earlier GPU evaluate hides under the next baseline's OCR).
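The overlap can be sketched as follows (the `--baseline` flags and the baseline list are hypothetical; the real orchestration lives in `scripts/run_benchmark.sh`):

```python
# Sketch of the CPU/GPU pipelining: baseline N's GPU evaluate runs in the
# background while baseline N+1's CPU OCR occupies the foreground.
import subprocess

baselines = ["identity", "method_a", "method_b"]  # hypothetical grid
gpu_job = None
for name in baselines:
    # CPU stage (blocking): OCR this baseline's prediction videos.
    subprocess.run(["python", "benchmark/ocr_extract.py", "--baseline", name],
                   check=True)
    # The previous baseline's GPU stage must finish before the next starts.
    if gpu_job is not None:
        gpu_job.wait()
    gpu_job = subprocess.Popen(["python", "benchmark/evaluate.py",
                                "--baseline", name])
if gpu_job is not None:
    gpu_job.wait()  # the final ~28 min GPU evaluate is the only non-overlapped one
```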
### Determinism
PP-OCRv5 inference is deterministic given a fixed input image and language code. MUSIQ, LPIPS, SSIM, and PSNR are deterministic GPU computations under default PyTorch settings.
The bootstrap CI uses a Python RNG seeded as `random.Random(0)`, so re-running `benchmark/evaluate.py` against the same `eval.json` yields identical CIs.
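For concreteness, a minimal seeded percentile-bootstrap sketch consistent with that claim (illustrative; the actual resampling code lives in `benchmark/evaluate.py`):

```python
# Seeded percentile bootstrap: the fixed seed makes the CI bit-identical
# across re-runs on the same inputs.
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return (means[int(n_boot * alpha / 2)],
            means[int(n_boot * (1 - alpha / 2)) - 1])
```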
The video-decoder backend (cv2.VideoCapture) decodes H.264 deterministically across runs on the same machine; we have not verified bit-level reproducibility across libavcodec versions.
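A quick way to spot-check decode stability on your own machine is to hash every decoded frame and compare two runs (sketch; the clip path is illustrative):

```python
# Hash all decoded frames of a clip; identical digests across two runs
# means the decoder is deterministic on this machine.
import hashlib
import cv2

def decode_digest(path):
    cap = cv2.VideoCapture(path)
    digest = hashlib.sha256()
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        digest.update(frame.tobytes())
    cap.release()
    return digest.hexdigest()

assert decode_digest("clip.mp4") == decode_digest("clip.mp4")
```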
### Step-by-step
```bash
# 1. Layout
data/inference/parsed_records.json
data/inference/original_videos/<id>.mp4
data/inference/text_masks/<id>_<ts>_<hash>.mp4
baseline_output_videos/<method>/<id>.mp4

# 2. Identity sanity (also bootstraps the source-OCR cache)
python benchmark/make_identity_baseline.py \
    --records data/inference/parsed_records.json \
    --src_dir data/inference/original_videos \
    --out_dir baseline_output_videos/identity

# 3. Run the full grid (10 baselines, ~16 hours)
bash scripts/run_benchmark.sh

# 4. Inspect aggregates
column -t -s $'\t' outputs/summary.tsv

# 5. Per-clip JSON for the analysis you want
python -c "
import json
d = json.load(open('outputs/ViTeX-14B/eval.json'))
for k, v in sorted(d['per_clip'].items())[:5]:
    print(k, v['SeqAcc'], v['CharAcc'], v['TTS'])
"
```
### What if my numbers don't match?
Differences within about a percentage point are expected because:
- Mask file timestamps in `parsed_records.json` change between dataset re-builds; the binary mask itself is stable.
- Some Family A baselines (especially those distributed without exact training-data hashes) may produce slightly different fine-tuned outputs across hardware.
- PP-OCRv5 model weights are versioned; the `_PADDLEX_VERSION` in your install determines the exact recognizer.
If your numbers diverge by more than ±1 % on aggregate metrics, the most common causes are (i) prediction videos at the wrong spatial resolution (the pipeline auto-resamples, but Lanczos vs. bilinear interpolation can differ slightly) and (ii) mask-binarization threshold mismatches when masks are stored as PNG sequences instead of MP4; convert them to MP4 before invoking the pipeline.
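For the PNG-sequence case, a minimal conversion sketch (the frame glob, fps, and threshold are assumptions; match them to your dataset layout):

```python
# Pack a PNG mask sequence into an MP4 with an explicit binarization step,
# so the pipeline sees the same threshold you intended.
import glob
import cv2

paths = sorted(glob.glob("masks/*.png"))  # hypothetical layout
h, w = cv2.imread(paths[0], cv2.IMREAD_GRAYSCALE).shape
writer = cv2.VideoWriter("mask.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (w, h))
for p in paths:
    mask = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    writer.write(cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR))
writer.release()
```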
For substantive deviations, please open an issue with your `outputs/<baseline>/log.txt` and a short diff of `summary.tsv`.