# Preprocessing Specification

## Image (visual.onnx)

- **Input shape:** `[N, 3, 336, 336]` (NCHW, batch first)
- **Input dtype:** float32
- **Channel order:** RGB
- **Resolution:** 336×336 (center crop, or resize without distortion so the image fills the frame)
- **Normalization:** per-channel `(pixel / 255 - mean) / std`

| Channel | mean       | std        |
|---------|------------|------------|
| R       | 0.48145466 | 0.26862954 |
| G       | 0.4578275  | 0.26130258 |
| B       | 0.40821073 | 0.27577711 |

## Text (textual.onnx)

- **Input shape:** `[N, 77]`
- **Input dtype:** int64
- **Lowercase:** yes
- **Sequence:** `[BOS] + token_ids + [EOS]`, padded with 0 to length 77
- **Special IDs:** pad=0, unk=1, bos=2, eos=3
- **Tokenizer:** `tokenizer.json` or `bpe.model` (YouTokenToMe)
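
The normalization formula and the text sequence layout above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the reference implementation: the function names `normalize_image` and `pack_token_ids` are hypothetical, the image is assumed to be already cropped or resized to 336×336, and the token ids are assumed to come from lowercased text run through the tokenizer. Truncating over-long inputs is an assumption of this sketch, since the specification does not say how to handle them.

```python
# Constants copied from the specification above.
MEAN = (0.48145466, 0.4578275, 0.40821073)  # R, G, B
STD = (0.26862954, 0.26130258, 0.27577711)  # R, G, B
CONTEXT_LENGTH = 77
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3


def normalize_image(pixels_hwc):
    """Turn an H x W x 3 nested list of 0-255 RGB values into a
    3 x H x W nested list of floats using (pixel / 255 - mean) / std."""
    height, width = len(pixels_hwc), len(pixels_hwc[0])
    return [
        [
            [(pixels_hwc[y][x][c] / 255 - MEAN[c]) / STD[c] for x in range(width)]
            for y in range(height)
        ]
        for c in range(3)  # channel-first output, in R, G, B order
    ]


def pack_token_ids(token_ids):
    """Build [BOS] + token_ids + [EOS] and pad with PAD_ID to length 77.

    ASSUMPTION: inputs longer than 75 ids are truncated so that BOS and
    EOS still fit; the specification does not define this case.
    """
    body = list(token_ids)[: CONTEXT_LENGTH - 2]
    sequence = [BOS_ID, *body, EOS_ID]
    return sequence + [PAD_ID] * (CONTEXT_LENGTH - len(sequence))


# Tiny demonstration: a 1 x 2 "image" (one white pixel, one black pixel)
# and three made-up token ids.
demo_image = normalize_image([[[255, 255, 255], [0, 0, 0]]])
demo_tokens = pack_token_ids([10, 11, 12])
```

A real pipeline would typically do the image step on a float32 array and stack `N` images into the `[N, 3, 336, 336]` batch, and would cast the packed ids to int64 before feeding `textual.onnx`.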