# Preprocessing Specification
## Image (visual.onnx)
- **Input shape:** `[N, 3, 336, 336]` (NCHW, batch first)
- **Input dtype:** float32
- **Channel order:** RGB
- **Resolution:** 336×336 (center crop, or resize to fill the frame without distorting the aspect ratio)
- **Normalization:** per-channel `(pixel / 255 - mean) / std`

| Channel | mean | std |
|---------|------|-----|
| R | 0.48145466 | 0.26862954 |
| G | 0.4578275 | 0.26130258 |
| B | 0.40821073 | 0.27577711 |
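
The image steps above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes each image arrives as an RGB `uint8` array in height × width × channel order that is already at least 336 pixels on each side, and it uses the center-crop option. The helper names `center_crop` and `to_model_input` are hypothetical, and any resizing is assumed to happen beforehand.

```python
import numpy as np

# Per-channel statistics from the table above, in R, G, B order.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
SIZE = 336


def center_crop(image: np.ndarray, size: int = SIZE) -> np.ndarray:
    """Cut the central size x size window out of an HWC image."""
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("resize the image to at least size x size first")
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def to_model_input(images) -> np.ndarray:
    """Build one [N, 3, 336, 336] float32 batch from HWC uint8 RGB arrays."""
    batch = []
    for image in images:
        scaled = center_crop(image).astype(np.float32) / 255.0
        normalized = (scaled - MEAN) / STD            # broadcasts over the channel axis
        batch.append(normalized.transpose(2, 0, 1))   # HWC -> CHW
    return np.stack(batch).astype(np.float32)
```

Calling `to_model_input([img_a, img_b])` would yield an array of shape `(2, 3, 336, 336)` that matches the input shape and dtype listed above.
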
## Text (textual.onnx)
- **Input shape:** `[N, 77]`
- **Input dtype:** int64
- **Lowercase:** yes
- **Sequence:** `[BOS] + token_ids + [EOS]`, pad with 0 to length 77
- **Special IDs:** pad=0, unk=1, bos=2, eos=3
- **Tokenizer:** `tokenizer.json` or `bpe.model` (YouTokenToMe)
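
The sequence layout above can be sketched in plain Python. This is only an illustration: `build_sequence` is a hypothetical helper, `token_ids` is assumed to come from running the tokenizer on already lowercased text, and truncating over-long inputs so that BOS and EOS still fit is an assumption, since the specification only describes padding.

```python
MAX_LEN = 77
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3


def build_sequence(token_ids, max_len=MAX_LEN):
    """Wrap token IDs as [BOS] + ids + [EOS] and right-pad with PAD_ID."""
    # Assumption (not in the spec): truncate so BOS and EOS always fit.
    body = list(token_ids)[: max_len - 2]
    sequence = [BOS_ID] + body + [EOS_ID]
    return sequence + [PAD_ID] * (max_len - len(sequence))
```

Stacking `N` such rows into an int64 array would give the `[N, 77]` input described above.
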