# Preprocessing Specification

## Image (visual.onnx)

- **Input shape:** `[N, 3, 336, 336]` (NCHW, batch first)
- **Input dtype:** float32
- **Channel order:** RGB
- **Resolution:** 336×336; fill the frame without distorting the aspect ratio (resize, then center crop as needed)
- **Normalization:** per-channel `(pixel / 255 - mean) / std`, with `pixel` as an 8-bit value in [0, 255]

| Channel | mean | std |
|---------|------|-----|
| R | 0.48145466 | 0.26862954 |
| G | 0.4578275 | 0.26130258 |
| B | 0.40821073 | 0.27577711 |
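
The image steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `center_crop` and `preprocess_image` are hypothetical, and the sketch assumes the input is already an HWC `uint8` RGB array whose sides are at least 336 pixels (aspect-preserving resizing is left to whatever image library you use).

```python
import numpy as np

# Per-channel statistics from the table above, in R, G, B order.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
SIZE = 336


def center_crop(image: np.ndarray, size: int = SIZE) -> np.ndarray:
    """Return the central size x size window of an HWC image."""
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("image is smaller than the crop; resize it first")
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def preprocess_image(image: np.ndarray) -> np.ndarray:
    """Convert an HWC uint8 RGB image to a [1, 3, 336, 336] float32 array."""
    cropped = center_crop(image)
    scaled = cropped.astype(np.float32) / 255.0
    # MEAN and STD broadcast over the trailing channel axis.
    normalized = (scaled - MEAN) / STD
    # HWC -> CHW, then add the batch dimension to get NCHW.
    chw = np.transpose(normalized, (2, 0, 1))
    return np.ascontiguousarray(chw[np.newaxis, ...], dtype=np.float32)
```

To build a batch, concatenate several such arrays along axis 0 with `np.concatenate` before passing them to the model.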

## Text (textual.onnx)

- **Input shape:** `[N, 77]`
- **Input dtype:** int64
- **Lowercase:** yes (lowercase the text before tokenizing)
- **Sequence:** `[BOS] + token_ids + [EOS]`, pad with 0 to length 77
- **Special IDs:** pad=0, unk=1, bos=2, eos=3
- **Tokenizer:** `tokenizer.json` or `bpe.model` (YouTokenToMe)
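
The sequence layout above can be sketched in plain Python. The function name `build_sequence` is hypothetical, and the truncation rule is an assumption: the specification does not say how overlong inputs are handled, so this sketch keeps the first 75 token IDs so that BOS and EOS still fit.

```python
from typing import List

# Special IDs and sequence length from the specification above.
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3
CONTEXT_LENGTH = 77


def build_sequence(token_ids: List[int], length: int = CONTEXT_LENGTH) -> List[int]:
    """Wrap token IDs in BOS/EOS and right-pad with PAD_ID to a fixed length."""
    # Assumption: overlong inputs are truncated, leaving room for BOS and EOS.
    body = list(token_ids)[: length - 2]
    sequence = [BOS_ID] + body + [EOS_ID]
    return sequence + [PAD_ID] * (length - len(sequence))
```

The token IDs are expected to come from the tokenizer applied to lowercased text (for example `text.lower()`). Before inference, stack the resulting rows into an int64 array of shape `[N, 77]`.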