Commit bd847c9 (verified) by alangmartini · 1 Parent(s): abea3da

Attach reproducer + verifier scripts as proof

fashion-clip-vit-b-p32/_reproducer/README.md ADDED
@@ -0,0 +1,97 @@
+ # FashionCLIP ONNX export for Typesense
+
+ Builds an ONNX-format copy of [patrickjohncyh/fashion-clip](https://huggingface.co/patrickjohncyh/fashion-clip) packaged in the layout Typesense expects, and verifies it locally before uploading to [typesense/models-moved](https://huggingface.co/typesense/models-moved).
+
+ FashionCLIP is a CLIP-ViT-B/32 fine-tuned on Farfetch fashion images. It is a drop-in replacement for `ts/clip-vit-b-p32` when the corpus is fashion/apparel.
+
+ ## Why this exists
+
+ Typesense ships first-party support for OpenAI CLIP ViT-B/32. FashionCLIP uses the identical architecture but is fine-tuned on fashion data, which materially improves retrieval on apparel datasets. Hosting it on `typesense/models-moved` lets Cloud users select it with `model_name: "ts/fashion-clip-vit-b-p32"`.
+
+ Because the architecture matches CLIP ViT-B/32 exactly, the `clip_tokenizer.onnx`, `vocab.txt` (CLIP BPE merges), and `clip_image_processor.onnx` files are reused unchanged from `clip-vit-b-p32`. Only `model.onnx` is new, plus a regenerated `config.json` that carries the current MD5s.
+
+ ## Files in this directory
+
+ | File | Purpose |
+ |---|---|
+ | `verify_onnx.py` | Exports nothing; it runs the staged `model.onnx` and `clip_image_processor.onnx` through onnxruntime, compares the outputs against the HF transformers reference (cosine similarity must exceed 0.999), and validates Typesense's exact I/O contract (input/output tensor names, shapes, independence of text/image towers). |
+ | `export.py` | Re-runnable export script: downloads FashionCLIP, exports `model.onnx` via Optimum, pulls reusable artifacts from the existing Typesense CLIP repo, writes `config.json` with current MD5s. |
+ | `reproduce.sh` | Full local Docker test: mounts the staged model into a Typesense 29.0 container, creates a `fashion-items` collection with an image embedding field, indexes 5 fashion images, runs text-to-image and image-to-image searches. |
+ | `out-staging/fashion-clip-vit-b-p32/` | The exact 5 files to upload to HuggingFace under `typesense/models-moved/fashion-clip-vit-b-p32/`. |
+ | `fashion-clip-src/` | Cache of the upstream HF snapshot. Gitignored. |
+
+ ## What ships in the HF bundle
+
+ ```
+ fashion-clip-vit-b-p32/
+ ├── model.onnx                  # FashionCLIP weights (~578 MB)
+ ├── clip_tokenizer.onnx         # Reused from clip-vit-b-p32 (1.3 MB)
+ ├── clip_image_processor.onnx   # Reused from clip-vit-b-p32 (~4 KB)
+ ├── vocab.txt                   # CLIP BPE merges, reused (3.0 MB)
+ └── config.json                 # Typesense metadata + MD5 checksums
+ ```
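
The MD5 checksums recorded in `config.json` can be computed without loading a ~578 MB file into memory. The following is a minimal sketch that mirrors the chunked hashing at the end of `verify_onnx.py`; the helper name `md5_of_file` and the throwaway demo file are illustrative assumptions, not part of the committed scripts.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file (a stand-in for model.onnx).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_md5 = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_md5)
```

Pointing `md5_of_file` at each file in `out-staging/fashion-clip-vit-b-p32/` would give the values to compare against `config.json`; the exact key names inside that file are not shown in this commit, so they are left out here.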
+
+ The HF README at `typesense/models-moved/fashion-clip-vit-b-p32/README.md` should credit `patrickjohncyh/fashion-clip` upstream.
+
+ ## Verification status
+
+ Run from this directory:
+
+ ```powershell
+ # 1. ONNX runtime sanity (no Docker required)
+ .venv/Scripts/python.exe verify_onnx.py
+
+ # 2. Live Typesense Docker test (requires working Docker)
+ bash reproduce.sh
+ ```
+
+ **ONNX sanity (verify_onnx.py)** confirms:
+ - `model.onnx` has the inputs Typesense uses (`input_ids`, `pixel_values`, `attention_mask`)
+ - `model.onnx` outputs `text_embeds` and `image_embeds` at `[B, 512]`
+ - The first 2D output Typesense's auto-detect would pick is `text_embeds` (the dynamic-dim `logits_per_*` outputs get skipped, matching `clip-vit-b-p32` behavior)
+ - `clip_image_processor.onnx` accepts raw image bytes and returns `(1, 3, 224, 224)`
+ - ONNX outputs match the HF transformers reference at cosine similarity > 0.999 for both text and image
+ - The text tower is independent of `pixel_values` (Typesense's text path uses dummy 0.5 pixels)
+ - The image tower is independent of `input_ids` (Typesense's image path uses dummy `[[0]]` ids)
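
The parity and tower-independence checks above all reduce to a per-row cosine-similarity comparison against a threshold. A minimal sketch of that comparison follows, assuming NumPy arrays of shape `[B, 512]`; the helper `check_parity` and the synthetic embeddings are hypothetical stand-ins for the real HF and ONNX outputs.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def check_parity(reference, candidate, threshold=0.999):
    """Return per-row cosine similarities; raise if any row falls below threshold."""
    sims = [cosine(r, c) for r, c in zip(reference, candidate)]
    for i, s in enumerate(sims):
        if s < threshold:
            raise ValueError(f"row {i}: cosine {s:.6f} is below {threshold}")
    return sims


# Synthetic data: the "candidate" is the reference plus tiny numerical noise,
# which is roughly what a faithful export should look like.
rng = np.random.default_rng(0)
ref = rng.standard_normal((3, 512)).astype(np.float32)
cand = ref + 1e-4 * rng.standard_normal((3, 512)).astype(np.float32)
sims = check_parity(ref, cand)
print(sims)
```

A threshold rather than an exact `1.0` is used because floating-point differences between PyTorch and onnxruntime kernels make bit-identical embeddings unlikely.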
+
+ **Live Docker test (reproduce.sh)** confirms:
+ - Typesense loads `fashion-clip-vit-b-p32` from the local models dir without errors
+ - The CLIP tokenizer + image processor work end to end
+ - Top result for `"a red dress"` is the red cocktail dress
+ - Top result for `"comfortable shoes"` is the white sneakers
+ - Image-to-image search with `red-dress.jpg` puts the red dress first
+
+ ## Upload to HuggingFace (via PR)
+
+ We don't have direct write access to `typesense/models-moved`, so we open a community PR. The `hf` CLI (huggingface_hub >= 0.34) is the supported tool.
+
+ ```bash
+ hf auth login # paste a token from https://huggingface.co/settings/tokens
+ hf upload typesense/models-moved \
+     ./out-staging/fashion-clip-vit-b-p32 \
+     fashion-clip-vit-b-p32 \
+     --repo-type model \
+     --create-pr \
+     --commit-message "Add fashion-clip-vit-b-p32 (ONNX of patrickjohncyh/fashion-clip)" \
+     --commit-description "FashionCLIP fine-tuned on Farfetch; CLIP-ViT-B/32 architecture so vocab.txt and clip_image_processor.onnx are reused byte-for-byte from clip-vit-b-p32."
+ ```
+
+ The PR appears under https://huggingface.co/typesense/models-moved/discussions. A repo maintainer reviews and merges it.
+
+ After the merge, Typesense Cloud users reference the model as:
+
+ ```json
+ {
+   "name": "embedding",
+   "type": "float[]",
+   "embed": {
+     "from": ["image"],
+     "model_config": {"model_name": "ts/fashion-clip-vit-b-p32"}
+   }
+ }
+ ```
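
The hosted field definition above and the one used by the local test in `reproduce.sh` differ only in the `model_name` value (`ts/fashion-clip-vit-b-p32` versus `fashion-clip-vit-b-p32`). A small sketch of building both from one helper, assuming nothing beyond the JSON shown in this commit; the function name `embedding_field` is hypothetical.

```python
import json


def embedding_field(model_name, source_field="image", field_name="embedding"):
    """Build an auto-embedding field definition for a Typesense collection schema."""
    return {
        "name": field_name,
        "type": "float[]",
        "embed": {
            "from": [source_field],
            "model_config": {"model_name": model_name},
        },
    }


# Hosted model (after the PR is merged) vs. local model dir (used by reproduce.sh).
hosted = embedding_field("ts/fashion-clip-vit-b-p32")
local = embedding_field("fashion-clip-vit-b-p32")
print(json.dumps(hosted, indent=2))
```

Keeping the two definitions behind one helper makes it harder for the local test schema to drift from the one Cloud users will eventually copy.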
+
+ ## License notes
+
+ - FashionCLIP is released under MIT (per the upstream repo card). The ONNX export inherits that license. The HF README on `typesense/models-moved` should preserve the citation:
+   > Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Goncalves, D., Greco, C., Tagliabue, J. *Contrastive language and vision learning of general fashion concepts*, Scientific Reports 12, 18958 (2022).
fashion-clip-vit-b-p32/_reproducer/reproduce.sh ADDED
@@ -0,0 +1,300 @@
+ #!/bin/bash
+ # Reproducer: verify a self-exported FashionCLIP ONNX model works end-to-end in
+ # a local Typesense container before pushing it to typesense/models-moved.
+ #
+ # What this script does:
+ #   1. Spins up a Typesense Docker container on port 8108.
+ #   2. Mounts ./out-staging/fashion-clip-vit-b-p32/ into the container's models dir
+ #      so Typesense treats it as a LOCAL model (no namespace prefix, no download).
+ #   3. Creates a collection with an image embedding field that points at the model.
+ #   4. Indexes a tiny set of fashion images.
+ #   5. Runs a text-to-image search ("a red dress") and an image-to-image search to
+ #      confirm the model returns sensible top results.
+ #
+ # Notes:
+ #   - This validates the same load path that Typesense uses for ts/clip-vit-b-p32.
+ #   - Once the files are uploaded to https://huggingface.co/typesense/models-moved
+ #     under fashion-clip-vit-b-p32/, Typesense Cloud users can reference it as
+ #     model_name: "ts/fashion-clip-vit-b-p32". The local-mode test here is
+ #     functionally equivalent: same code path, same config, same I/O.
+
+ set -e
+
+ # On Git Bash / MSYS2 (Windows), `/data` gets auto-translated to
+ # `C:/Program Files/Git/data` which breaks Docker volume mounts and CLI args.
+ # Disable that translation for the whole script.
+ export MSYS_NO_PATHCONV=1
+ export MSYS2_ARG_CONV_EXCL='*'
+
+ ### Configuration ##############################################################
+
+ TYPESENSE_API_KEY=xyz
+ PORT=8108
+ TYPESENSE_HOST=http://localhost:${PORT}
+ CONTAINER_NAME=typesense-fashion-clip-onnx
+ TYPESENSE_VERSION=29.0
+ SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ STAGING_DIR="${SCRIPT_DIR}/out-staging/fashion-clip-vit-b-p32"
+ DATA_DIR="${SCRIPT_DIR}/typesense-data-${CONTAINER_NAME}"
+
+ if [ ! -f "${STAGING_DIR}/model.onnx" ]; then
+     echo "ERROR: staging dir missing model.onnx; run the export step first (export.py)"
+     exit 1
+ fi
+
+ # Python image helper below uses this.
+ export SCRIPT_DIR
+
+ ### Cleanup ####################################################################
+
+ cleanup() {
+     echo ""
+     echo "=== Cleanup ==="
+     docker stop ${CONTAINER_NAME} 2>/dev/null || true
+     docker rm ${CONTAINER_NAME} 2>/dev/null || true
+     # Container ran as root and wrote into the mounted data dir; on Windows the
+     # host user can't `rm` those files. Use a throwaway root container to wipe
+     # the dir contents from inside the same UID, then drop the now-empty dir.
+     if [ -d "${DATA_DIR}" ]; then
+         if command -v cygpath > /dev/null 2>&1; then
+             _w=$(cygpath -w "${DATA_DIR}")
+         else
+             _w="${DATA_DIR}"
+         fi
+         docker run --rm -v "${_w}:/data" alpine:3 sh -c "rm -rf /data/* /data/.??*" 2>/dev/null || true
+         rmdir "${DATA_DIR}" 2>/dev/null || true
+     fi
+     echo "Cleanup complete"
+ }
+ trap cleanup EXIT
+
+ ### Setup Typesense ############################################################
+
+ echo "=== Setting up Typesense ${TYPESENSE_VERSION} ==="
+ docker stop ${CONTAINER_NAME} 2>/dev/null || true
+ docker rm ${CONTAINER_NAME} 2>/dev/null || true
+
+ mkdir -p "${DATA_DIR}/models"
+ # Place the model files under the Typesense data dir at models/fashion-clip-vit-b-p32/
+ # (no ts_ prefix => loaded as a local model rather than a public download).
+ cp -R "${STAGING_DIR}" "${DATA_DIR}/models/"
+
+ # Docker on Windows wants a Windows-style host path for -v mounts.
+ if command -v cygpath > /dev/null 2>&1; then
+     WIN_DATA_DIR=$(cygpath -w "${DATA_DIR}")
+ else
+     WIN_DATA_DIR="${DATA_DIR}"
+ fi
+
+ docker run -d \
+     --name ${CONTAINER_NAME} \
+     -p ${PORT}:8108 \
+     -v "${WIN_DATA_DIR}:/data" \
+     typesense/typesense:${TYPESENSE_VERSION} \
+     --data-dir=/data \
+     --api-key=${TYPESENSE_API_KEY} \
+     --enable-cors
+
+ echo "Waiting for Typesense to be ready..."
+ # /health returns {"ok": true} only once Raft has elected a leader. Poll for the
+ # JSON body rather than just exit code so we don't race with leader election.
+ for _ in $(seq 1 60); do
+     body=$(curl -s "${TYPESENSE_HOST}/health" 2>/dev/null || true)
+     if echo "$body" | grep -q '"ok":true'; then break; fi
+     sleep 1
+ done
+ # Belt and suspenders: also confirm /debug reports "state":1 (means Raft is ready
+ # enough to accept writes). Up to v29, /health alone can lie under fast restart.
+ for _ in $(seq 1 30); do
+     debug_body=$(curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "${TYPESENSE_HOST}/debug" 2>/dev/null || true)
+     if echo "$debug_body" | grep -q '"state":1'; then break; fi
+     sleep 1
+ done
+ echo "Typesense is ready"
+
+ wait_for_collection() {
+     local collection=$1
+     local max_wait=${2:-30}
+     local count=0
+     while [ $count -lt $max_wait ]; do
+         if curl -s -o /dev/null -w "%{http_code}" "${TYPESENSE_HOST}/collections/${collection}" \
+             -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null | grep -q "200"; then
+             return 0
+         fi
+         sleep 1
+         count=$((count + 1))
+     done
+     echo "WARNING: Collection '${collection}' not ready after ${max_wait}s"
+     return 1
+ }
+
+ ### Create collection ##########################################################
+
+ echo ""
+ echo "=== Creating fashion-items collection ==="
+
+ CREATE_RESP=$(curl -s "${TYPESENSE_HOST}/collections" \
+     -X POST \
+     -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+     -H "Content-Type: application/json" \
+     -d '{
+         "name": "fashion-items",
+         "fields": [
+             {"name": "title", "type": "string"},
+             {"name": "image", "type": "image", "store": false},
+             {
+                 "name": "embedding",
+                 "type": "float[]",
+                 "embed": {
+                     "from": ["image"],
+                     "model_config": {"model_name": "fashion-clip-vit-b-p32"}
+                 }
+             }
+         ]
+     }')
+ echo "create response: $CREATE_RESP"
+ if ! echo "$CREATE_RESP" | jq -e '.name' > /dev/null; then
+     echo ""
+     echo "Collection creation FAILED. Recent Typesense logs:"
+     docker logs --tail 80 ${CONTAINER_NAME}
+     exit 1
+ fi
+
+ wait_for_collection "fashion-items"
+
+ ### Index a small fashion dataset ##############################################
+
+ echo ""
+ echo "=== Preparing sample fashion images ==="
+ mkdir -p "${SCRIPT_DIR}/sample-images"
+ # Python on Windows can't resolve /c/Users style paths, convert to C:\... form.
+ if command -v cygpath > /dev/null 2>&1; then
+     WIN_SCRIPT_DIR=$(cygpath -w "${SCRIPT_DIR}")
+ else
+     WIN_SCRIPT_DIR="${SCRIPT_DIR}"
+ fi
+ export SCRIPT_DIR="${WIN_SCRIPT_DIR}"
+ # Use Pillow in the project venv to fetch real photos with a sane User-Agent,
+ # falling back to solid-color placeholders that still exercise the embedder.
+ PY="${SCRIPT_DIR}/.venv/Scripts/python.exe"
+ "$PY" - <<'PYEOF'
+ import os, urllib.request
+ from PIL import Image, ImageDraw
+
+ OUT = os.path.join(os.environ['SCRIPT_DIR'], 'sample-images')
+ os.makedirs(OUT, exist_ok=True)
+ ITEMS = [
+     ('red-dress',       (200,  20,  40), 'https://images.pexels.com/photos/985635/pexels-photo-985635.jpeg?w=400'),
+     ('blue-jeans',      ( 30,  60, 140), 'https://images.pexels.com/photos/52518/jeans-pants-blue-pocket-52518.jpeg?w=400'),
+     ('white-sneakers',  (240, 240, 240), 'https://images.pexels.com/photos/1102776/pexels-photo-1102776.jpeg?w=400'),
+     ('leather-handbag', (110,  60,  30), 'https://images.pexels.com/photos/1152077/pexels-photo-1152077.jpeg?w=400'),
+     ('wool-sweater',    (180, 140,  90), 'https://images.pexels.com/photos/1721934/pexels-photo-1721934.jpeg?w=400'),
+ ]
+ for name, color, url in ITEMS:
+     path = os.path.join(OUT, name + '.jpg')
+     if os.path.exists(path) and os.path.getsize(path) > 1024:
+         continue
+     try:
+         req = urllib.request.Request(url, headers={'User-Agent':'Mozilla/5.0'})
+         with urllib.request.urlopen(req, timeout=20) as r:
+             data = r.read()
+         with open(path,'wb') as f:
+             f.write(data)
+         print(f' fetched {name}: {len(data)} bytes')
+     except Exception as e:
+         # placeholder
+         img = Image.new('RGB', (400, 400), color)
+         d = ImageDraw.Draw(img)
+         d.text((10, 10), name, fill=(255,255,255) if sum(color)<400 else (0,0,0))
+         img.save(path, 'JPEG', quality=85)
+         print(f' placeholder {name} (fetch failed: {e})')
+ PYEOF
+
+ echo ""
+ echo "=== Indexing fashion items ==="
+ build_doc() {
+     local id=$1 title=$2 file=$3
+     local b64
+     b64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/${file}.jpg")
+     printf '{"id":"%s","title":"%s","image":"%s"}\n' "$id" "$title" "$b64"
+ }
+ {
+     build_doc 1 "Red cocktail dress" red-dress
+     build_doc 2 "Blue denim jeans" blue-jeans
+     build_doc 3 "White leather sneakers" white-sneakers
+     build_doc 4 "Brown leather handbag" leather-handbag
+     build_doc 5 "Knit wool sweater" wool-sweater
+ } | curl -s "${TYPESENSE_HOST}/collections/fashion-items/documents/import?action=create" \
+     -X POST \
+     -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+     -H "Content-Type: text/plain" \
+     --data-binary @-
+
+ echo ""
+
+ ### Text>image search ##########################################################
+
+ echo "=== Text>image search: 'a red dress' ==="
+ curl -s "${TYPESENSE_HOST}/multi_search" \
+     -X POST \
+     -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+     -d '{
+         "searches": [
+             {
+                 "collection": "fashion-items",
+                 "q": "a red dress",
+                 "query_by": "embedding",
+                 "prefix": false,
+                 "include_fields": "id,title",
+                 "per_page": 3
+             }
+         ]
+     }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
+
+ echo ""
+ echo "=== Text>image search: 'comfortable shoes' ==="
+ curl -s "${TYPESENSE_HOST}/multi_search" \
+     -X POST \
+     -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+     -d '{
+         "searches": [
+             {
+                 "collection": "fashion-items",
+                 "q": "comfortable shoes",
+                 "query_by": "embedding",
+                 "prefix": false,
+                 "include_fields": "id,title",
+                 "per_page": 3
+             }
+         ]
+     }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
+
+ ### Image>image search #########################################################
+
+ echo ""
+ echo "=== Image>image search using red-dress.jpg ==="
+ QUERY_IMG_B64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/red-dress.jpg")
+ curl -s "${TYPESENSE_HOST}/multi_search" \
+     -X POST \
+     -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+     -H "Content-Type: application/json" \
+     -d "$(jq -n --arg img "$QUERY_IMG_B64" '{
+         searches: [{
+             collection: "fashion-items",
+             q: "*",
+             vector_query: ("embedding:([], queries: [\"" + $img + "\"], k: 3)"),
+             include_fields: "id,title"
+         }]
+     }')" | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
+
+ ### Container logs #############################################################
+
+ echo ""
+ echo "=== Typesense Logs (last 40 lines) ==="
+ docker logs --tail 40 ${CONTAINER_NAME}
+
+ echo ""
+ echo "=== VERIFICATION ==="
+ echo "Expect text query 'a red dress' to put id=1 (Red cocktail dress) first."
+ echo "Expect text query 'comfortable shoes' to put id=3 (White leather sneakers) first."
+ echo "Expect image>image query with red-dress.jpg to put id=1 first."
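
The two readiness loops in `reproduce.sh` poll an endpoint until a marker appears, but fall through to "Typesense is ready" even when the marker never shows up. The sketch below shows the same polling pattern in Python with an explicit timeout result; `wait_until` and `fake_probe` are hypothetical names, and the probe is a stand-in for a real HTTP check against `/health` or `/debug`.

```python
import time


def wait_until(probe, attempts=60, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times; return True as soon as it succeeds."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            # Treat probe errors (e.g. connection refused) as "not ready yet".
            pass
        sleep(delay)
    return False


# Demo: a fake probe that reports ready on its third call.
calls = {"n": 0}


def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_probe, attempts=5, delay=0, sleep=lambda _: None)
print(ready, calls["n"])
```

Returning `False` on timeout lets the caller abort with the container logs instead of continuing against a server that never became ready.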
fashion-clip-vit-b-p32/_reproducer/verify_onnx.py ADDED
@@ -0,0 +1,199 @@
+ """
+ End-to-end sanity test for the FashionCLIP ONNX export.
+
+ Replicates the exact I/O contract Typesense uses:
+   - model.onnx text path: input_ids (int64) + pixel_values (float32 [1,3,224,224] dummy 0.5) + attention_mask
+       -> read 2D [-1, 512] output (text_embeds)
+   - model.onnx image path: input_ids dummy + pixel_values + attention_mask
+       -> read image_embeds [B, 512]
+   - clip_image_processor.onnx: image bytes (uint8 [N]) -> last_hidden_state [1,3,224,224]
+
+ Also cross-checks ONNX outputs against the original HF transformers FashionCLIP forward pass
+ to confirm the export preserves semantics (cosine similarity > 0.999).
+ """
+
+ import io
+ import os
+ import sys
+ import hashlib
+
+ import numpy as np
+ import onnxruntime as ort
+ from PIL import Image
+ import requests
+ from transformers import CLIPModel, CLIPProcessor
+ import torch
+
+ STAGING = os.path.join(os.path.dirname(__file__), "out-staging", "fashion-clip-vit-b-p32")
+ SRC = os.path.join(os.path.dirname(__file__), "fashion-clip-src")
+ MODEL_ONNX = os.path.join(STAGING, "model.onnx")
+ PROCESSOR_ONNX = os.path.join(STAGING, "clip_image_processor.onnx")
+
+ SAMPLE_IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
+ SAMPLE_TEXTS = ["a photo of a cat", "a red dress", "blue denim jeans"]
+
+
+ def fail(msg):
+     print(f"FAIL: {msg}")
+     sys.exit(1)
+
+
+ def passed(msg):
+     print(f"OK : {msg}")
+
+
+ def get_sample_image_bytes():
+     cache = os.path.join(os.path.dirname(__file__), "sample.jpg")
+     if not os.path.exists(cache):
+         r = requests.get(SAMPLE_IMAGE_URL, timeout=30)
+         r.raise_for_status()
+         with open(cache, "wb") as f:
+             f.write(r.content)
+     # Read through a context manager so the file handle is closed promptly.
+     with open(cache, "rb") as f:
+         return f.read()
+
+
+ def main():
+     print("=== Inspecting model.onnx ===")
+     sess = ort.InferenceSession(MODEL_ONNX, providers=["CPUExecutionProvider"])
+     in_names = [i.name for i in sess.get_inputs()]
+     out_names = [o.name for o in sess.get_outputs()]
+     print("inputs :", in_names)
+     print("outputs:", out_names)
+     assert "input_ids" in in_names, "missing input_ids"
+     assert "pixel_values" in in_names, "missing pixel_values"
+     assert "attention_mask" in in_names, "missing attention_mask"
+     assert "text_embeds" in out_names, "missing text_embeds"
+     assert "image_embeds" in out_names, "missing image_embeds"
+     # Typesense scans outputs and picks the first 2D [-1, N>0] one as the embedding output.
+     # logits_per_* have both dims dynamic, so they get skipped; text_embeds should be the pick.
+     typesense_chosen = None
+     for o in sess.get_outputs():
+         shp = o.shape
+         if len(shp) == 2 and shp[0] in (None, "text_batch_size", "image_batch_size", -1) and isinstance(shp[1], int) and shp[1] > 0:
+             typesense_chosen = (o.name, shp)
+             break
+     print("typesense would pick:", typesense_chosen)
+     # Guard against None so a model with no matching output fails cleanly
+     # instead of raising a TypeError on the subscript.
+     if typesense_chosen is None or typesense_chosen[0] != "text_embeds":
+         fail(f"expected text_embeds to be the first 2D embedding output, got {typesense_chosen}")
+     passed("model.onnx I/O matches Typesense expectations")
+
+     print()
+     print("=== Inspecting clip_image_processor.onnx ===")
+     from onnxruntime_extensions import get_library_path
+     proc_opts = ort.SessionOptions()
+     proc_opts.register_custom_ops_library(get_library_path())
+     proc_sess = ort.InferenceSession(PROCESSOR_ONNX, sess_options=proc_opts, providers=["CPUExecutionProvider"])
+     pin = [(i.name, i.type, i.shape) for i in proc_sess.get_inputs()]
+     pout = [(o.name, o.type, o.shape) for o in proc_sess.get_outputs()]
+     print("processor inputs :", pin)
+     print("processor outputs:", pout)
+     assert pin[0][0] == "image", "processor expects input named 'image'"
+     assert pout[0][0] == "last_hidden_state", "processor expects output 'last_hidden_state'"
+     passed("clip_image_processor.onnx I/O matches Typesense expectations")
+
+     print()
+     print("=== Running image processor on a real image ===")
+     img_bytes = get_sample_image_bytes()
+     img_arr = np.frombuffer(img_bytes, dtype=np.uint8)
+     proc_out = proc_sess.run(["last_hidden_state"], {"image": img_arr})[0]
+     print("processor output shape:", proc_out.shape, "dtype:", proc_out.dtype)
+     if proc_out.shape != (1, 3, 224, 224):
+         fail(f"expected (1,3,224,224), got {proc_out.shape}")
+     passed("image processor returns (1,3,224,224)")
+
+     print()
+     print("=== ONNX vs HF transformers parity check ===")
+     # Build reference using the original transformers model.
+     hf_model = CLIPModel.from_pretrained(SRC).eval()
+     hf_processor = CLIPProcessor.from_pretrained(SRC)
+     pil = Image.open(io.BytesIO(img_bytes)).convert("RGB")
+
+     hf_inputs = hf_processor(text=SAMPLE_TEXTS, images=pil, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         hf_out = hf_model(**hf_inputs)
+     hf_text = hf_out.text_embeds.numpy()
+     hf_image = hf_out.image_embeds.numpy()
+
+     # Now run the same inputs through ONNX model.onnx.
+     # Note: ONNX export needs the same pixel_values as HF.
+     pixel_values = hf_inputs["pixel_values"].numpy()
+     input_ids = hf_inputs["input_ids"].numpy().astype(np.int64)
+     attention_mask = hf_inputs["attention_mask"].numpy().astype(np.int64)
+
+     onnx_out = sess.run(
+         ["text_embeds", "image_embeds"],
+         {
+             "input_ids": input_ids,
+             "pixel_values": pixel_values,
+             "attention_mask": attention_mask,
+         },
+     )
+     onnx_text, onnx_image = onnx_out
+
+     def cosine(a, b):
+         return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+     for i, txt in enumerate(SAMPLE_TEXTS):
+         c = cosine(hf_text[i], onnx_text[i])
+         print(f" text[{i}] '{txt}': cosine(HF, ONNX) = {c:.6f}")
+         if c < 0.999:
+             fail(f"text_embeds parity too low: {c}")
+     c_img = cosine(hf_image[0], onnx_image[0])
+     print(f" image: cosine(HF, ONNX) = {c_img:.6f}")
+     if c_img < 0.999:
+         fail(f"image_embeds parity too low: {c_img}")
+     passed("ONNX text+image embeddings match HF reference (cosine > 0.999)")
+
+     print()
+     print("=== Typesense-style text path (dummy pixel_values = 0.5) ===")
+     # Typesense's text_embedder fills pixel_values with 0.5 for text-only queries.
+     # The text_embeds output should not depend on pixel_values (the towers are independent),
+     # so this should match the real-image text embedding.
+     dummy_pixels = np.full((1, 3, 224, 224), 0.5, dtype=np.float32)
+     typesense_text = sess.run(
+         ["text_embeds"],
+         {
+             "input_ids": input_ids[:1],
+             "attention_mask": attention_mask[:1],
+             "pixel_values": dummy_pixels,
+         },
+     )[0]
+     c = cosine(typesense_text[0], onnx_text[0])
+     print(f" typesense text path vs onnx text path: cosine = {c:.6f}")
+     if c < 0.9999:
+         fail(f"text path differs when pixel_values change - towers shouldn't be coupled: {c}")
+     passed("text_embeds is independent of pixel_values (Typesense text path safe)")
+
+     print()
+     print("=== Typesense-style image embed path (dummy input_ids) ===")
+     # image_embedder.cpp passes input_ids shape [1,1] = [[0]] alongside pixel_values.
+     dummy_ids = np.array([[0]], dtype=np.int64)
+     dummy_mask = np.array([[1]], dtype=np.int64)
+     typesense_image = sess.run(
+         ["image_embeds"],
+         {
+             "input_ids": dummy_ids,
+             "pixel_values": pixel_values[:1],
+             "attention_mask": dummy_mask,
+         },
+     )[0]
+     c = cosine(typesense_image[0], onnx_image[0])
+     print(f" typesense image path vs reference: cosine = {c:.6f}")
+     if c < 0.9999:
+         fail(f"image_embeds differs when input_ids change - towers shouldn't be coupled: {c}")
+     passed("image_embeds is independent of input_ids (Typesense image path safe)")
+
+     print()
+     print("=== Computing MD5 of model.onnx for config.json ===")
+     h = hashlib.md5()
+     with open(MODEL_ONNX, "rb") as f:
+         for chunk in iter(lambda: f.read(1 << 20), b""):
+             h.update(chunk)
+     print("model.onnx md5:", h.hexdigest())
+
+     print()
+     print("ALL CHECKS PASSED")
+
+
+ if __name__ == "__main__":
+     main()
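
The output-selection rule that `verify_onnx.py` emulates (first 2D output with a dynamic batch dimension and a fixed positive width) can be isolated as a pure function and tested without loading a model. This is a sketch under assumptions: it treats any string-valued dimension as dynamic, which is broader than the explicit names the script checks, and the output order in the demo is hypothetical rather than read from the real `model.onnx`.

```python
def pick_embedding_output(outputs):
    """Return the name of the first 2D output shaped [dynamic, N>0], or None.

    `outputs` is a list of (name, shape) pairs, where each shape is a list
    whose entries are ints, None, or symbolic dimension names (strings).
    """
    for name, shape in outputs:
        if len(shape) != 2:
            continue
        batch, width = shape
        batch_is_dynamic = batch is None or batch == -1 or isinstance(batch, str)
        if batch_is_dynamic and isinstance(width, int) and width > 0:
            return name
    return None


# Hypothetical output listing for a CLIP-style export.
demo_outputs = [
    ("logits_per_image", ["image_batch_size", "text_batch_size"]),
    ("logits_per_text", ["text_batch_size", "image_batch_size"]),
    ("text_embeds", ["text_batch_size", 512]),
    ("image_embeds", ["image_batch_size", 512]),
]
chosen = pick_embedding_output(demo_outputs)
print(chosen)
```

Because the `logits_per_*` outputs have two symbolic dimensions, they fail the fixed-width test and are skipped, which is the behavior the README's verification list describes.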