alangmartini committed on
Commit
a094735
·
verified ·
1 Parent(s): bd847c9

Remove _reproducer; scripts will live in PR description instead

fashion-clip-vit-b-p32/_reproducer/README.md DELETED
@@ -1,97 +0,0 @@
1
- # FashionCLIP ONNX export for Typesense
2
-
3
- Builds an ONNX-format copy of [patrickjohncyh/fashion-clip](https://huggingface.co/patrickjohncyh/fashion-clip) packaged in the layout Typesense expects, and verifies it locally before uploading to [typesense/models-moved](https://huggingface.co/typesense/models-moved).
4
-
5
- FashionCLIP is a CLIP-ViT-B/32 model fine-tuned on Farfetch fashion images. It is a drop-in replacement for `ts/clip-vit-b-p32` when the corpus is fashion/apparel.
6
-
7
- ## Why this exists
8
-
9
- Typesense ships first-party support for OpenAI CLIP ViT-B/32. FashionCLIP uses the identical architecture but is fine-tuned on fashion data, which materially improves retrieval on apparel datasets. Hosting it on `typesense/models-moved` lets Cloud users select it with `model_name: "ts/fashion-clip-vit-b-p32"`.
10
-
11
- Because the architecture matches CLIP ViT-B/32 exactly, `clip_tokenizer.onnx`, `vocab.txt` (CLIP BPE merges), and `clip_image_processor.onnx` are reused byte-for-byte from `clip-vit-b-p32`. Only `model.onnx` is newly exported; `config.json` is regenerated with the current MD5 checksums.
12
-
13
- ## Files in this directory
14
-
15
- | File | Purpose |
16
- |---|---|
17
- | `verify_onnx.py` | Exports nothing. Runs the staged `model.onnx` and `clip_image_processor.onnx` through onnxruntime, compares the embeddings against the HF transformers reference (cosine similarity must exceed 0.999), and validates Typesense's exact I/O contract (input/output tensor names, shapes, independence of text/image towers). |
18
- | `export.py` | Re-runnable export script: downloads FashionCLIP, exports `model.onnx` via Optimum, pulls reusable artifacts from the existing Typesense CLIP repo, writes `config.json` with current MD5s. |
19
- | `reproduce.sh` | Full local Docker test: mounts the staged model into a Typesense 29.0 container, creates a `fashion-items` collection with an image embedding field, indexes 5 fashion images, runs text>image and image>image searches. |
20
- | `out-staging/fashion-clip-vit-b-p32/` | The exact 5 files to upload to HuggingFace under `typesense/models-moved/fashion-clip-vit-b-p32/`. |
21
- | `fashion-clip-src/` | Cache of the upstream HF snapshot. Gitignored. |
22
-
23
- ## What ships in the HF bundle
24
-
25
- ```
26
- fashion-clip-vit-b-p32/
27
- ├── model.onnx # FashionCLIP weights (~578 MB)
28
- ├── clip_tokenizer.onnx # Reused from clip-vit-b-p32 (1.3 MB)
29
- ├── clip_image_processor.onnx # Reused from clip-vit-b-p32 (~4 KB)
30
- ├── vocab.txt # CLIP BPE merges, reused (3.0 MB)
31
- └── config.json # Typesense metadata + MD5 checksums
32
- ```
33
-
34
- The HF README at `typesense/models-moved/fashion-clip-vit-b-p32/README.md` should credit `patrickjohncyh/fashion-clip` upstream.
35
-
36
- ## Verification status
37
-
38
- Run from this directory:
39
-
40
- ```powershell
41
- # 1. ONNX runtime sanity (no Docker required)
42
- .venv/Scripts/python.exe verify_onnx.py
43
-
44
- # 2. Live Typesense Docker test (requires working Docker)
45
- bash reproduce.sh
46
- ```
47
-
48
- **ONNX sanity (verify_onnx.py)** confirms:
49
- - `model.onnx` has the inputs Typesense uses (`input_ids`, `pixel_values`, `attention_mask`)
50
- - `model.onnx` outputs `text_embeds` and `image_embeds` at `[B, 512]`
51
- - The first 2D output Typesense's auto-detect would pick is `text_embeds` (the dynamic-dim `logits_per_*` outputs get skipped, matching `clip-vit-b-p32` behavior)
52
- - `clip_image_processor.onnx` accepts raw image bytes and returns `(1, 3, 224, 224)`
53
- - ONNX outputs match the HF transformers reference for both text and image (the script requires cosine > 0.999; the observed value is 1.0)
54
- - The text tower is independent of `pixel_values` (Typesense's text path uses dummy 0.5 pixels)
55
- - The image tower is independent of `input_ids` (Typesense's image path uses dummy `[[0]]` ids)
56
-
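The auto-detect rule in the checklist above can be sketched in isolation. This is a minimal sketch of the rule as `verify_onnx.py` describes it (first 2D output with a dynamic batch dimension and a fixed, positive width), not Typesense's actual implementation. The function name `pick_embedding_output` and the sample output list are hypothetical, and treating any string-valued dimension as dynamic is an assumption.

```python
from typing import Optional, Sequence, Tuple, Union

Dim = Union[int, str, None]


def pick_embedding_output(outputs: Sequence[Tuple[str, Sequence[Dim]]]) -> Optional[str]:
    """Return the name of the first 2D output shaped [dynamic, fixed N > 0]."""
    for name, shape in outputs:
        if len(shape) != 2:
            continue
        batch, width = shape
        # Assumption: a dynamic dim is reported as None, -1, or a symbolic name.
        batch_is_dynamic = batch is None or isinstance(batch, str) or batch == -1
        width_is_fixed = isinstance(width, int) and width > 0
        if batch_is_dynamic and width_is_fixed:
            return name
    return None


# Hypothetical output list, in the order a CLIP export might declare them.
clip_outputs = [
    ("logits_per_image", ["image_batch_size", "text_batch_size"]),
    ("logits_per_text", ["text_batch_size", "image_batch_size"]),
    ("text_embeds", ["text_batch_size", 512]),
    ("image_embeds", ["image_batch_size", 512]),
]
chosen = pick_embedding_output(clip_outputs)
```

Both `logits_per_*` entries are skipped because their second dimension is symbolic, so the scan settles on `text_embeds`, which is the behavior the checklist expects.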
57
- **Live Docker test (reproduce.sh)** confirms:
58
- - Typesense loads `fashion-clip-vit-b-p32` from the local models dir without errors
59
- - The CLIP tokenizer + image processor work end to end
60
- - Top result for `"a red dress"` is the red cocktail dress
61
- - Top result for `"comfortable shoes"` is the white sneakers
62
- - Image>image with `red-dress.jpg` puts the red dress first
63
-
64
- ## Upload to HuggingFace (via PR)
65
-
66
- We don't have direct write access to `typesense/models-moved`, so we open a community PR. The `hf` CLI (huggingface_hub >= 0.34) is the supported tool.
67
-
68
- ```bash
69
- hf auth login # paste a token from https://huggingface.co/settings/tokens
70
- hf upload typesense/models-moved \
71
- ./out-staging/fashion-clip-vit-b-p32 \
72
- fashion-clip-vit-b-p32 \
73
- --repo-type model \
74
- --create-pr \
75
- --commit-message "Add fashion-clip-vit-b-p32 (ONNX of patrickjohncyh/fashion-clip)" \
76
- --commit-description "FashionCLIP fine-tuned on Farfetch; CLIP-ViT-B/32 architecture so vocab.txt and clip_image_processor.onnx are reused byte-for-byte from clip-vit-b-p32."
77
- ```
78
-
79
- The PR appears under https://huggingface.co/typesense/models-moved/discussions where a repo maintainer reviews and merges it.
80
-
81
- After the PR is merged, Typesense Cloud users reference it as:
82
-
83
- ```json
84
- {
85
- "name": "embedding",
86
- "type": "float[]",
87
- "embed": {
88
- "from": ["image"],
89
- "model_config": {"model_name": "ts/fashion-clip-vit-b-p32"}
90
- }
91
- }
92
- ```
93
-
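The field definition above only differs between the local test and the hosted model in its `model_name`. A minimal sketch of building the full collection schema for either case follows; the helper `image_embedding_schema` is hypothetical, and the surrounding `title` and `image` fields are taken from the schema that `reproduce.sh` posts.

```python
import json


def image_embedding_schema(collection: str, model_name: str) -> dict:
    """Build a collection schema whose `embedding` field is derived from `image`."""
    return {
        "name": collection,
        "fields": [
            {"name": "title", "type": "string"},
            {"name": "image", "type": "image", "store": False},
            {
                "name": "embedding",
                "type": "float[]",
                "embed": {
                    "from": ["image"],
                    "model_config": {"model_name": model_name},
                },
            },
        ],
    }


# Local model directory (no namespace prefix) versus the hosted `ts/` name.
local_schema = image_embedding_schema("fashion-items", "fashion-clip-vit-b-p32")
cloud_schema = image_embedding_schema("fashion-items", "ts/fashion-clip-vit-b-p32")
payload = json.dumps(cloud_schema)  # body for POST /collections
```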
94
- ## License notes
95
-
96
- - FashionCLIP is released under MIT (per the upstream repo card). The ONNX export inherits that license. The HF README on `typesense/models-moved` should preserve the citation:
97
- > Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Goncalves, D., Greco, C., Tagliabue, J. *Contrastive language and vision learning of general fashion concepts*, Scientific Reports 12, 18958 (2022).
fashion-clip-vit-b-p32/_reproducer/reproduce.sh DELETED
@@ -1,300 +0,0 @@
1
- #!/bin/bash
2
- # Reproducer: verify a self-exported FashionCLIP ONNX model works end to end in
3
- # a local Typesense container before pushing it to typesense/models-moved.
4
- #
5
- # What this script does:
6
- # 1. Spins up a Typesense Docker container on port 8108.
7
- # 2. Mounts ./out-staging/fashion-clip-vit-b-p32/ into the container's models dir
8
- # so Typesense treats it as a LOCAL model (no namespace prefix, no download).
9
- # 3. Creates a collection with an image embedding field that points at the model.
10
- # 4. Indexes a tiny set of fashion images.
11
- # 5. Runs a text>image search ("a red dress") and an image>image search to
12
- # confirm the model returns sensible top results.
13
- #
14
- # Notes:
15
- # - This validates the same load path that Typesense uses for ts/clip-vit-b-p32.
16
- # - Once the files are uploaded to https://huggingface.co/typesense/models-moved
17
- # under fashion-clip-vit-b-p32/, Typesense Cloud users can reference it as
18
- # model_name: "ts/fashion-clip-vit-b-p32". The local-mode test here is
19
- # functionally equivalent: same code path, same config, same I/O.
20
-
21
- set -e
22
-
23
- # On Git Bash / MSYS2 (Windows), `/data` gets auto-translated to
24
- # `C:/Program Files/Git/data` which breaks Docker volume mounts and CLI args.
25
- # Disable that translation for the whole script.
26
- export MSYS_NO_PATHCONV=1
27
- export MSYS2_ARG_CONV_EXCL='*'
28
-
29
- ### Configuration ##############################################################
30
-
31
- TYPESENSE_API_KEY=xyz
32
- PORT=8108
33
- TYPESENSE_HOST=http://localhost:${PORT}
34
- CONTAINER_NAME=typesense-fashion-clip-onnx
35
- TYPESENSE_VERSION=29.0
36
- SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
37
- STAGING_DIR="${SCRIPT_DIR}/out-staging/fashion-clip-vit-b-p32"
38
- DATA_DIR="${SCRIPT_DIR}/typesense-data-${CONTAINER_NAME}"
39
-
40
- if [ ! -f "${STAGING_DIR}/model.onnx" ]; then
41
- echo "ERROR: staging dir missing model.onnx; run the export step first (export.py)"
42
- exit 1
43
- fi
44
-
45
- # Python image helper below uses this.
46
- export SCRIPT_DIR
47
-
48
- ### Cleanup ####################################################################
49
-
50
- cleanup() {
51
- echo ""
52
- echo "=== Cleanup ==="
53
- docker stop ${CONTAINER_NAME} 2>/dev/null || true
54
- docker rm ${CONTAINER_NAME} 2>/dev/null || true
55
- # Container ran as root and wrote into the mounted data dir; on Windows the
56
- # host user can't `rm` those files. Use a throwaway root container to wipe
57
- # the dir contents as the same UID, then drop the now-empty dir.
58
- if [ -d "${DATA_DIR}" ]; then
59
- if command -v cygpath > /dev/null 2>&1; then
60
- _w=$(cygpath -w "${DATA_DIR}")
61
- else
62
- _w="${DATA_DIR}"
63
- fi
64
- docker run --rm -v "${_w}:/data" alpine:3 sh -c "rm -rf /data/* /data/.??*" 2>/dev/null || true
65
- rmdir "${DATA_DIR}" 2>/dev/null || true
66
- fi
67
- echo "Cleanup complete"
68
- }
69
- trap cleanup EXIT
70
-
71
- ### Setup Typesense ############################################################
72
-
73
- echo "=== Setting up Typesense ${TYPESENSE_VERSION} ==="
74
- docker stop ${CONTAINER_NAME} 2>/dev/null || true
75
- docker rm ${CONTAINER_NAME} 2>/dev/null || true
76
-
77
- mkdir -p "${DATA_DIR}/models"
78
- # Place the model files under the Typesense data dir at models/fashion-clip-vit-b-p32/
79
- # (no ts_ prefix => loaded as a local model rather than a public download).
80
- cp -R "${STAGING_DIR}" "${DATA_DIR}/models/"
81
-
82
- # Docker on Windows wants a Windows-style host path for -v mounts.
83
- if command -v cygpath > /dev/null 2>&1; then
84
- WIN_DATA_DIR=$(cygpath -w "${DATA_DIR}")
85
- else
86
- WIN_DATA_DIR="${DATA_DIR}"
87
- fi
88
-
89
- docker run -d \
90
- --name ${CONTAINER_NAME} \
91
- -p ${PORT}:8108 \
92
- -v "${WIN_DATA_DIR}:/data" \
93
- typesense/typesense:${TYPESENSE_VERSION} \
94
- --data-dir=/data \
95
- --api-key=${TYPESENSE_API_KEY} \
96
- --enable-cors
97
-
98
- echo "Waiting for Typesense to be ready..."
99
- # /health returns {"ok": true} only once Raft has elected a leader. Poll for the
100
- # JSON body rather than just exit code so we don't race with leader election.
101
- for _ in $(seq 1 60); do
102
- body=$(curl -s "${TYPESENSE_HOST}/health" 2>/dev/null || true)
103
- if echo "$body" | grep -q '"ok":true'; then break; fi
104
- sleep 1
105
- done
106
- # Belt and suspenders: also confirm /debug reports "state":1 (means Raft is ready
107
- # enough to accept writes). Up to v29, /health alone can lie under fast restart.
108
- for _ in $(seq 1 30); do
109
- debug_body=$(curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "${TYPESENSE_HOST}/debug" 2>/dev/null || true)
110
- if echo "$debug_body" | grep -q '"state":1'; then break; fi
111
- sleep 1
112
- done
113
- echo "Typesense is ready"
114
-
115
- wait_for_collection() {
116
- local collection=$1
117
- local max_wait=${2:-30}
118
- local count=0
119
- while [ $count -lt $max_wait ]; do
120
- if curl -s -o /dev/null -w "%{http_code}" "${TYPESENSE_HOST}/collections/${collection}" \
121
- -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null | grep -q "200"; then
122
- return 0
123
- fi
124
- sleep 1
125
- count=$((count + 1))
126
- done
127
- echo "WARNING: Collection '${collection}' not ready after ${max_wait}s"
128
- return 1
129
- }
130
-
131
- ### Create collection ##########################################################
132
-
133
- echo ""
134
- echo "=== Creating fashion-items collection ==="
135
-
136
- CREATE_RESP=$(curl -s "${TYPESENSE_HOST}/collections" \
137
- -X POST \
138
- -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
139
- -H "Content-Type: application/json" \
140
- -d '{
141
- "name": "fashion-items",
142
- "fields": [
143
- {"name": "title", "type": "string"},
144
- {"name": "image", "type": "image", "store": false},
145
- {
146
- "name": "embedding",
147
- "type": "float[]",
148
- "embed": {
149
- "from": ["image"],
150
- "model_config": {"model_name": "fashion-clip-vit-b-p32"}
151
- }
152
- }
153
- ]
154
- }')
155
- echo "create response: $CREATE_RESP"
156
- if ! echo "$CREATE_RESP" | jq -e '.name' > /dev/null; then
157
- echo ""
158
- echo "Collection creation FAILED. Recent Typesense logs:"
159
- docker logs --tail 80 ${CONTAINER_NAME}
160
- exit 1
161
- fi
162
-
163
- wait_for_collection "fashion-items"
164
-
165
- ### Index a small fashion dataset ##############################################
166
-
167
- echo ""
168
- echo "=== Preparing sample fashion images ==="
169
- mkdir -p "${SCRIPT_DIR}/sample-images"
170
- # Python on Windows can't resolve /c/Users style paths, convert to C:\... form.
171
- if command -v cygpath > /dev/null 2>&1; then
172
- WIN_SCRIPT_DIR=$(cygpath -w "${SCRIPT_DIR}")
173
- else
174
- WIN_SCRIPT_DIR="${SCRIPT_DIR}"
175
- fi
176
- export SCRIPT_DIR="${WIN_SCRIPT_DIR}"
177
- # Use the project venv's Python to fetch real photos with a sane User-Agent,
178
- # falling back to solid-color Pillow placeholders that still exercise the embedder.
179
- PY="${SCRIPT_DIR}/.venv/Scripts/python.exe"
180
- "$PY" - <<'PYEOF'
181
- import os, urllib.request
182
- from PIL import Image, ImageDraw
183
-
184
- OUT = os.path.join(os.environ['SCRIPT_DIR'], 'sample-images')
185
- os.makedirs(OUT, exist_ok=True)
186
- ITEMS = [
187
-     ('red-dress', (200, 20, 40), 'https://images.pexels.com/photos/985635/pexels-photo-985635.jpeg?w=400'),
188
-     ('blue-jeans', ( 30, 60, 140), 'https://images.pexels.com/photos/52518/jeans-pants-blue-pocket-52518.jpeg?w=400'),
189
-     ('white-sneakers', (240, 240, 240), 'https://images.pexels.com/photos/1102776/pexels-photo-1102776.jpeg?w=400'),
190
-     ('leather-handbag', (110, 60, 30), 'https://images.pexels.com/photos/1152077/pexels-photo-1152077.jpeg?w=400'),
191
-     ('wool-sweater', (180, 140, 90), 'https://images.pexels.com/photos/1721934/pexels-photo-1721934.jpeg?w=400'),
192
- ]
193
- for name, color, url in ITEMS:
194
-     path = os.path.join(OUT, name + '.jpg')
195
-     if os.path.exists(path) and os.path.getsize(path) > 1024:
196
-         continue
197
-     try:
198
-         req = urllib.request.Request(url, headers={'User-Agent':'Mozilla/5.0'})
199
-         with urllib.request.urlopen(req, timeout=20) as r:
200
-             data = r.read()
201
-         with open(path,'wb') as f:
202
-             f.write(data)
203
-         print(f' fetched {name}: {len(data)} bytes')
204
-     except Exception as e:
205
-         # placeholder
206
-         img = Image.new('RGB', (400, 400), color)
207
-         d = ImageDraw.Draw(img)
208
-         d.text((10, 10), name, fill=(255,255,255) if sum(color)<400 else (0,0,0))
209
-         img.save(path, 'JPEG', quality=85)
210
-         print(f' placeholder {name} (fetch failed: {e})')
211
- PYEOF
212
-
213
- echo ""
214
- echo "=== Indexing fashion items ==="
215
- build_doc() {
216
- local id=$1 title=$2 file=$3
217
- local b64
218
- b64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/${file}.jpg")
219
- printf '{"id":"%s","title":"%s","image":"%s"}\n' "$id" "$title" "$b64"
220
- }
221
- {
222
- build_doc 1 "Red cocktail dress" red-dress
223
- build_doc 2 "Blue denim jeans" blue-jeans
224
- build_doc 3 "White leather sneakers" white-sneakers
225
- build_doc 4 "Brown leather handbag" leather-handbag
226
- build_doc 5 "Knit wool sweater" wool-sweater
227
- } | curl -s "${TYPESENSE_HOST}/collections/fashion-items/documents/import?action=create" \
228
- -X POST \
229
- -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
230
- -H "Content-Type: text/plain" \
231
- --data-binary @-
232
-
233
- echo ""
234
-
235
- ### Text>image search ##########################################################
236
-
237
- echo "=== Text>image search: 'a red dress' ==="
238
- curl -s "${TYPESENSE_HOST}/multi_search" \
239
- -X POST \
240
- -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
241
- -d '{
242
- "searches": [
243
- {
244
- "collection": "fashion-items",
245
- "q": "a red dress",
246
- "query_by": "embedding",
247
- "prefix": false,
248
- "include_fields": "id,title",
249
- "per_page": 3
250
- }
251
- ]
252
- }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
253
-
254
- echo ""
255
- echo "=== Text>image search: 'comfortable shoes' ==="
256
- curl -s "${TYPESENSE_HOST}/multi_search" \
257
- -X POST \
258
- -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
259
- -d '{
260
- "searches": [
261
- {
262
- "collection": "fashion-items",
263
- "q": "comfortable shoes",
264
- "query_by": "embedding",
265
- "prefix": false,
266
- "include_fields": "id,title",
267
- "per_page": 3
268
- }
269
- ]
270
- }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
271
-
272
- ### Image>image search #########################################################
273
-
274
- echo ""
275
- echo "=== Image>image search using red-dress.jpg ==="
276
- QUERY_IMG_B64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/red-dress.jpg")
277
- curl -s "${TYPESENSE_HOST}/multi_search" \
278
- -X POST \
279
- -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
280
- -H "Content-Type: application/json" \
281
- -d "$(jq -n --arg img "$QUERY_IMG_B64" '{
282
- searches: [{
283
- collection: "fashion-items",
284
- q: "*",
285
- vector_query: ("embedding:([], queries: [\"" + $img + "\"], k: 3)"),
286
- include_fields: "id,title"
287
- }]
288
- }')" | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
289
-
290
- ### Container logs #############################################################
291
-
292
- echo ""
293
- echo "=== Typesense Logs (last 40 lines) ==="
294
- docker logs --tail 40 ${CONTAINER_NAME}
295
-
296
- echo ""
297
- echo "=== VERIFICATION ==="
298
- echo "Expect text query 'a red dress' to put id=1 (Red cocktail dress) first."
299
- echo "Expect text query 'comfortable shoes' to put id=3 (White leather sneakers) first."
300
- echo "Expect image>image query with red-dress.jpg to put id=1 first."
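The verification step above only prints the expectations and leaves the comparison to the reader. A minimal sketch of checking them programmatically is shown here, assuming the `multi_search` response has the shape the script's `jq` filters read (`results[0].hits[].document.id`) and that hits arrive best-first. The helper names and the sample response, including its distance values, are hypothetical.

```python
from typing import Optional


def top_hit_id(multi_search_response: dict, search_index: int = 0) -> Optional[str]:
    """Return the document id of the first hit, or None if there are no hits."""
    hits = multi_search_response["results"][search_index].get("hits", [])
    if not hits:
        return None
    return hits[0]["document"]["id"]


def check_top_hit(response: dict, expected_id: str) -> bool:
    """True when the first hit of the first search is the expected document."""
    return top_hit_id(response) == expected_id


# Hypothetical response; the distances are made up for illustration.
sample_response = {
    "results": [
        {
            "hits": [
                {"document": {"id": "1", "title": "Red cocktail dress"}, "vector_distance": 0.71},
                {"document": {"id": "5", "title": "Knit wool sweater"}, "vector_distance": 0.83},
            ]
        }
    ]
}
red_dress_ok = check_top_hit(sample_response, "1")
```

A wrapper could feed each of the three search responses through `check_top_hit` and exit non-zero on the first mismatch, which would let the reproducer fail on a bad export instead of relying on a visual check.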
fashion-clip-vit-b-p32/_reproducer/verify_onnx.py DELETED
@@ -1,199 +0,0 @@
1
- """
2
- End-to-end sanity test for the FashionCLIP ONNX export.
3
-
4
- Replicates the exact I/O contract Typesense uses:
5
- - model.onnx text path: input_ids (int64) + pixel_values (float32 [1,3,224,224] dummy 0.5) + attention_mask
6
- -> read 2D [-1, 512] output (text_embeds)
7
- - model.onnx image path: input_ids dummy + pixel_values + attention_mask
8
- -> read image_embeds [B, 512]
9
- - clip_image_processor.onnx: image bytes (uint8 [N]) -> last_hidden_state [1,3,224,224]
10
-
11
- Also cross-checks ONNX outputs against the original HF transformers FashionCLIP forward pass
12
- to confirm the export preserves semantics (cosine similarity > 0.999).
13
- """
14
-
15
- import io
16
- import os
17
- import sys
18
- import hashlib
19
-
20
- import numpy as np
21
- import onnxruntime as ort
22
- from PIL import Image
23
- import requests
24
- from transformers import CLIPModel, CLIPProcessor
25
- import torch
26
-
27
- STAGING = os.path.join(os.path.dirname(__file__), "out-staging", "fashion-clip-vit-b-p32")
28
- SRC = os.path.join(os.path.dirname(__file__), "fashion-clip-src")
29
- MODEL_ONNX = os.path.join(STAGING, "model.onnx")
30
- PROCESSOR_ONNX = os.path.join(STAGING, "clip_image_processor.onnx")
31
-
32
- SAMPLE_IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg" # two cats
33
- SAMPLE_TEXTS = ["a photo of a cat", "a red dress", "blue denim jeans"]
34
-
35
-
36
- def fail(msg):
37
-     print(f"FAIL: {msg}")
38
-     sys.exit(1)
39
-
40
-
41
- def passed(msg):
42
-     print(f"OK : {msg}")
43
-
44
-
45
- def get_sample_image_bytes():
46
-     cache = os.path.join(os.path.dirname(__file__), "sample.jpg")
47
-     if not os.path.exists(cache):
48
-         r = requests.get(SAMPLE_IMAGE_URL, timeout=30)
49
-         r.raise_for_status()
50
-         with open(cache, "wb") as f:
51
-             f.write(r.content)
52
-     return open(cache, "rb").read()
53
-
54
-
55
- def main():
56
-     print("=== Inspecting model.onnx ===")
57
-     sess = ort.InferenceSession(MODEL_ONNX, providers=["CPUExecutionProvider"])
58
-     in_names = [i.name for i in sess.get_inputs()]
59
-     out_names = [o.name for o in sess.get_outputs()]
60
-     print("inputs :", in_names)
61
-     print("outputs:", out_names)
62
-     assert "input_ids" in in_names, "missing input_ids"
63
-     assert "pixel_values" in in_names, "missing pixel_values"
64
-     assert "attention_mask" in in_names, "missing attention_mask"
65
-     assert "text_embeds" in out_names, "missing text_embeds"
66
-     assert "image_embeds" in out_names, "missing image_embeds"
67
-     # Typesense scans outputs and picks the first 2D [-1, N>0] one as the embedding output.
68
-     # logits_per_* have both dims dynamic, so they get skipped; text_embeds should be the pick.
69
-     typesense_chosen = None
70
-     for o in sess.get_outputs():
71
-         shp = o.shape
72
-         if len(shp) == 2 and shp[0] in (None, "text_batch_size", "image_batch_size", -1) and isinstance(shp[1], int) and shp[1] > 0:
73
-             typesense_chosen = (o.name, shp)
74
-             break
75
-     print("typesense would pick:", typesense_chosen)
76
-     if typesense_chosen is None or typesense_chosen[0] != "text_embeds":
77
-         fail(f"expected text_embeds to be the first 2D embedding output, got {typesense_chosen}")
78
-     passed("model.onnx I/O matches Typesense expectations")
79
-
80
-     print()
81
-     print("=== Inspecting clip_image_processor.onnx ===")
82
-     from onnxruntime_extensions import get_library_path
83
-     proc_opts = ort.SessionOptions()
84
-     proc_opts.register_custom_ops_library(get_library_path())
85
-     proc_sess = ort.InferenceSession(PROCESSOR_ONNX, sess_options=proc_opts, providers=["CPUExecutionProvider"])
86
-     pin = [(i.name, i.type, i.shape) for i in proc_sess.get_inputs()]
87
-     pout = [(o.name, o.type, o.shape) for o in proc_sess.get_outputs()]
88
-     print("processor inputs :", pin)
89
-     print("processor outputs:", pout)
90
-     assert pin[0][0] == "image", "processor expects input named 'image'"
91
-     assert pout[0][0] == "last_hidden_state", "processor expects output 'last_hidden_state'"
92
-     passed("clip_image_processor.onnx I/O matches Typesense expectations")
93
-
94
-     print()
95
-     print("=== Running image processor on a real image ===")
96
-     img_bytes = get_sample_image_bytes()
97
-     img_arr = np.frombuffer(img_bytes, dtype=np.uint8)
98
-     proc_out = proc_sess.run(["last_hidden_state"], {"image": img_arr})[0]
99
-     print("processor output shape:", proc_out.shape, "dtype:", proc_out.dtype)
100
-     if proc_out.shape != (1, 3, 224, 224):
101
-         fail(f"expected (1,3,224,224), got {proc_out.shape}")
102
-     passed("image processor returns (1,3,224,224)")
103
-
104
-     print()
105
-     print("=== ONNX vs HF transformers parity check ===")
106
-     # Build reference using the original transformers model.
107
-     hf_model = CLIPModel.from_pretrained(SRC).eval()
108
-     hf_processor = CLIPProcessor.from_pretrained(SRC)
109
-     pil = Image.open(io.BytesIO(img_bytes)).convert("RGB")
110
-
111
-     hf_inputs = hf_processor(text=SAMPLE_TEXTS, images=pil, return_tensors="pt", padding=True)
112
-     with torch.no_grad():
113
-         hf_out = hf_model(**hf_inputs)
114
-     hf_text = hf_out.text_embeds.numpy()
115
-     hf_image = hf_out.image_embeds.numpy()
116
-
117
-     # Now run the same inputs through ONNX model.onnx.
118
-     # Note: ONNX export needs the same pixel_values as HF.
119
-     pixel_values = hf_inputs["pixel_values"].numpy()
120
-     input_ids = hf_inputs["input_ids"].numpy().astype(np.int64)
121
-     attention_mask = hf_inputs["attention_mask"].numpy().astype(np.int64)
122
-
123
-     onnx_out = sess.run(
124
-         ["text_embeds", "image_embeds"],
125
-         {
126
-             "input_ids": input_ids,
127
-             "pixel_values": pixel_values,
128
-             "attention_mask": attention_mask,
129
-         },
130
-     )
131
-     onnx_text, onnx_image = onnx_out
132
-
133
-     def cosine(a, b):
134
-         return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))
135
-
136
-     for i, txt in enumerate(SAMPLE_TEXTS):
137
-         c = cosine(hf_text[i], onnx_text[i])
138
-         print(f" text[{i}] '{txt}': cosine(HF, ONNX) = {c:.6f}")
139
-         if c < 0.999:
140
-             fail(f"text_embeds parity too low: {c}")
141
-     c_img = cosine(hf_image[0], onnx_image[0])
142
-     print(f" image: cosine(HF, ONNX) = {c_img:.6f}")
143
-     if c_img < 0.999:
144
-         fail(f"image_embeds parity too low: {c_img}")
145
-     passed("ONNX text+image embeddings match HF reference (cosine > 0.999)")
146
-
147
-     print()
148
-     print("=== Typesense-style text path (dummy pixel_values = 0.5) ===")
149
-     # Typesense's text_embedder fills pixel_values with 0.5 for text-only queries.
150
-     # The text_embeds output should not depend on pixel_values (the towers are independent),
151
-     # so this should match the real-image text embedding.
152
-     dummy_pixels = np.full((1, 3, 224, 224), 0.5, dtype=np.float32)
153
-     typesense_text = sess.run(
154
-         ["text_embeds"],
155
-         {
156
-             "input_ids": input_ids[:1],
157
-             "attention_mask": attention_mask[:1],
158
-             "pixel_values": dummy_pixels,
159
-         },
160
-     )[0]
161
-     c = cosine(typesense_text[0], onnx_text[0])
162
-     print(f" typesense text path vs onnx text path: cosine = {c:.6f}")
163
-     if c < 0.9999:
164
-         fail(f"text path differs when pixel_values change - towers shouldn't be coupled: {c}")
165
-     passed("text_embeds is independent of pixel_values (Typesense text path safe)")
166
-
167
-     print()
168
-     print("=== Typesense-style image embed path (dummy input_ids) ===")
169
-     # image_embedder.cpp passes input_ids shape [1,1] = [[0]] alongside pixel_values.
170
-     dummy_ids = np.array([[0]], dtype=np.int64)
171
-     dummy_mask = np.array([[1]], dtype=np.int64)
172
-     typesense_image = sess.run(
173
-         ["image_embeds"],
174
-         {
175
-             "input_ids": dummy_ids,
176
-             "pixel_values": pixel_values[:1],
177
-             "attention_mask": dummy_mask,
178
-         },
179
-     )[0]
180
-     c = cosine(typesense_image[0], onnx_image[0])
181
-     print(f" typesense image path vs reference: cosine = {c:.6f}")
182
-     if c < 0.9999:
183
-         fail(f"image_embeds differs when input_ids change - towers shouldn't be coupled: {c}")
184
-     passed("image_embeds is independent of input_ids (Typesense image path safe)")
185
-
186
-     print()
187
-     print("=== Computing MD5 of model.onnx for config.json ===")
188
-     h = hashlib.md5()
189
-     with open(MODEL_ONNX, "rb") as f:
190
-         for chunk in iter(lambda: f.read(1 << 20), b""):
191
-             h.update(chunk)
192
-     print("model.onnx md5:", h.hexdigest())
193
-
194
-     print()
195
-     print("ALL CHECKS PASSED")
196
-
197
-
198
- if __name__ == "__main__":
199
-     main()
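The script above hashes only `model.onnx`, while the bundle's `config.json` is described as carrying MD5 checksums for the shipped files. A minimal sketch of hashing every file in a bundle directory with the same chunked-read approach follows. The helper names are hypothetical, and because the key names `config.json` uses are not shown in this commit, the sketch returns a plain filename-to-digest mapping rather than guessing that schema.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large models never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_checksums(bundle_dir: str) -> dict:
    """Map each regular file directly inside bundle_dir to its MD5 hex digest."""
    checksums = {}
    for name in sorted(os.listdir(bundle_dir)):
        path = os.path.join(bundle_dir, name)
        if os.path.isfile(path):
            checksums[name] = md5_of_file(path)
    return checksums


# Demonstration on a throwaway directory standing in for the staging dir.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "vocab.txt"), "wb") as f:
        f.write(b"hello")
    demo = bundle_checksums(tmp)
```

Pointing `bundle_checksums` at `out-staging/fashion-clip-vit-b-p32` would give digests for the reused artifacts as well, which makes it easy to confirm they really are byte-for-byte copies of the `clip-vit-b-p32` files.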