Attach reproducer + verifier scripts as proof

Files changed (3)

fashion-clip-vit-b-p32/_reproducer/README.md +97 -0
fashion-clip-vit-b-p32/_reproducer/reproduce.sh +300 -0
fashion-clip-vit-b-p32/_reproducer/verify_onnx.py +199 -0

fashion-clip-vit-b-p32/_reproducer/README.md ADDED

	@@ -0,0 +1,97 @@

+# FashionCLIP ONNX export for Typesense
+Builds an ONNX-format copy of [patrickjohncyh/fashion-clip](https://huggingface.co/patrickjohncyh/fashion-clip) packaged in the layout Typesense expects, and verifies it locally before uploading to [typesense/models-moved](https://huggingface.co/typesense/models-moved).
+FashionCLIP is a CLIP-ViT-B/32 fine-tuned on Farfetch fashion images. It is a drop-in replacement for `ts/clip-vit-b-p32` when the corpus is fashion/apparel.
+## Why this exists
+Typesense ships first-party support for OpenAI CLIP ViT-B/32. FashionCLIP uses the identical architecture but is fine-tuned on fashion data, which materially improves retrieval on apparel datasets. Hosting it on `typesense/models-moved` lets Cloud users select it with `model_name: "ts/fashion-clip-vit-b-p32"`.
+Because the architecture matches CLIP ViT-B/32 exactly, `clip_tokenizer.onnx`, `vocab.txt` (CLIP BPE merges), and `clip_image_processor.onnx` are reused byte-for-byte from `clip-vit-b-p32`. Only `model.onnx` is new; `config.json` is regenerated with the current MD5 checksums.
+## Files in this directory
+| File | Purpose |
+|---|---|
+| `verify_onnx.py` | Exports nothing; it runs the staged `model.onnx` and `clip_image_processor.onnx` through onnxruntime, compares the outputs against the HF transformers reference (cosine should be ≈ 1.0; the script fails below 0.999), and validates Typesense's exact I/O contract (input/output tensor names, shapes, independence of text/image towers). |
+| `export.py` | Re-runnable export script: downloads FashionCLIP, exports `model.onnx` via Optimum, pulls reusable artifacts from the existing Typesense CLIP repo, writes `config.json` with current MD5s. |
+| `reproduce.sh` | Full local Docker test: mounts the staged model into a Typesense 29.0 container, creates a `fashion-items` collection with an image embedding field, indexes 5 fashion images, runs text-to-image and image-to-image searches. |
+| `out-staging/fashion-clip-vit-b-p32/` | The exact 5 files to upload to HuggingFace under `typesense/models-moved/fashion-clip-vit-b-p32/`. |
+| `fashion-clip-src/` | Cache of the upstream HF snapshot. Gitignored. |
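The parity check that `verify_onnx.py` performs comes down to a cosine similarity compared against a threshold. The sketch below shows that idea with the standard library only; the helper name `parity_ok` and the 4-dimensional vectors are made up for illustration (the real embeddings are 512-dimensional and the real script uses NumPy).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def parity_ok(reference, candidate, threshold=0.999):
    """True when the candidate points in (almost) the same direction as the reference.

    The 0.999 default mirrors the threshold used in verify_onnx.py.
    """
    return cosine(reference, candidate) >= threshold


# Illustrative values only, not real model outputs.
reference = [0.10, -0.20, 0.30, 0.40]
near_copy = [0.10, -0.20, 0.30, 0.40001]
shuffled = [0.40, 0.30, -0.20, 0.10]

print("near copy passes:", parity_ok(reference, near_copy))
print("shuffled passes :", parity_ok(reference, shuffled))
```

Cosine similarity ignores vector length, so this check confirms direction only; that is usually what matters for retrieval with cosine distance.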
+## What ships in the HF bundle
+```
+fashion-clip-vit-b-p32/
+├── model.onnx                 # FashionCLIP weights (~578 MB)
+├── clip_tokenizer.onnx        # Reused from clip-vit-b-p32 (1.3 MB)
+├── clip_image_processor.onnx  # Reused from clip-vit-b-p32 (~4 KB)
+├── vocab.txt                  # CLIP BPE merges, reused (3.0 MB)
+└── config.json                # Typesense metadata + MD5 checksums
+```
+The HF README at `typesense/models-moved/fashion-clip-vit-b-p32/README.md` should credit `patrickjohncyh/fashion-clip` upstream.
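Since `config.json` carries MD5 checksums and three files are meant to be byte-identical to the `clip-vit-b-p32` copies, both properties can be checked locally before upload. This is a minimal sketch: `bundle_checksums` and `same_bytes` are hypothetical helper names, and the layout of `config.json` itself is not shown in this README, so the sketch only prints digests rather than writing that file.

```python
import filecmp
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large files (model.onnx) are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_checksums(bundle_dir):
    """Map every regular file in bundle_dir to its MD5 hex digest."""
    return {
        name: md5_of(os.path.join(bundle_dir, name))
        for name in sorted(os.listdir(bundle_dir))
        if os.path.isfile(os.path.join(bundle_dir, name))
    }


def same_bytes(path_a, path_b):
    """True only if both files have identical contents (full byte comparison)."""
    return filecmp.cmp(path_a, path_b, shallow=False)


if __name__ == "__main__":
    # Path taken from this README; the block is a no-op if the directory is absent.
    staging = os.path.join("out-staging", "fashion-clip-vit-b-p32")
    if os.path.isdir(staging):
        for name, digest in bundle_checksums(staging).items():
            print(f"{digest}  {name}")
```

Comparing a reused file against a local copy of the `clip-vit-b-p32` original with `same_bytes` is a stronger check than comparing digests, although matching MD5s are normally sufficient in practice.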
+## Verification status
+Run from this directory:
+```powershell
+# 1. ONNX runtime sanity (no Docker required)
+.venv/Scripts/python.exe verify_onnx.py
+# 2. Live Typesense Docker test (requires working Docker)
+bash reproduce.sh
+```
+**ONNX sanity (verify_onnx.py)** confirms:
+- `model.onnx` has the inputs Typesense uses (`input_ids`, `pixel_values`, `attention_mask`)
+- `model.onnx` outputs `text_embeds` and `image_embeds` at `[B, 512]`
+- The first 2D output Typesense's auto-detect would pick is `text_embeds` (the dynamic-dim `logits_per_*` outputs get skipped, matching `clip-vit-b-p32` behavior)
+- `clip_image_processor.onnx` accepts raw image bytes and returns `(1, 3, 224, 224)`
+- ONNX outputs match the HF transformers reference at cosine ≈ 1.0 for both text and image (the script fails below 0.999)
+- The text tower is independent of `pixel_values` (Typesense's text path uses dummy 0.5 pixels)
+- The image tower is independent of `input_ids` (Typesense's image path uses dummy `[[0]]` ids)
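The auto-detect rule mentioned in the list above (pick the first 2D output whose batch dimension is dynamic and whose width is a fixed positive integer) can be sketched in plain Python. This mirrors the check in `verify_onnx.py`, not Typesense's source code; `pick_embedding_output` is a hypothetical name, and the output order below is an assumption chosen to match the described behavior.

```python
def is_dynamic(dim):
    """A dimension counts as dynamic when it is None, -1, or a symbolic name."""
    return dim is None or dim == -1 or isinstance(dim, str)


def pick_embedding_output(outputs):
    """Return the name of the first output shaped [dynamic, N] with N > 0, else None.

    `outputs` is a list of (name, shape) pairs, e.g. built from
    [(o.name, o.shape) for o in session.get_outputs()] in onnxruntime.
    """
    for name, shape in outputs:
        if len(shape) != 2:
            continue
        batch, width = shape
        if is_dynamic(batch) and isinstance(width, int) and width > 0:
            return name
    return None


# Assumed output order; dimension names follow those checked in verify_onnx.py.
clip_outputs = [
    ("logits_per_image", ["image_batch_size", "text_batch_size"]),
    ("logits_per_text", ["text_batch_size", "image_batch_size"]),
    ("text_embeds", ["text_batch_size", 512]),
    ("image_embeds", ["image_batch_size", 512]),
]

print(pick_embedding_output(clip_outputs))
```

Both `logits_per_*` outputs are skipped because their second dimension is symbolic, which is why `text_embeds` is selected ahead of them.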
+**Live Docker test (reproduce.sh)** confirms:
+- Typesense loads `fashion-clip-vit-b-p32` from the local models dir without errors
+- The CLIP tokenizer + image processor work end to end
+- Top result for `"a red dress"` is the red cocktail dress
+- Top result for `"comfortable shoes"` is the white sneakers
+- Image-to-image search with `red-dress.jpg` puts the red dress first
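`reproduce.sh` prints the hits and leaves the comparison to the reader. If an automated pass/fail is wanted, the expectations above could be asserted against the `multi_search` response, roughly as sketched here. The field paths (`results[0].hits[].document` and `vector_distance`) are the ones the script's `jq` filter reads; `check_top_hit` is a hypothetical helper and the distances in the sample are invented.

```python
def top_hit(response, search_index=0):
    """Return (id, title, vector_distance) of the best hit, or None if there are no hits."""
    hits = response["results"][search_index]["hits"]
    if not hits:
        return None
    first = hits[0]
    doc = first["document"]
    return doc["id"], doc["title"], first.get("vector_distance")


def check_top_hit(response, expected_id, search_index=0):
    """Return (ok, message) describing whether expected_id is ranked first."""
    hit = top_hit(response, search_index)
    if hit is None:
        return False, "no hits returned"
    doc_id, title, _distance = hit
    if doc_id != expected_id:
        return False, f"expected id={expected_id}, got id={doc_id} ({title})"
    return True, f"id={doc_id} ({title}) ranked first"


# Hand-written sample shaped like a multi_search response; distances are made up.
sample_response = {
    "results": [
        {
            "hits": [
                {"document": {"id": "1", "title": "Red cocktail dress"}, "vector_distance": 0.71},
                {"document": {"id": "5", "title": "Knit wool sweater"}, "vector_distance": 0.78},
                {"document": {"id": "4", "title": "Brown leather handbag"}, "vector_distance": 0.80},
            ]
        }
    ]
}

print(check_top_hit(sample_response, "1"))
print(check_top_hit(sample_response, "3"))
```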
+## Upload to HuggingFace (via PR)
+We don't have direct write access to `typesense/models-moved`, so we open a community PR. The `hf` CLI (huggingface_hub >= 0.34) is the supported tool.
+```bash
+hf auth login   # paste a token from https://huggingface.co/settings/tokens
+hf upload typesense/models-moved \
+  ./out-staging/fashion-clip-vit-b-p32 \
+  fashion-clip-vit-b-p32 \
+  --repo-type model \
+  --create-pr \
+  --commit-message "Add fashion-clip-vit-b-p32 (ONNX of patrickjohncyh/fashion-clip)" \
+  --commit-description "FashionCLIP fine-tuned on Farfetch; CLIP-ViT-B/32 architecture so vocab.txt and clip_image_processor.onnx are reused byte-for-byte from clip-vit-b-p32."
+```
+The PR appears under https://huggingface.co/typesense/models-moved/discussions. A repo maintainer reviews and merges it.
+After the merge, Typesense Cloud users reference the model in a field definition:
+```json
+{
+  "name": "embedding",
+  "type": "float[]",
+  "embed": {
+    "from": ["image"],
+    "model_config": {"model_name": "ts/fashion-clip-vit-b-p32"}
+  }
+}
+```
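The only difference between the local test and the hosted model is the `model_name` value: `fashion-clip-vit-b-p32` for the local copy used by `reproduce.sh`, `ts/fashion-clip-vit-b-p32` once it is hosted. A small sketch of building the same collection schema for either case; the function names are hypothetical, and the fields are copied from the collection that `reproduce.sh` creates.

```python
import json

LOCAL_MODEL = "fashion-clip-vit-b-p32"   # loaded from the local models dir
HOSTED_MODEL = "ts/" + LOCAL_MODEL       # resolved from typesense/models-moved


def embedding_field(model_name, source_field="image", name="embedding"):
    """Auto-embedding field that embeds `source_field` with the given model."""
    return {
        "name": name,
        "type": "float[]",
        "embed": {
            "from": [source_field],
            "model_config": {"model_name": model_name},
        },
    }


def collection_schema(model_name, collection="fashion-items"):
    """Schema matching the collection created in reproduce.sh."""
    return {
        "name": collection,
        "fields": [
            {"name": "title", "type": "string"},
            {"name": "image", "type": "image", "store": False},
            embedding_field(model_name),
        ],
    }


print(json.dumps(collection_schema(HOSTED_MODEL), indent=2))
```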
+## License notes
+- FashionCLIP is released under MIT (per the upstream repo card). The ONNX export inherits that license. The HF README on `typesense/models-moved` should preserve the citation:
+  > Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Goncalves, D., Greco, C., Tagliabue, J. *Contrastive language and vision learning of general fashion concepts*, Scientific Reports 12, 18958 (2022).

fashion-clip-vit-b-p32/_reproducer/reproduce.sh ADDED

	@@ -0,0 +1,300 @@

+#!/bin/bash
+# Reproducer: verify a self-exported FashionCLIP ONNX model works end to end in
+# a local Typesense container before pushing it to typesense/models-moved.
+#
+# What this script does:
+#   1. Spins up a Typesense Docker container on port 8108.
+#   2. Copies ./out-staging/fashion-clip-vit-b-p32/ into the models/ folder of the
+#      data dir mounted into the container, so Typesense treats it as a LOCAL
+#      model (no namespace prefix, no download).
+#   3. Creates a collection with an image embedding field that points at the model.
+#   4. Indexes a tiny set of fashion images.
+#   5. Runs a text-to-image search ("a red dress") and an image-to-image search to
+#      confirm the model returns sensible top results.
+#
+# Notes:
+#   - This validates the same load path that Typesense uses for ts/clip-vit-b-p32.
+#   - Once the files are uploaded to https://huggingface.co/typesense/models-moved
+#     under fashion-clip-vit-b-p32/, Typesense Cloud users can reference it as
+#     model_name: "ts/fashion-clip-vit-b-p32". The local-mode test here is
+#     functionally equivalent: same code path, same config, same I/O.
+set -e
+# On Git Bash / MSYS2 (Windows), `/data` gets auto-translated to
+# `C:/Program Files/Git/data` which breaks Docker volume mounts and CLI args.
+# Disable that translation for the whole script.
+export MSYS_NO_PATHCONV=1
+export MSYS2_ARG_CONV_EXCL='*'
+### Configuration ##############################################################
+TYPESENSE_API_KEY=xyz
+PORT=8108
+TYPESENSE_HOST=http://localhost:${PORT}
+CONTAINER_NAME=typesense-fashion-clip-onnx
+TYPESENSE_VERSION=29.0
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+STAGING_DIR="${SCRIPT_DIR}/out-staging/fashion-clip-vit-b-p32"
+DATA_DIR="${SCRIPT_DIR}/typesense-data-${CONTAINER_NAME}"
+if [ ! -f "${STAGING_DIR}/model.onnx" ]; then
+  echo "ERROR: staging dir missing model.onnx; run the export step first (export.py)"
+  exit 1
+fi
+# Python image helper below uses this.
+export SCRIPT_DIR
+### Cleanup ####################################################################
+cleanup() {
+  echo ""
+  echo "=== Cleanup ==="
+  docker stop ${CONTAINER_NAME} 2>/dev/null || true
+  docker rm ${CONTAINER_NAME} 2>/dev/null || true
+  # Container ran as root and wrote into the mounted data dir; on Windows the
+  # host user can't `rm` those files. Use a throwaway root container to wipe
+  # the dir contents as the same UID, then drop the now-empty dir.
+  if [ -d "${DATA_DIR}" ]; then
+    if command -v cygpath > /dev/null 2>&1; then
+      _w=$(cygpath -w "${DATA_DIR}")
+    else
+      _w="${DATA_DIR}"
+    fi
+    docker run --rm -v "${_w}:/data" alpine:3 sh -c "rm -rf /data/* /data/.??*" 2>/dev/null || true
+    rmdir "${DATA_DIR}" 2>/dev/null || true
+  fi
+  echo "Cleanup complete"
+}
+trap cleanup EXIT
+### Setup Typesense ############################################################
+echo "=== Setting up Typesense ${TYPESENSE_VERSION} ==="
+docker stop ${CONTAINER_NAME} 2>/dev/null || true
+docker rm ${CONTAINER_NAME} 2>/dev/null || true
+mkdir -p "${DATA_DIR}/models"
+# Place the model files under the Typesense data dir at models/fashion-clip-vit-b-p32/
+# (no ts_ prefix => loaded as a local model rather than a public download).
+cp -R "${STAGING_DIR}" "${DATA_DIR}/models/"
+# Docker on Windows wants a Windows-style host path for -v mounts.
+if command -v cygpath > /dev/null 2>&1; then
+  WIN_DATA_DIR=$(cygpath -w "${DATA_DIR}")
+else
+  WIN_DATA_DIR="${DATA_DIR}"
+fi
+docker run -d \
+  --name ${CONTAINER_NAME} \
+  -p ${PORT}:8108 \
+  -v "${WIN_DATA_DIR}:/data" \
+  typesense/typesense:${TYPESENSE_VERSION} \
+  --data-dir=/data \
+  --api-key=${TYPESENSE_API_KEY} \
+  --enable-cors
+echo "Waiting for Typesense to be ready..."
+# /health returns {"ok": true} only once Raft has elected a leader. Poll for the
+# JSON body rather than just exit code so we don't race with leader election.
+for _ in $(seq 1 60); do
+  body=$(curl -s "${TYPESENSE_HOST}/health" 2>/dev/null || true)
+  if echo "$body" | grep -q '"ok":true'; then break; fi
+  sleep 1
+done
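+# Belt-and-suspenders: also confirm /debug reports "state":1 (Raft is ready
+# enough to accept writes). Up to v29, /health alone can lie under fast restart.
+ready=0
+for _ in $(seq 1 30); do
+  debug_body=$(curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "${TYPESENSE_HOST}/debug" 2>/dev/null || true)
+  if echo "$debug_body" | grep -q '"state":1'; then ready=1; break; fi
+  sleep 1
+done
+# Fail loudly instead of continuing against a server that never became ready.
+if [ "$ready" -ne 1 ]; then
+  echo "ERROR: Typesense did not become ready in time. Recent logs:"
+  docker logs --tail 80 ${CONTAINER_NAME}
+  exit 1
+fi
+echo "Typesense is ready"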
+wait_for_collection() {
+  local collection=$1
+  local max_wait=${2:-30}
+  local count=0
+  while [ $count -lt $max_wait ]; do
+    if curl -s -o /dev/null -w "%{http_code}" "${TYPESENSE_HOST}/collections/${collection}" \
+      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null | grep -q "200"; then
+      return 0
+    fi
+    sleep 1
+    count=$((count + 1))
+  done
+  echo "WARNING: Collection '${collection}' not ready after ${max_wait}s"
+  return 1
+}
+### Create collection ##########################################################
+echo ""
+echo "=== Creating fashion-items collection ==="
+CREATE_RESP=$(curl -s "${TYPESENSE_HOST}/collections" \
+  -X POST \
+  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "fashion-items",
+    "fields": [
+      {"name": "title",  "type": "string"},
+      {"name": "image",  "type": "image", "store": false},
+      {
+        "name": "embedding",
+        "type": "float[]",
+        "embed": {
+          "from": ["image"],
+          "model_config": {"model_name": "fashion-clip-vit-b-p32"}
+        }
+      }
+    ]
+  }')
+echo "create response: $CREATE_RESP"
+if ! echo "$CREATE_RESP" | jq -e '.name' > /dev/null; then
+  echo ""
+  echo "Collection creation FAILED. Recent Typesense logs:"
+  docker logs --tail 80 ${CONTAINER_NAME}
+  exit 1
+fi
+wait_for_collection "fashion-items"
+### Index a small fashion dataset ##############################################
+echo ""
+echo "=== Preparing sample fashion images ==="
+mkdir -p "${SCRIPT_DIR}/sample-images"
+# Python on Windows can't resolve /c/Users style paths, convert to C:\... form.
+if command -v cygpath > /dev/null 2>&1; then
+  WIN_SCRIPT_DIR=$(cygpath -w "${SCRIPT_DIR}")
+else
+  WIN_SCRIPT_DIR="${SCRIPT_DIR}"
+fi
+export SCRIPT_DIR="${WIN_SCRIPT_DIR}"
+# Use Python from the project venv to fetch real photos (urllib with a sane
+# User-Agent), falling back to solid-color Pillow placeholders that still
+# exercise the embedder.
+PY="${SCRIPT_DIR}/.venv/Scripts/python.exe"
+"$PY" - <<'PYEOF'
+import os, urllib.request
+from PIL import Image, ImageDraw
+OUT = os.path.join(os.environ['SCRIPT_DIR'], 'sample-images')
+os.makedirs(OUT, exist_ok=True)
+ITEMS = [
+    ('red-dress',        (200,  20,  40), 'https://images.pexels.com/photos/985635/pexels-photo-985635.jpeg?w=400'),
+    ('blue-jeans',       ( 30,  60, 140), 'https://images.pexels.com/photos/52518/jeans-pants-blue-pocket-52518.jpeg?w=400'),
+    ('white-sneakers',   (240, 240, 240), 'https://images.pexels.com/photos/1102776/pexels-photo-1102776.jpeg?w=400'),
+    ('leather-handbag',  (110,  60,  30), 'https://images.pexels.com/photos/1152077/pexels-photo-1152077.jpeg?w=400'),
+    ('wool-sweater',     (180, 140,  90), 'https://images.pexels.com/photos/1721934/pexels-photo-1721934.jpeg?w=400'),
+]
+for name, color, url in ITEMS:
+    path = os.path.join(OUT, name + '.jpg')
+    if os.path.exists(path) and os.path.getsize(path) > 1024:
+        continue
+    try:
+        req = urllib.request.Request(url, headers={'User-Agent':'Mozilla/5.0'})
+        with urllib.request.urlopen(req, timeout=20) as r:
+            data = r.read()
+        with open(path,'wb') as f:
+            f.write(data)
+        print(f'  fetched {name}: {len(data)} bytes')
+    except Exception as e:
+        # placeholder
+        img = Image.new('RGB', (400, 400), color)
+        d = ImageDraw.Draw(img)
+        d.text((10, 10), name, fill=(255,255,255) if sum(color)<400 else (0,0,0))
+        img.save(path, 'JPEG', quality=85)
+        print(f'  placeholder {name} (fetch failed: {e})')
+PYEOF
+echo ""
+echo "=== Indexing fashion items ==="
+build_doc() {
+  local id=$1 title=$2 file=$3
+  local b64
+  b64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/${file}.jpg")
+  printf '{"id":"%s","title":"%s","image":"%s"}\n' "$id" "$title" "$b64"
+}
+{
+  build_doc 1 "Red cocktail dress"        red-dress
+  build_doc 2 "Blue denim jeans"          blue-jeans
+  build_doc 3 "White leather sneakers"    white-sneakers
+  build_doc 4 "Brown leather handbag"     leather-handbag
+  build_doc 5 "Knit wool sweater"         wool-sweater
+} | curl -s "${TYPESENSE_HOST}/collections/fashion-items/documents/import?action=create" \
+    -X POST \
+    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+    -H "Content-Type: text/plain" \
+    --data-binary @-
+echo ""
+### Text-to-image search #######################################################
+echo "=== Text-to-image search: 'a red dress' ==="
+curl -s "${TYPESENSE_HOST}/multi_search" \
+  -X POST \
+  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+  -d '{
+    "searches": [
+      {
+        "collection": "fashion-items",
+        "q": "a red dress",
+        "query_by": "embedding",
+        "prefix": false,
+        "include_fields": "id,title",
+        "per_page": 3
+      }
+    ]
+  }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
+echo ""
+echo "=== Text-to-image search: 'comfortable shoes' ==="
+curl -s "${TYPESENSE_HOST}/multi_search" \
+  -X POST \
+  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+  -d '{
+    "searches": [
+      {
+        "collection": "fashion-items",
+        "q": "comfortable shoes",
+        "query_by": "embedding",
+        "prefix": false,
+        "include_fields": "id,title",
+        "per_page": 3
+      }
+    ]
+  }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
+### Image-to-image search ######################################################
+echo ""
+echo "=== Image-to-image search using red-dress.jpg ==="
+QUERY_IMG_B64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/red-dress.jpg")
+curl -s "${TYPESENSE_HOST}/multi_search" \
+  -X POST \
+  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+  -H "Content-Type: application/json" \
+  -d "$(jq -n --arg img "$QUERY_IMG_B64" '{
+    searches: [{
+      collection: "fashion-items",
+      q: "*",
+      vector_query: ("embedding:([], queries: [\"" + $img + "\"], k: 3)"),
+      include_fields: "id,title"
+    }]
+  }')" | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'
+### Container logs #############################################################
+echo ""
+echo "=== Typesense Logs (last 40 lines) ==="
+docker logs --tail 40 ${CONTAINER_NAME}
+echo ""
+echo "=== VERIFICATION ==="
+echo "Expect text query 'a red dress' to put id=1 (Red cocktail dress) first."
+echo "Expect text query 'comfortable shoes' to put id=3 (White leather sneakers) first."
+echo "Expect image-to-image query with red-dress.jpg to put id=1 first."
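The `build_doc` helper above formats each JSONL line with `printf`, which works for these five titles but would produce invalid JSON for a title containing a double quote or a backslash. If the sample set grows, the import body could be built with a JSON serializer instead. A minimal Python sketch, with hypothetical function names and placeholder image bytes:

```python
import base64
import json


def build_doc(doc_id, title, image_bytes):
    """One JSONL line: the image is base64-encoded, the title is escaped by json.dumps."""
    return json.dumps(
        {
            "id": str(doc_id),
            "title": title,
            "image": base64.b64encode(image_bytes).decode("ascii"),
        }
    )


def build_import_body(items):
    """Join (id, title, image_bytes) triples into a newline-delimited import body."""
    return "\n".join(build_doc(*item) for item in items) + "\n"


# Placeholder bytes stand in for real JPEG contents.
body = build_import_body(
    [
        (1, "Red cocktail dress", b"\xff\xd8fake-jpeg-1"),
        (2, 'Blue "denim" jeans', b"\xff\xd8fake-jpeg-2"),
    ]
)
print(body, end="")
```

The resulting string is what would be sent as the request body to the `documents/import` endpoint used in the script.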

fashion-clip-vit-b-p32/_reproducer/verify_onnx.py ADDED

	@@ -0,0 +1,199 @@

+"""
+End-to-end sanity test for the FashionCLIP ONNX export.
+Replicates the exact I/O contract Typesense uses:
+- model.onnx text path: input_ids (int64) + pixel_values (float32 [1,3,224,224] dummy 0.5) + attention_mask
+                        -> read 2D [-1, 512] output (text_embeds)
+- model.onnx image path: input_ids dummy + pixel_values + attention_mask
+                         -> read image_embeds [B, 512]
+- clip_image_processor.onnx: image bytes (uint8 [N]) -> last_hidden_state [1,3,224,224]
+Also cross-checks ONNX outputs against the original HF transformers FashionCLIP forward pass
+to confirm the export preserves semantics (cosine similarity > 0.999).
+"""
+import io
+import os
+import sys
+import hashlib
+import numpy as np
+import onnxruntime as ort
+from PIL import Image
+import requests
+from transformers import CLIPModel, CLIPProcessor
+import torch
+STAGING = os.path.join(os.path.dirname(__file__), "out-staging", "fashion-clip-vit-b-p32")
+SRC = os.path.join(os.path.dirname(__file__), "fashion-clip-src")
+MODEL_ONNX = os.path.join(STAGING, "model.onnx")
+PROCESSOR_ONNX = os.path.join(STAGING, "clip_image_processor.onnx")
+SAMPLE_IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
+SAMPLE_TEXTS = ["a photo of a cat", "a red dress", "blue denim jeans"]
+def fail(msg):
+    print(f"FAIL: {msg}")
+    sys.exit(1)
+def passed(msg):
+    print(f"OK  : {msg}")
+def get_sample_image_bytes():
+    cache = os.path.join(os.path.dirname(__file__), "sample.jpg")
+    if not os.path.exists(cache):
+        r = requests.get(SAMPLE_IMAGE_URL, timeout=30)
+        r.raise_for_status()
+        with open(cache, "wb") as f:
+            f.write(r.content)
+    # Read via a context manager so the file handle is closed promptly.
+    with open(cache, "rb") as f:
+        return f.read()
+def main():
+    print("=== Inspecting model.onnx ===")
+    sess = ort.InferenceSession(MODEL_ONNX, providers=["CPUExecutionProvider"])
+    in_names = [i.name for i in sess.get_inputs()]
+    out_names = [o.name for o in sess.get_outputs()]
+    print("inputs :", in_names)
+    print("outputs:", out_names)
+    assert "input_ids" in in_names, "missing input_ids"
+    assert "pixel_values" in in_names, "missing pixel_values"
+    assert "attention_mask" in in_names, "missing attention_mask"
+    assert "text_embeds" in out_names, "missing text_embeds"
+    assert "image_embeds" in out_names, "missing image_embeds"
+    # Typesense scans outputs and picks the first 2D [-1, N>0] one as the embedding output.
+    # logits_per_* have both dims dynamic, so they get skipped; text_embeds should be the pick.
+    typesense_chosen = None
+    for o in sess.get_outputs():
+        shp = o.shape
+        if len(shp) == 2 and shp[0] in (None, "text_batch_size", "image_batch_size", -1) and isinstance(shp[1], int) and shp[1] > 0:
+            typesense_chosen = (o.name, shp)
+            break
+    print("typesense would pick:", typesense_chosen)
+    if typesense_chosen is None or typesense_chosen[0] != "text_embeds":
+        fail(f"expected text_embeds to be the first 2D embedding output, got {typesense_chosen}")
+    passed("model.onnx I/O matches Typesense expectations")
+    print()
+    print("=== Inspecting clip_image_processor.onnx ===")
+    from onnxruntime_extensions import get_library_path
+    proc_opts = ort.SessionOptions()
+    proc_opts.register_custom_ops_library(get_library_path())
+    proc_sess = ort.InferenceSession(PROCESSOR_ONNX, sess_options=proc_opts, providers=["CPUExecutionProvider"])
+    pin = [(i.name, i.type, i.shape) for i in proc_sess.get_inputs()]
+    pout = [(o.name, o.type, o.shape) for o in proc_sess.get_outputs()]
+    print("processor inputs :", pin)
+    print("processor outputs:", pout)
+    assert pin[0][0] == "image", "processor expects input named 'image'"
+    assert pout[0][0] == "last_hidden_state", "processor expects output 'last_hidden_state'"
+    passed("clip_image_processor.onnx I/O matches Typesense expectations")
+    print()
+    print("=== Running image processor on a real image ===")
+    img_bytes = get_sample_image_bytes()
+    img_arr = np.frombuffer(img_bytes, dtype=np.uint8)
+    proc_out = proc_sess.run(["last_hidden_state"], {"image": img_arr})[0]
+    print("processor output shape:", proc_out.shape, "dtype:", proc_out.dtype)
+    if proc_out.shape != (1, 3, 224, 224):
+        fail(f"expected (1,3,224,224), got {proc_out.shape}")
+    passed("image processor returns (1,3,224,224)")
+    print()
+    print("=== ONNX vs HF transformers parity check ===")
+    # Build reference using the original transformers model.
+    hf_model = CLIPModel.from_pretrained(SRC).eval()
+    hf_processor = CLIPProcessor.from_pretrained(SRC)
+    pil = Image.open(io.BytesIO(img_bytes)).convert("RGB")
+    hf_inputs = hf_processor(text=SAMPLE_TEXTS, images=pil, return_tensors="pt", padding=True)
+    with torch.no_grad():
+        hf_out = hf_model(**hf_inputs)
+    hf_text = hf_out.text_embeds.numpy()
+    hf_image = hf_out.image_embeds.numpy()
+    # Now run the same inputs through ONNX model.onnx.
+    # Note: ONNX export needs the same pixel_values as HF.
+    pixel_values = hf_inputs["pixel_values"].numpy()
+    input_ids = hf_inputs["input_ids"].numpy().astype(np.int64)
+    attention_mask = hf_inputs["attention_mask"].numpy().astype(np.int64)
+    onnx_out = sess.run(
+        ["text_embeds", "image_embeds"],
+        {
+            "input_ids": input_ids,
+            "pixel_values": pixel_values,
+            "attention_mask": attention_mask,
+        },
+    )
+    onnx_text, onnx_image = onnx_out
+    def cosine(a, b):
+        return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+    for i, txt in enumerate(SAMPLE_TEXTS):
+        c = cosine(hf_text[i], onnx_text[i])
+        print(f"  text[{i}] '{txt}': cosine(HF, ONNX) = {c:.6f}")
+        if c < 0.999:
+            fail(f"text_embeds parity too low: {c}")
+    c_img = cosine(hf_image[0], onnx_image[0])
+    print(f"  image: cosine(HF, ONNX) = {c_img:.6f}")
+    if c_img < 0.999:
+        fail(f"image_embeds parity too low: {c_img}")
+    passed("ONNX text+image embeddings match HF reference (cosine > 0.999)")
+    print()
+    print("=== Typesense-style text path (dummy pixel_values = 0.5) ===")
+    # Typesense's text_embedder fills pixel_values with 0.5 for text-only queries.
+    # The text_embeds output should not depend on pixel_values (the towers are independent),
+    # so this should match the real-image text embedding.
+    dummy_pixels = np.full((1, 3, 224, 224), 0.5, dtype=np.float32)
+    typesense_text = sess.run(
+        ["text_embeds"],
+        {
+            "input_ids": input_ids[:1],
+            "attention_mask": attention_mask[:1],
+            "pixel_values": dummy_pixels,
+        },
+    )[0]
+    c = cosine(typesense_text[0], onnx_text[0])
+    print(f"  typesense text path vs onnx text path: cosine = {c:.6f}")
+    if c < 0.9999:
+        fail(f"text path differs when pixel_values change - towers shouldn't be coupled: {c}")
+    passed("text_embeds is independent of pixel_values (Typesense text path safe)")
+    print()
+    print("=== Typesense-style image embed path (dummy input_ids) ===")
+    # image_embedder.cpp passes input_ids shape [1,1] = [[0]] alongside pixel_values.
+    dummy_ids = np.array([[0]], dtype=np.int64)
+    dummy_mask = np.array([[1]], dtype=np.int64)
+    typesense_image = sess.run(
+        ["image_embeds"],
+        {
+            "input_ids": dummy_ids,
+            "pixel_values": pixel_values[:1],
+            "attention_mask": dummy_mask,
+        },
+    )[0]
+    c = cosine(typesense_image[0], onnx_image[0])
+    print(f"  typesense image path vs reference: cosine = {c:.6f}")
+    if c < 0.9999:
+        fail(f"image_embeds differs when input_ids change - towers shouldn't be coupled: {c}")
+    passed("image_embeds is independent of input_ids (Typesense image path safe)")
+    print()
+    print("=== Computing MD5 of model.onnx for config.json ===")
+    h = hashlib.md5()
+    with open(MODEL_ONNX, "rb") as f:
+        for chunk in iter(lambda: f.read(1 << 20), b""):
+            h.update(chunk)
+    print("model.onnx md5:", h.hexdigest())
+    print()
+    print("ALL CHECKS PASSED")
+if __name__ == "__main__":
+    main()