Attach reproducer + verifier scripts as proof
fashion-clip-vit-b-p32/_reproducer/README.md
ADDED
@@ -0,0 +1,97 @@
# FashionCLIP ONNX export for Typesense

Builds an ONNX-format copy of [patrickjohncyh/fashion-clip](https://huggingface.co/patrickjohncyh/fashion-clip) packaged in the layout Typesense expects, and verifies it locally before uploading to [typesense/models-moved](https://huggingface.co/typesense/models-moved).

FashionCLIP is a CLIP ViT-B/32 model fine-tuned on Farfetch fashion images. It is a drop-in replacement for `ts/clip-vit-b-p32` when the corpus is fashion/apparel.

## Why this exists

Typesense ships first-party support for OpenAI CLIP ViT-B/32. FashionCLIP uses the identical architecture but is fine-tuned on fashion data, which materially improves retrieval on apparel datasets. Hosting it on `typesense/models-moved` lets Cloud users select it with `model_name: "ts/fashion-clip-vit-b-p32"`.

Because the architecture matches CLIP ViT-B/32 exactly, `clip_tokenizer.onnx`, `vocab.txt` (CLIP BPE merges), and `clip_image_processor.onnx` are reused byte-for-byte from `clip-vit-b-p32`. Only `model.onnx` is new; `config.json` is rewritten with the current MD5 checksums.

## Files in this directory

| File | Purpose |
|---|---|
| `verify_onnx.py` | Exports nothing. It runs the staged `model.onnx` and `clip_image_processor.onnx` through onnxruntime, compares the outputs against the HF transformers reference (cosine similarity must be at least 0.999), and validates Typesense's exact I/O contract (input/output tensor names, shapes, independence of the text and image towers). |
| `export.py` | Re-runnable export script: downloads FashionCLIP, exports `model.onnx` via Optimum, pulls reusable artifacts from the existing Typesense CLIP repo, writes `config.json` with current MD5s. |
| `reproduce.sh` | Full local Docker test: mounts the staged model into a Typesense 29.0 container, creates a `fashion-items` collection with an image embedding field, indexes 5 fashion images, runs text-to-image and image-to-image searches. |
| `out-staging/fashion-clip-vit-b-p32/` | The exact 5 files to upload to HuggingFace under `typesense/models-moved/fashion-clip-vit-b-p32/`. |
| `fashion-clip-src/` | Cache of the upstream HF snapshot. Gitignored. |

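The parity check that `verify_onnx.py` performs boils down to a cosine-similarity threshold. The following is a minimal sketch in plain Python, assuming toy 3-dimensional vectors in place of the real 512-dimensional embeddings; `parity_ok` is a hypothetical helper name, and the 0.999 threshold is the one the script uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parity_ok(reference, candidate, threshold=0.999):
    """True when the candidate embedding points the same way as the reference."""
    return cosine(reference, candidate) >= threshold


# Toy vectors standing in for real embeddings (illustrative values only).
hf_embedding = [0.12, -0.48, 0.87]
onnx_embedding = [0.12, -0.48, 0.87]
drifted_embedding = [0.87, 0.12, -0.48]

print("matching export passes:", parity_ok(hf_embedding, onnx_embedding))
print("drifted export passes :", parity_ok(hf_embedding, drifted_embedding))
```

Cosine similarity ignores vector length, so the check tolerates a uniform scaling of the embedding but not a change in its direction.
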
## What ships in the HF bundle

```
fashion-clip-vit-b-p32/
├── model.onnx                  # FashionCLIP weights (~578 MB)
├── clip_tokenizer.onnx         # Reused from clip-vit-b-p32 (1.3 MB)
├── clip_image_processor.onnx   # Reused from clip-vit-b-p32 (~4 KB)
├── vocab.txt                   # CLIP BPE merges, reused (3.0 MB)
└── config.json                 # Typesense metadata + MD5 checksums
```

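The MD5 checksums recorded in `config.json` can be computed with a chunked read, the same approach `verify_onnx.py` uses for `model.onnx`. This is a sketch only: `md5_of_file` and `checksums_for_dir` are hypothetical helper names, the demo hashes a throwaway file instead of the real staging folder, and the key names Typesense expects inside `config.json` are not shown here because they are not part of this README.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a ~578 MB model never sits fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_for_dir(directory):
    """Map each regular file in the directory to its MD5 hex digest."""
    result = {}
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):
            result[name] = md5_of_file(full_path)
    return result


# Demo on a temporary directory with one small file (not the real bundle).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "vocab.txt"), "wb") as f:
        f.write(b"hello")
    demo = checksums_for_dir(tmp)

print(demo)
```
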
The HF README at `typesense/models-moved/fashion-clip-vit-b-p32/README.md` should credit `patrickjohncyh/fashion-clip` upstream.

## Verification status

Run from this directory:

```powershell
# 1. ONNX runtime sanity (no Docker required)
.venv/Scripts/python.exe verify_onnx.py

# 2. Live Typesense Docker test (requires working Docker)
bash reproduce.sh
```

**ONNX sanity (verify_onnx.py)** confirms:
- `model.onnx` has the inputs Typesense uses (`input_ids`, `pixel_values`, `attention_mask`)
- `model.onnx` outputs `text_embeds` and `image_embeds` at `[B, 512]`
- The first 2D output Typesense's auto-detect would pick is `text_embeds` (the `logits_per_*` outputs have both dimensions dynamic and get skipped, matching `clip-vit-b-p32` behavior)
- `clip_image_processor.onnx` accepts raw image bytes and returns `(1, 3, 224, 224)`
- ONNX outputs match the HF transformers reference with cosine similarity of at least 0.999 (the script's pass threshold) for both text and image
- The text tower is independent of `pixel_values` (Typesense's text path uses dummy 0.5 pixels)
- The image tower is independent of `input_ids` (Typesense's image path uses dummy `[[0]]` ids)

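The auto-detect rule in the third bullet can be illustrated with a small shape filter. This is a sketch of the rule as `verify_onnx.py` models it, not Typesense's own code: `pick_embedding_output` is a hypothetical name, and the output order and dimension names in `example_outputs` are assumed for illustration.

```python
def pick_embedding_output(outputs):
    """Return the first output shaped [dynamic, N] with a fixed N > 0, else None."""
    for name, shape in outputs:
        if len(shape) != 2:
            continue
        batch_dim, width = shape
        # A symbolic name, None, or -1 marks a dynamic dimension.
        batch_is_dynamic = not isinstance(batch_dim, int) or batch_dim < 0
        width_is_fixed = isinstance(width, int) and width > 0
        if batch_is_dynamic and width_is_fixed:
            return name
    return None


# Assumed output list for a CLIP export; both logits outputs are fully dynamic.
example_outputs = [
    ("logits_per_image", ["image_batch_size", "text_batch_size"]),
    ("logits_per_text", ["text_batch_size", "image_batch_size"]),
    ("text_embeds", ["text_batch_size", 512]),
    ("image_embeds", ["image_batch_size", 512]),
]

chosen = pick_embedding_output(example_outputs)
print("auto-detect would pick:", chosen)
```

With this ordering the two `logits_per_*` outputs fail the fixed-width test, so `text_embeds` is the first match.
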
**Live Docker test (reproduce.sh)** confirms:
- Typesense loads `fashion-clip-vit-b-p32` from the local models dir without errors
- The CLIP tokenizer + image processor work end to end
- Top result for `"a red dress"` is the red cocktail dress
- Top result for `"comfortable shoes"` is the white sneakers
- Image-to-image search with `red-dress.jpg` puts the red dress first

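The text searches that `reproduce.sh` sends to `/multi_search` can also be built and checked from Python. The sketch below only constructs the request body and reads a response; it makes no network call. `text_search_payload` and `top_hit_title` are hypothetical helpers, and `sample_response` is hand-written with made-up distances to mirror the fields the script's `jq` filter reads.

```python
import json


def text_search_payload(query, collection="fashion-items", per_page=3):
    """Build the multi_search body for a text query against the embedding field."""
    return {
        "searches": [
            {
                "collection": collection,
                "q": query,
                "query_by": "embedding",
                "prefix": False,
                "include_fields": "id,title",
                "per_page": per_page,
            }
        ]
    }


def top_hit_title(response):
    """Title of the first hit of the first search, or None when there are no hits."""
    hits = response["results"][0]["hits"]
    return hits[0]["document"]["title"] if hits else None


body = json.dumps(text_search_payload("a red dress"))

# Illustrative response only; real distances depend on the indexed images.
sample_response = {
    "results": [
        {
            "hits": [
                {"document": {"id": "1", "title": "Red cocktail dress"}, "vector_distance": 0.70},
                {"document": {"id": "5", "title": "Knit wool sweater"}, "vector_distance": 0.80},
            ]
        }
    ]
}

print(body)
print("top hit:", top_hit_title(sample_response))
```
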
## Upload to HuggingFace (via PR)

We don't have direct write access to `typesense/models-moved`, so we open a community PR. The `hf` CLI (huggingface_hub >= 0.34) is the supported tool.

```bash
hf auth login   # paste a token from https://huggingface.co/settings/tokens
hf upload typesense/models-moved \
  ./out-staging/fashion-clip-vit-b-p32 \
  fashion-clip-vit-b-p32 \
  --repo-type model \
  --create-pr \
  --commit-message "Add fashion-clip-vit-b-p32 (ONNX of patrickjohncyh/fashion-clip)" \
  --commit-description "FashionCLIP fine-tuned on Farfetch; CLIP-ViT-B/32 architecture so clip_tokenizer.onnx, vocab.txt and clip_image_processor.onnx are reused byte-for-byte from clip-vit-b-p32."
```

The PR appears under https://huggingface.co/typesense/models-moved/discussions. A repo maintainer reviews and merges it.

After the merge, Typesense Cloud users reference the model in an embedding field like this:

```json
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
    "from": ["image"],
    "model_config": {"model_name": "ts/fashion-clip-vit-b-p32"}
  }
}
```

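The only difference between the local test and the hosted setup is the `model_name` string. As a rough sketch, the full collection schema from `reproduce.sh` can be generated for either case; `fashion_items_schema`, `LOCAL_MODEL`, and `HOSTED_MODEL` are hypothetical names introduced here for illustration.

```python
import json

LOCAL_MODEL = "fashion-clip-vit-b-p32"      # folder name under the data dir's models/
HOSTED_MODEL = "ts/fashion-clip-vit-b-p32"  # name to use once the bundle is merged


def fashion_items_schema(model_name):
    """Collection schema with an image field and an embedding derived from it."""
    return {
        "name": "fashion-items",
        "fields": [
            {"name": "title", "type": "string"},
            {"name": "image", "type": "image", "store": False},
            {
                "name": "embedding",
                "type": "float[]",
                "embed": {
                    "from": ["image"],
                    "model_config": {"model_name": model_name},
                },
            },
        ],
    }


schema_json = json.dumps(fashion_items_schema(HOSTED_MODEL), indent=2)
print(schema_json)
```
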
## License notes

- FashionCLIP is released under MIT (per the upstream repo card). The ONNX export inherits that license. The HF README on `typesense/models-moved` should preserve the citation:
  > Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Goncalves, D., Greco, C., Tagliabue, J. *Contrastive language and vision learning of general fashion concepts*, Scientific Reports 12, 18958 (2022).
fashion-clip-vit-b-p32/_reproducer/reproduce.sh
ADDED
@@ -0,0 +1,300 @@
#!/bin/bash
# Reproducer: verify a self-exported FashionCLIP ONNX model works end to end in
# a local Typesense container before pushing it to typesense/models-moved.
#
# What this script does:
#   1. Spins up a Typesense Docker container on port 8108.
#   2. Copies ./out-staging/fashion-clip-vit-b-p32/ into the mounted data dir's models/ folder
#      so Typesense treats it as a LOCAL model (no namespace prefix, no download).
#   3. Creates a collection with an image embedding field that points at the model.
#   4. Indexes a tiny set of fashion images.
#   5. Runs a text>image search ("a red dress") and an image>image search to
#      confirm the model returns sensible top results.
#
# Notes:
#   - This validates the same load path that Typesense uses for ts/clip-vit-b-p32.
#   - Once the files are uploaded to https://huggingface.co/typesense/models-moved
#     under fashion-clip-vit-b-p32/, Typesense Cloud users can reference it as
#     model_name: "ts/fashion-clip-vit-b-p32". The local-mode test here is
#     functionally equivalent: same code path, same config, same I/O.

set -e

# On Git Bash / MSYS2 (Windows), `/data` gets auto-translated to
# `C:/Program Files/Git/data` which breaks Docker volume mounts and CLI args.
# Disable that translation for the whole script.
export MSYS_NO_PATHCONV=1
export MSYS2_ARG_CONV_EXCL='*'

### Configuration ##############################################################

TYPESENSE_API_KEY=xyz
PORT=8108
TYPESENSE_HOST=http://localhost:${PORT}
CONTAINER_NAME=typesense-fashion-clip-onnx
TYPESENSE_VERSION=29.0
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
STAGING_DIR="${SCRIPT_DIR}/out-staging/fashion-clip-vit-b-p32"
DATA_DIR="${SCRIPT_DIR}/typesense-data-${CONTAINER_NAME}"

if [ ! -f "${STAGING_DIR}/model.onnx" ]; then
  echo "ERROR: staging dir missing model.onnx; run the export step first (export.py)"
  exit 1
fi

# Python image helper below uses this.
export SCRIPT_DIR

### Cleanup ####################################################################

cleanup() {
  echo ""
  echo "=== Cleanup ==="
  docker stop ${CONTAINER_NAME} 2>/dev/null || true
  docker rm ${CONTAINER_NAME} 2>/dev/null || true
  # Container ran as root and wrote into the mounted data dir; on Windows the
  # host user can't `rm` those files. Use a throwaway root container to wipe
  # the dir contents from inside the same UID, then drop the now-empty dir.
  if [ -d "${DATA_DIR}" ]; then
    if command -v cygpath > /dev/null 2>&1; then
      _w=$(cygpath -w "${DATA_DIR}")
    else
      _w="${DATA_DIR}"
    fi
    docker run --rm -v "${_w}:/data" alpine:3 sh -c "rm -rf /data/* /data/.??*" 2>/dev/null || true
    rmdir "${DATA_DIR}" 2>/dev/null || true
  fi
  echo "Cleanup complete"
}
trap cleanup EXIT

### Setup Typesense ############################################################

echo "=== Setting up Typesense ${TYPESENSE_VERSION} ==="
docker stop ${CONTAINER_NAME} 2>/dev/null || true
docker rm ${CONTAINER_NAME} 2>/dev/null || true

mkdir -p "${DATA_DIR}/models"
# Place the model files under the Typesense data dir at models/fashion-clip-vit-b-p32/
# (no ts_ prefix => loaded as a local model rather than a public download).
cp -R "${STAGING_DIR}" "${DATA_DIR}/models/"

# Docker on Windows wants a Windows-style host path for -v mounts.
if command -v cygpath > /dev/null 2>&1; then
  WIN_DATA_DIR=$(cygpath -w "${DATA_DIR}")
else
  WIN_DATA_DIR="${DATA_DIR}"
fi

docker run -d \
  --name ${CONTAINER_NAME} \
  -p ${PORT}:8108 \
  -v "${WIN_DATA_DIR}:/data" \
  typesense/typesense:${TYPESENSE_VERSION} \
  --data-dir=/data \
  --api-key=${TYPESENSE_API_KEY} \
  --enable-cors

echo "Waiting for Typesense to be ready..."
# /health returns {"ok":true} only once Raft has elected a leader. Poll for the
# JSON body rather than just the exit code so we don't race with leader election.
for _ in $(seq 1 60); do
  body=$(curl -s "${TYPESENSE_HOST}/health" 2>/dev/null || true)
  if echo "$body" | grep -q '"ok":true'; then break; fi
  sleep 1
done
# Belt and suspenders: also wait until /debug reports "state":1 (this node is the
# Raft leader and can accept writes). Up to v29, /health alone can lie under fast restart.
# Note: both loops simply time out; the script continues even if readiness was never seen.
for _ in $(seq 1 30); do
  debug_body=$(curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "${TYPESENSE_HOST}/debug" 2>/dev/null || true)
  if echo "$debug_body" | grep -q '"state":1'; then break; fi
  sleep 1
done
echo "Typesense is ready"

wait_for_collection() {
  local collection=$1
  local max_wait=${2:-30}
  local count=0
  while [ $count -lt $max_wait ]; do
    if curl -s -o /dev/null -w "%{http_code}" "${TYPESENSE_HOST}/collections/${collection}" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null | grep -q "200"; then
      return 0
    fi
    sleep 1
    count=$((count + 1))
  done
  echo "WARNING: Collection '${collection}' not ready after ${max_wait}s"
  return 1
}

### Create collection ##########################################################

echo ""
echo "=== Creating fashion-items collection ==="

CREATE_RESP=$(curl -s "${TYPESENSE_HOST}/collections" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fashion-items",
    "fields": [
      {"name": "title", "type": "string"},
      {"name": "image", "type": "image", "store": false},
      {
        "name": "embedding",
        "type": "float[]",
        "embed": {
          "from": ["image"],
          "model_config": {"model_name": "fashion-clip-vit-b-p32"}
        }
      }
    ]
  }')
echo "create response: $CREATE_RESP"
if ! echo "$CREATE_RESP" | jq -e '.name' > /dev/null; then
  echo ""
  echo "Collection creation FAILED. Recent Typesense logs:"
  docker logs --tail 80 ${CONTAINER_NAME}
  exit 1
fi

wait_for_collection "fashion-items"

### Index a small fashion dataset ##############################################

echo ""
echo "=== Preparing sample fashion images ==="
mkdir -p "${SCRIPT_DIR}/sample-images"
# Python on Windows can't resolve /c/Users style paths, convert to C:\... form.
if command -v cygpath > /dev/null 2>&1; then
  WIN_SCRIPT_DIR=$(cygpath -w "${SCRIPT_DIR}")
else
  WIN_SCRIPT_DIR="${SCRIPT_DIR}"
fi
export SCRIPT_DIR="${WIN_SCRIPT_DIR}"
# Use the project venv's Python to fetch real photos (urllib with a sane User-Agent),
# falling back to solid-color Pillow placeholders that still exercise the embedder.
PY="${SCRIPT_DIR}/.venv/Scripts/python.exe"
"$PY" - <<'PYEOF'
import os, urllib.request
from PIL import Image, ImageDraw

OUT = os.path.join(os.environ['SCRIPT_DIR'], 'sample-images')
os.makedirs(OUT, exist_ok=True)
ITEMS = [
    ('red-dress',       (200,  20,  40), 'https://images.pexels.com/photos/985635/pexels-photo-985635.jpeg?w=400'),
    ('blue-jeans',      ( 30,  60, 140), 'https://images.pexels.com/photos/52518/jeans-pants-blue-pocket-52518.jpeg?w=400'),
    ('white-sneakers',  (240, 240, 240), 'https://images.pexels.com/photos/1102776/pexels-photo-1102776.jpeg?w=400'),
    ('leather-handbag', (110,  60,  30), 'https://images.pexels.com/photos/1152077/pexels-photo-1152077.jpeg?w=400'),
    ('wool-sweater',    (180, 140,  90), 'https://images.pexels.com/photos/1721934/pexels-photo-1721934.jpeg?w=400'),
]
for name, color, url in ITEMS:
    path = os.path.join(OUT, name + '.jpg')
    # Skip images already cached by a previous run.
    if os.path.exists(path) and os.path.getsize(path) > 1024:
        continue
    try:
        req = urllib.request.Request(url, headers={'User-Agent':'Mozilla/5.0'})
        with urllib.request.urlopen(req, timeout=20) as r:
            data = r.read()
        with open(path,'wb') as f:
            f.write(data)
        print(f'  fetched {name}: {len(data)} bytes')
    except Exception as e:
        # Fetch failed: write a solid-color placeholder labeled with the item name.
        img = Image.new('RGB', (400, 400), color)
        d = ImageDraw.Draw(img)
        d.text((10, 10), name, fill=(255,255,255) if sum(color)<400 else (0,0,0))
        img.save(path, 'JPEG', quality=85)
        print(f'  placeholder {name} (fetch failed: {e})')
PYEOF

echo ""
echo "=== Indexing fashion items ==="
# Emit one JSONL document with the image inlined as base64.
build_doc() {
  local id=$1 title=$2 file=$3
  local b64
  b64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/${file}.jpg")
  printf '{"id":"%s","title":"%s","image":"%s"}\n' "$id" "$title" "$b64"
}
{
  build_doc 1 "Red cocktail dress" red-dress
  build_doc 2 "Blue denim jeans" blue-jeans
  build_doc 3 "White leather sneakers" white-sneakers
  build_doc 4 "Brown leather handbag" leather-handbag
  build_doc 5 "Knit wool sweater" wool-sweater
} | curl -s "${TYPESENSE_HOST}/collections/fashion-items/documents/import?action=create" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: text/plain" \
  --data-binary @-

echo ""

### Text>image search ##########################################################

echo "=== Text>image search: 'a red dress' ==="
curl -s "${TYPESENSE_HOST}/multi_search" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
    "searches": [
      {
        "collection": "fashion-items",
        "q": "a red dress",
        "query_by": "embedding",
        "prefix": false,
        "include_fields": "id,title",
        "per_page": 3
      }
    ]
  }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'

echo ""
echo "=== Text>image search: 'comfortable shoes' ==="
curl -s "${TYPESENSE_HOST}/multi_search" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
    "searches": [
      {
        "collection": "fashion-items",
        "q": "comfortable shoes",
        "query_by": "embedding",
        "prefix": false,
        "include_fields": "id,title",
        "per_page": 3
      }
    ]
  }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'

### Image>image search #########################################################

echo ""
echo "=== Image>image search using red-dress.jpg ==="
QUERY_IMG_B64=$(base64 -w0 "${SCRIPT_DIR}/sample-images/red-dress.jpg")
curl -s "${TYPESENSE_HOST}/multi_search" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg img "$QUERY_IMG_B64" '{
    searches: [{
      collection: "fashion-items",
      q: "*",
      vector_query: ("embedding:([], queries: [\"" + $img + "\"], k: 3)"),
      include_fields: "id,title"
    }]
  }')" | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'

### Container logs #############################################################

echo ""
echo "=== Typesense Logs (last 40 lines) ==="
docker logs --tail 40 ${CONTAINER_NAME}

echo ""
echo "=== VERIFICATION ==="
echo "Expect text query 'a red dress' to put id=1 (Red cocktail dress) first."
echo "Expect text query 'comfortable shoes' to put id=3 (White leather sneakers) first."
echo "Expect image>image query with red-dress.jpg to put id=1 first."
fashion-clip-vit-b-p32/_reproducer/verify_onnx.py
ADDED
@@ -0,0 +1,199 @@
"""
End-to-end sanity test for the FashionCLIP ONNX export.

Replicates the exact I/O contract Typesense uses:
  - model.onnx text path: input_ids (int64) + pixel_values (float32 [1,3,224,224] dummy 0.5) + attention_mask
    -> read 2D [-1, 512] output (text_embeds)
  - model.onnx image path: input_ids dummy + pixel_values + attention_mask
    -> read image_embeds [B, 512]
  - clip_image_processor.onnx: image bytes (uint8 [N]) -> last_hidden_state [1,3,224,224]

Also cross-checks ONNX outputs against the original HF transformers FashionCLIP forward pass
to confirm the export preserves semantics (cosine similarity >= 0.999).
"""

import io
import os
import sys
import hashlib

import numpy as np
import onnxruntime as ort
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor
import torch

STAGING = os.path.join(os.path.dirname(__file__), "out-staging", "fashion-clip-vit-b-p32")
SRC = os.path.join(os.path.dirname(__file__), "fashion-clip-src")
MODEL_ONNX = os.path.join(STAGING, "model.onnx")
PROCESSOR_ONNX = os.path.join(STAGING, "clip_image_processor.onnx")

SAMPLE_IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
SAMPLE_TEXTS = ["a photo of a cat", "a red dress", "blue denim jeans"]


def fail(msg):
    print(f"FAIL: {msg}")
    sys.exit(1)


def passed(msg):
    print(f"OK  : {msg}")


def get_sample_image_bytes():
    # Download the sample image once and reuse the cached copy on later runs.
    cache = os.path.join(os.path.dirname(__file__), "sample.jpg")
    if not os.path.exists(cache):
        r = requests.get(SAMPLE_IMAGE_URL, timeout=30)
        r.raise_for_status()
        with open(cache, "wb") as f:
            f.write(r.content)
    with open(cache, "rb") as f:
        return f.read()


def main():
    print("=== Inspecting model.onnx ===")
    sess = ort.InferenceSession(MODEL_ONNX, providers=["CPUExecutionProvider"])
    in_names = [i.name for i in sess.get_inputs()]
    out_names = [o.name for o in sess.get_outputs()]
    print("inputs :", in_names)
    print("outputs:", out_names)
    assert "input_ids" in in_names, "missing input_ids"
    assert "pixel_values" in in_names, "missing pixel_values"
    assert "attention_mask" in in_names, "missing attention_mask"
    assert "text_embeds" in out_names, "missing text_embeds"
    assert "image_embeds" in out_names, "missing image_embeds"
    # Typesense scans outputs and picks the first 2D [-1, N>0] one as the embedding output.
    # logits_per_* have both dims dynamic, so they get skipped; text_embeds should be the pick.
    typesense_chosen = None
    for o in sess.get_outputs():
        shp = o.shape
        if len(shp) == 2 and shp[0] in (None, "text_batch_size", "image_batch_size", -1) and isinstance(shp[1], int) and shp[1] > 0:
            typesense_chosen = (o.name, shp)
            break
    print("typesense would pick:", typesense_chosen)
    # Guard against no match at all, which would otherwise raise a TypeError on indexing None.
    if typesense_chosen is None or typesense_chosen[0] != "text_embeds":
        fail(f"expected text_embeds to be the first 2D embedding output, got {typesense_chosen}")
    passed("model.onnx I/O matches Typesense expectations")

    print()
    print("=== Inspecting clip_image_processor.onnx ===")
    from onnxruntime_extensions import get_library_path
    proc_opts = ort.SessionOptions()
    proc_opts.register_custom_ops_library(get_library_path())
    proc_sess = ort.InferenceSession(PROCESSOR_ONNX, sess_options=proc_opts, providers=["CPUExecutionProvider"])
    pin = [(i.name, i.type, i.shape) for i in proc_sess.get_inputs()]
    pout = [(o.name, o.type, o.shape) for o in proc_sess.get_outputs()]
    print("processor inputs :", pin)
    print("processor outputs:", pout)
    assert pin[0][0] == "image", "processor expects input named 'image'"
    assert pout[0][0] == "last_hidden_state", "processor expects output 'last_hidden_state'"
    passed("clip_image_processor.onnx I/O matches Typesense expectations")

    print()
    print("=== Running image processor on a real image ===")
    img_bytes = get_sample_image_bytes()
    img_arr = np.frombuffer(img_bytes, dtype=np.uint8)
    proc_out = proc_sess.run(["last_hidden_state"], {"image": img_arr})[0]
    print("processor output shape:", proc_out.shape, "dtype:", proc_out.dtype)
    if proc_out.shape != (1, 3, 224, 224):
        fail(f"expected (1,3,224,224), got {proc_out.shape}")
    passed("image processor returns (1,3,224,224)")

    print()
    print("=== ONNX vs HF transformers parity check ===")
    # Build reference using the original transformers model.
    hf_model = CLIPModel.from_pretrained(SRC).eval()
    hf_processor = CLIPProcessor.from_pretrained(SRC)
    pil = Image.open(io.BytesIO(img_bytes)).convert("RGB")

    hf_inputs = hf_processor(text=SAMPLE_TEXTS, images=pil, return_tensors="pt", padding=True)
    with torch.no_grad():
        hf_out = hf_model(**hf_inputs)
    hf_text = hf_out.text_embeds.numpy()
    hf_image = hf_out.image_embeds.numpy()

    # Now run the same inputs through ONNX model.onnx.
    # Note: feed the ONNX model the same pixel_values the HF model saw.
    pixel_values = hf_inputs["pixel_values"].numpy()
    input_ids = hf_inputs["input_ids"].numpy().astype(np.int64)
    attention_mask = hf_inputs["attention_mask"].numpy().astype(np.int64)

    onnx_out = sess.run(
        ["text_embeds", "image_embeds"],
        {
            "input_ids": input_ids,
            "pixel_values": pixel_values,
            "attention_mask": attention_mask,
        },
    )
    onnx_text, onnx_image = onnx_out

    def cosine(a, b):
        return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    for i, txt in enumerate(SAMPLE_TEXTS):
        c = cosine(hf_text[i], onnx_text[i])
        print(f"  text[{i}] '{txt}': cosine(HF, ONNX) = {c:.6f}")
        if c < 0.999:
            fail(f"text_embeds parity too low: {c}")
    c_img = cosine(hf_image[0], onnx_image[0])
    print(f"  image: cosine(HF, ONNX) = {c_img:.6f}")
    if c_img < 0.999:
        fail(f"image_embeds parity too low: {c_img}")
    passed("ONNX text+image embeddings match HF reference (cosine >= 0.999)")

    print()
    print("=== Typesense-style text path (dummy pixel_values = 0.5) ===")
    # Typesense's text_embedder fills pixel_values with 0.5 for text-only queries.
    # The text_embeds output should not depend on pixel_values (the towers are independent),
    # so this should match the real-image text embedding.
    dummy_pixels = np.full((1, 3, 224, 224), 0.5, dtype=np.float32)
    typesense_text = sess.run(
        ["text_embeds"],
        {
            "input_ids": input_ids[:1],
            "attention_mask": attention_mask[:1],
            "pixel_values": dummy_pixels,
        },
    )[0]
    c = cosine(typesense_text[0], onnx_text[0])
    print(f"  typesense text path vs onnx text path: cosine = {c:.6f}")
    if c < 0.9999:
        fail(f"text path differs when pixel_values change - towers shouldn't be coupled: {c}")
    passed("text_embeds is independent of pixel_values (Typesense text path safe)")

    print()
    print("=== Typesense-style image embed path (dummy input_ids) ===")
    # image_embedder.cpp passes input_ids shape [1,1] = [[0]] alongside pixel_values.
    dummy_ids = np.array([[0]], dtype=np.int64)
    dummy_mask = np.array([[1]], dtype=np.int64)
    typesense_image = sess.run(
        ["image_embeds"],
        {
            "input_ids": dummy_ids,
            "pixel_values": pixel_values[:1],
            "attention_mask": dummy_mask,
        },
    )[0]
    c = cosine(typesense_image[0], onnx_image[0])
    print(f"  typesense image path vs reference: cosine = {c:.6f}")
    if c < 0.9999:
        fail(f"image_embeds differs when input_ids change - towers shouldn't be coupled: {c}")
    passed("image_embeds is independent of input_ids (Typesense image path safe)")

    print()
    print("=== Computing MD5 of model.onnx for config.json ===")
    h = hashlib.md5()
    with open(MODEL_ONNX, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print("model.onnx md5:", h.hexdigest())

    print()
    print("ALL CHECKS PASSED")


if __name__ == "__main__":
    main()