Add fashion-clip-vit-b-p32 (ONNX of patrickjohncyh/fashion-clip)

#16

by alangmartini - opened May 20

base: refs/heads/main ← from: refs/pr/16

+262220

-0

Add fashion-clip-vit-b-p32 (ONNX of patrickjohncyh/fashion-clip) abea3dab

alangmartini

May 20

•

edited May 20

FashionCLIP fine-tuned on the Farfetch fashion catalog. Same CLIP-ViT-B/32 architecture as clip-vit-b-p32, so clip_image_processor.onnx and vocab.txt are reused byte-for-byte. Verified end-to-end against local Typesense 29.0 and 30.2 containers: model loads, text-image and image-image search return correct top results. Distances are bit-exact across the two versions.

What's in the bundle

fashion-clip-vit-b-p32/
  model.onnx                   605 MB  new
  clip_tokenizer.onnx          1.3 MB  reused from clip-vit-b-p32
  clip_image_processor.onnx    4 KB    reused from clip-vit-b-p32
  vocab.txt                    3.0 MB  reused (CLIP BPE merges)
  config.json                  Typesense metadata + MD5 checksums
  README.md                    HF model card with citation

Source model: https://huggingface.co/patrickjohncyh/fashion-clip (MIT license, preserved on the HF README inside the bundle).
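
The byte-for-byte reuse claim can be checked by hashing the shared files in both bundles. A minimal sketch, assuming both model directories are available locally; the helper names `md5_of` and `reused_files_match` and the directory arguments are hypothetical and not part of the bundle or of Typesense.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reused_files_match(base_dir, new_dir, names):
    """Map each file name to True when both copies have the same MD5 digest."""
    return {
        name: md5_of(os.path.join(base_dir, name)) == md5_of(os.path.join(new_dir, name))
        for name in names
    }
```

For example, `reused_files_match("clip-vit-b-p32", "fashion-clip-vit-b-p32", ["clip_image_processor.onnx", "vocab.txt"])` should return `True` for every file if the copies are identical (the directory paths here are placeholders).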

Verification

Two scripts were used to validate the bundle before opening this PR. They are inline below (not part of the model directory) so a reviewer can reproduce the results without trusting my word.

Expected output

=== Typesense 29.0 / 30.2 ===
text 'a red dress'        -> id=1 Red cocktail dress      (vector_distance 0.8147)
text 'comfortable shoes'  -> id=3 White leather sneakers  (vector_distance 0.7032)
image red-dress.jpg       -> id=1, id=2, id=5

ONNX vs HF transformers parity: cosine = 1.000000 across all texts and the image.
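
For reading the numbers above: assuming the collection uses Typesense's default cosine metric, `vector_distance` is one minus the cosine similarity of the two embeddings, so 0 means the vectors point the same way and larger values mean less similar. The sketch below illustrates that relationship on toy vectors; `cosine_distance` is a hypothetical helper, and the vectors are made up rather than real FashionCLIP embeddings.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two 1-D vectors (range 0 to 2)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - similarity


# Toy vectors only: same direction, orthogonal, and opposite direction.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Under that assumption, a cosine of 1.000000 between the HF and ONNX embeddings corresponds to a distance of 0 between them.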

verify_onnx.py - ONNX runtime parity check vs HF transformers reference

Loads model.onnx and clip_image_processor.onnx, validates Typesense's exact I/O contract (input/output tensor names, shapes, independence of text vs image towers), and confirms the ONNX outputs match the HF transformers FashionCLIP forward pass at cosine = 1.0 for both text and image. Doesn't require Docker.

"""
End-to-end sanity test for the FashionCLIP ONNX export.

Replicates the exact I/O contract Typesense uses:
- model.onnx text path: input_ids (int64) + pixel_values (float32 [1,3,224,224] dummy 0.5) + attention_mask
                        -> read 2D [-1, 512] output (text_embeds)
- model.onnx image path: input_ids dummy + pixel_values + attention_mask
                         -> read image_embeds [B, 512]
- clip_image_processor.onnx: image bytes (uint8 [N]) -> last_hidden_state [1,3,224,224]

Also cross-checks ONNX outputs against the original HF transformers FashionCLIP forward pass
to confirm the export preserves semantics (cosine similarity > 0.999).
"""

import io
import os
import sys
import hashlib

import numpy as np
import onnxruntime as ort
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor
import torch

STAGING = os.path.join(os.path.dirname(__file__), "out-staging", "fashion-clip-vit-b-p32")
SRC = os.path.join(os.path.dirname(__file__), "fashion-clip-src")
MODEL_ONNX = os.path.join(STAGING, "model.onnx")
PROCESSOR_ONNX = os.path.join(STAGING, "clip_image_processor.onnx")

SAMPLE_IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
SAMPLE_TEXTS = ["a photo of a cat", "a red dress", "blue denim jeans"]


def fail(msg):
    print(f"FAIL: {msg}")
    sys.exit(1)


def passed(msg):
    print(f"OK  : {msg}")


def get_sample_image_bytes():
    cache = os.path.join(os.path.dirname(__file__), "sample.jpg")
    if not os.path.exists(cache):
        r = requests.get(SAMPLE_IMAGE_URL, timeout=30)
        r.raise_for_status()
        with open(cache, "wb") as f:
            f.write(r.content)
    # Read via a context manager so the file handle is closed promptly.
    with open(cache, "rb") as f:
        return f.read()


def main():
    print("=== Inspecting model.onnx ===")
    sess = ort.InferenceSession(MODEL_ONNX, providers=["CPUExecutionProvider"])
    in_names = [i.name for i in sess.get_inputs()]
    out_names = [o.name for o in sess.get_outputs()]
    print("inputs :", in_names)
    print("outputs:", out_names)
    assert "input_ids" in in_names, "missing input_ids"
    assert "pixel_values" in in_names, "missing pixel_values"
    assert "attention_mask" in in_names, "missing attention_mask"
    assert "text_embeds" in out_names, "missing text_embeds"
    assert "image_embeds" in out_names, "missing image_embeds"
    # Typesense scans outputs and picks the first 2D [-1, N>0] one as the embedding output.
    # logits_per_* have both dims dynamic, so they get skipped; text_embeds should be the pick.
    typesense_chosen = None
    for o in sess.get_outputs():
        shp = o.shape
        if len(shp) == 2 and shp[0] in (None, "text_batch_size", "image_batch_size", -1) and isinstance(shp[1], int) and shp[1] > 0:
            typesense_chosen = (o.name, shp)
            break
    print("typesense would pick:", typesense_chosen)
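    # Guard against None so a model with no matching output fails cleanly instead of raising TypeError.
    if typesense_chosen is None or typesense_chosen[0] != "text_embeds":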
        fail(f"expected text_embeds to be the first 2D embedding output, got {typesense_chosen}")
    passed("model.onnx I/O matches Typesense expectations")

    print()
    print("=== Inspecting clip_image_processor.onnx ===")
    from onnxruntime_extensions import get_library_path
    proc_opts = ort.SessionOptions()
    proc_opts.register_custom_ops_library(get_library_path())
    proc_sess = ort.InferenceSession(PROCESSOR_ONNX, sess_options=proc_opts, providers=["CPUExecutionProvider"])
    pin = [(i.name, i.type, i.shape) for i in proc_sess.get_inputs()]
    pout = [(o.name, o.type, o.shape) for o in proc_sess.get_outputs()]
    print("processor inputs :", pin)
    print("processor outputs:", pout)
    assert pin[0][0] == "image", "processor expects input named 'image'"
    assert pout[0][0] == "last_hidden_state", "processor expects output 'last_hidden_state'"
    passed("clip_image_processor.onnx I/O matches Typesense expectations")

    print()
    print("=== Running image processor on a real image ===")
    img_bytes = get_sample_image_bytes()
    img_arr = np.frombuffer(img_bytes, dtype=np.uint8)
    proc_out = proc_sess.run(["last_hidden_state"], {"image": img_arr})[0]
    print("processor output shape:", proc_out.shape, "dtype:", proc_out.dtype)
    if proc_out.shape != (1, 3, 224, 224):
        fail(f"expected (1,3,224,224), got {proc_out.shape}")
    passed("image processor returns (1,3,224,224)")

    print()
    print("=== ONNX vs HF transformers parity check ===")
    # Build reference using the original transformers model.
    hf_model = CLIPModel.from_pretrained(SRC).eval()
    hf_processor = CLIPProcessor.from_pretrained(SRC)
    pil = Image.open(io.BytesIO(img_bytes)).convert("RGB")

    hf_inputs = hf_processor(text=SAMPLE_TEXTS, images=pil, return_tensors="pt", padding=True)
    with torch.no_grad():
        hf_out = hf_model(**hf_inputs)
    hf_text = hf_out.text_embeds.numpy()
    hf_image = hf_out.image_embeds.numpy()

    # Now run the same inputs through ONNX model.onnx.
    # Note: ONNX export needs the same pixel_values as HF.
    pixel_values = hf_inputs["pixel_values"].numpy()
    input_ids = hf_inputs["input_ids"].numpy().astype(np.int64)
    attention_mask = hf_inputs["attention_mask"].numpy().astype(np.int64)

    onnx_out = sess.run(
        ["text_embeds", "image_embeds"],
        {
            "input_ids": input_ids,
            "pixel_values": pixel_values,
            "attention_mask": attention_mask,
        },
    )
    onnx_text, onnx_image = onnx_out

    def cosine(a, b):
        return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    for i, txt in enumerate(SAMPLE_TEXTS):
        c = cosine(hf_text[i], onnx_text[i])
        print(f"  text[{i}] '{txt}': cosine(HF, ONNX) = {c:.6f}")
        if c < 0.999:
            fail(f"text_embeds parity too low: {c}")
    c_img = cosine(hf_image[0], onnx_image[0])
    print(f"  image: cosine(HF, ONNX) = {c_img:.6f}")
    if c_img < 0.999:
        fail(f"image_embeds parity too low: {c_img}")
    passed("ONNX text+image embeddings match HF reference (cosine > 0.999)")

    print()
    print("=== Typesense-style text path (dummy pixel_values = 0.5) ===")
    # Typesense's text_embedder fills pixel_values with 0.5 for text-only queries.
    # The text_embeds output should not depend on pixel_values (the towers are independent),
    # so this should match the real-image text embedding.
    dummy_pixels = np.full((1, 3, 224, 224), 0.5, dtype=np.float32)
    typesense_text = sess.run(
        ["text_embeds"],
        {
            "input_ids": input_ids[:1],
            "attention_mask": attention_mask[:1],
            "pixel_values": dummy_pixels,
        },
    )[0]
    c = cosine(typesense_text[0], onnx_text[0])
    print(f"  typesense text path vs onnx text path: cosine = {c:.6f}")
    if c < 0.9999:
        fail(f"text path differs when pixel_values change - towers shouldn't be coupled: {c}")
    passed("text_embeds is independent of pixel_values (Typesense text path safe)")

    print()
    print("=== Typesense-style image embed path (dummy input_ids) ===")
    # image_embedder.cpp passes input_ids shape [1,1] = [[0]] alongside pixel_values.
    dummy_ids = np.array([[0]], dtype=np.int64)
    dummy_mask = np.array([[1]], dtype=np.int64)
    typesense_image = sess.run(
        ["image_embeds"],
        {
            "input_ids": dummy_ids,
            "pixel_values": pixel_values[:1],
            "attention_mask": dummy_mask,
        },
    )[0]
    c = cosine(typesense_image[0], onnx_image[0])
    print(f"  typesense image path vs reference: cosine = {c:.6f}")
    if c < 0.9999:
        fail(f"image_embeds differs when input_ids change - towers shouldn't be coupled: {c}")
    passed("image_embeds is independent of input_ids (Typesense image path safe)")

    print()
    print("=== Computing MD5 of model.onnx for config.json ===")
    h = hashlib.md5()
    with open(MODEL_ONNX, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print("model.onnx md5:", h.hexdigest())

    print()
    print("ALL CHECKS PASSED")


if __name__ == "__main__":
    main()
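
The output-selection rule that verify_onnx.py checks (first 2D output with a dynamic batch dimension and a fixed positive width) can be isolated as a small pure function. This is a sketch of the rule as described in the script's comments, not Typesense's actual implementation; `pick_embedding_output`, the output order in the example, and treating any string dimension as dynamic are assumptions made for illustration.

```python
def pick_embedding_output(outputs):
    """Return the first (name, shape) whose shape is [dynamic, N] with fixed N > 0.

    `outputs` is a list of (name, shape) pairs, where each shape is a list whose
    entries are ints for fixed dims and None, -1, or a string for dynamic dims.
    """
    for name, shape in outputs:
        if len(shape) != 2:
            continue
        batch, dim = shape
        # Assumption: any symbolic (string) dimension counts as dynamic.
        batch_is_dynamic = batch is None or batch == -1 or isinstance(batch, str)
        dim_is_fixed = isinstance(dim, int) and not isinstance(dim, bool) and dim > 0
        if batch_is_dynamic and dim_is_fixed:
            return name, shape
    return None


# Illustrative output list; the order and symbolic names are assumptions.
outputs = [
    ("logits_per_image", ["image_batch_size", "text_batch_size"]),
    ("logits_per_text", ["text_batch_size", "image_batch_size"]),
    ("text_embeds", ["text_batch_size", 512]),
    ("image_embeds", ["image_batch_size", 512]),
]
print(pick_embedding_output(outputs))
```

The `logits_per_*` entries are skipped because both of their dimensions are dynamic, which is why `text_embeds` is the expected pick in this example.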

reproduce.sh - live Typesense container end-to-end

Spins up a Typesense container, mounts the staged model as a local model, creates a collection with an image embedding field that points at fashion-clip-vit-b-p32, indexes 5 sample fashion items, and runs text-image and image-image searches. Tested clean on typesense/typesense:29.0 and typesense/typesense:30.2.

#!/bin/bash
# Reproducer: verify a self-exported FashionCLIP ONNX model works end-to-end in
# a local Typesense container before pushing it to typesense/models-moved.
#
# What this script does:
#   1. Spins up a Typesense Docker container on port 8108.
#   2. Mounts ./out-staging/fashion-clip-vit-b-p32/ into the container's models dir
#      so Typesense treats it as a LOCAL model (no namespace prefix, no download).
#   3. Creates a collection with an image embedding field that points at the model.
#   4. Indexes a tiny set of fashion images.
#   5. Runs a text>image search ("a red dress") and an image>image search to
#      confirm the model returns sensible top results.
#
# Notes:
#   - This validates the same load path that Typesense uses for ts/clip-vit-b-p32.
#   - Once the files are uploaded to https://huggingface.co/typesense/models-moved
#     under fashion-clip-vit-b-p32/, Typesense Cloud users can reference it as
#     model_name: "ts/fashion-clip-vit-b-p32". The local-mode test here is
#     functionally equivalent: same code path, same config, same I/O.

set -e

# On Git Bash / MSYS2 (Windows), `/data` gets auto-translated to
# `C:/Program Files/Git/data` which breaks Docker volume mounts and CLI args.
# Disable that translation for the whole script.
export MSYS_NO_PATHCONV=1
export MSYS2_ARG_CONV_EXCL='*'

### Configuration ##############################################################

TYPESENSE_API_KEY=xyz
PORT=8108
TYPESENSE_HOST=http://localhost:${PORT}
CONTAINER_NAME=typesense-fashion-clip-onnx
TYPESENSE_VERSION=29.0
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
STAGING_DIR="${SCRIPT_DIR}/out-staging/fashion-clip-vit-b-p32"
DATA_DIR="${SCRIPT_DIR}/typesense-data-${CONTAINER_NAME}"

if [ ! -f "${STAGING_DIR}/model.onnx" ]; then
  echo "ERROR: ${STAGING_DIR}/model.onnx not found; export the model into the staging dir first"
  exit 1
fi

# Python image helper below uses this.
export SCRIPT_DIR

### Cleanup ####################################################################

cleanup() {
  echo ""
  echo "=== Cleanup ==="
  docker stop ${CONTAINER_NAME} 2>/dev/null || true
  docker rm ${CONTAINER_NAME} 2>/dev/null || true
  # Container ran as root and wrote into the mounted data dir; on Windows the
  # host user can't `rm` those files. Use a throwaway root container to wipe
  # the dir contents from inside the same UID, then drop the now empty dir.
  if [ -d "${DATA_DIR}" ]; then
    if command -v cygpath > /dev/null 2>&1; then
      _w=$(cygpath -w "${DATA_DIR}")
    else
      _w="${DATA_DIR}"
    fi
    docker run --rm -v "${_w}:/data" alpine:3 sh -c "rm -rf /data/* /data/.??*" 2>/dev/null || true
    rmdir "${DATA_DIR}" 2>/dev/null || true
  fi
  echo "Cleanup complete"
}
trap cleanup EXIT

### Setup Typesense ############################################################

echo "=== Setting up Typesense ${TYPESENSE_VERSION} ==="
docker stop ${CONTAINER_NAME} 2>/dev/null || true
docker rm ${CONTAINER_NAME} 2>/dev/null || true

mkdir -p "${DATA_DIR}/models"
# Place the model files under the Typesense data dir at models/fashion-clip-vit-b-p32/
# (no ts_ prefix => loaded as a local model rather than a public download).
cp -R "${STAGING_DIR}" "${DATA_DIR}/models/"

# Docker on Windows wants a Windows-style host path for -v mounts.
if command -v cygpath > /dev/null 2>&1; then
  WIN_DATA_DIR=$(cygpath -w "${DATA_DIR}")
else
  WIN_DATA_DIR="${DATA_DIR}"
fi

docker run -d \
  --name ${CONTAINER_NAME} \
  -p ${PORT}:8108 \
  -v "${WIN_DATA_DIR}:/data" \
  typesense/typesense:${TYPESENSE_VERSION} \
  --data-dir=/data \
  --api-key=${TYPESENSE_API_KEY} \
  --enable-cors

echo "Waiting for Typesense to be ready..."
# /health returns {"ok":true} only once Raft has elected a leader. Poll for the
# JSON body rather than just exit code so we don't race with leader election.
health_ok=0
for _ in $(seq 1 60); do
  body=$(curl -s "${TYPESENSE_HOST}/health" 2>/dev/null || true)
  if echo "$body" | grep -q '"ok":true'; then health_ok=1; break; fi
  sleep 1
done
# Belt and suspenders: also confirm /debug reports "state":1 (means Raft is ready
# enough to accept writes). Up to v29, /health alone can lie under fast restart.
debug_ok=0
for _ in $(seq 1 30); do
  debug_body=$(curl -s -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "${TYPESENSE_HOST}/debug" 2>/dev/null || true)
  if echo "$debug_body" | grep -q '"state":1'; then debug_ok=1; break; fi
  sleep 1
done
# Fail loudly instead of claiming readiness after a timeout.
if [ "$health_ok" -ne 1 ] || [ "$debug_ok" -ne 1 ]; then
  echo "ERROR: Typesense did not become ready in time. Recent logs:"
  docker logs --tail 40 ${CONTAINER_NAME}
  exit 1
fi
echo "Typesense is ready"

wait_for_collection() {
  local collection=$1
  local max_wait=${2:-30}
  local count=0
  while [ $count -lt $max_wait ]; do
    if curl -s -o /dev/null -w "%{http_code}" "${TYPESENSE_HOST}/collections/${collection}" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null | grep -q "200"; then
      return 0
    fi
    sleep 1
    count=$((count + 1))
  done
  echo "WARNING: Collection '${collection}' not ready after ${max_wait}s"
  return 1
}

### Create collection ##########################################################

echo ""
echo "=== Creating fashion-items collection ==="

CREATE_RESP=$(curl -s "${TYPESENSE_HOST}/collections" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fashion-items",
    "fields": [
      {"name": "title",  "type": "string"},
      {"name": "image",  "type": "image", "store": false},
      {
        "name": "embedding",
        "type": "float[]",
        "embed": {
          "from": ["image"],
          "model_config": {"model_name": "fashion-clip-vit-b-p32"}
        }
      }
    ]
  }')
echo "create response: $CREATE_RESP"
if ! echo "$CREATE_RESP" | jq -e '.name' > /dev/null; then
  echo ""
  echo "Collection creation FAILED. Recent Typesense logs:"
  docker logs --tail 80 ${CONTAINER_NAME}
  exit 1
fi

wait_for_collection "fashion-items"

### Index a small fashion dataset ##############################################

echo ""
echo "=== Preparing sample fashion images ==="
mkdir -p "${SCRIPT_DIR}/sample-images"
# Python on Windows can't resolve /c/Users style paths, convert to C:\... form.
if command -v cygpath > /dev/null 2>&1; then
  WIN_SCRIPT_DIR=$(cygpath -w "${SCRIPT_DIR}")
else
  WIN_SCRIPT_DIR="${SCRIPT_DIR}"
fi
export SCRIPT_DIR="${WIN_SCRIPT_DIR}"
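# Use the project venv's Python to fetch real photos (urllib with a sane
# User-Agent), falling back to solid-color Pillow placeholders that still
# exercise the embedder.
PY="${SCRIPT_DIR}/.venv/Scripts/python.exe"
# Outside Windows the venv layout is .venv/bin/python; fall back to that, then to python3.
[ -x "$PY" ] || PY="${SCRIPT_DIR}/.venv/bin/python"
[ -x "$PY" ] || PY=python3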
"$PY" - <<'PYEOF'
import os, urllib.request
from PIL import Image, ImageDraw

OUT = os.path.join(os.environ['SCRIPT_DIR'], 'sample-images')
os.makedirs(OUT, exist_ok=True)
ITEMS = [
    ('red-dress',        (200,  20,  40), 'https://images.pexels.com/photos/985635/pexels-photo-985635.jpeg?w=400'),
    ('blue-jeans',       ( 30,  60, 140), 'https://images.pexels.com/photos/52518/jeans-pants-blue-pocket-52518.jpeg?w=400'),
    ('white-sneakers',   (240, 240, 240), 'https://images.pexels.com/photos/1102776/pexels-photo-1102776.jpeg?w=400'),
    ('leather-handbag',  (110,  60,  30), 'https://images.pexels.com/photos/1152077/pexels-photo-1152077.jpeg?w=400'),
    ('wool-sweater',     (180, 140,  90), 'https://images.pexels.com/photos/1721934/pexels-photo-1721934.jpeg?w=400'),
]
for name, color, url in ITEMS:
    path = os.path.join(OUT, name + '.jpg')
    if os.path.exists(path) and os.path.getsize(path) > 1024:
        continue
    try:
        req = urllib.request.Request(url, headers={'User-Agent':'Mozilla/5.0'})
        with urllib.request.urlopen(req, timeout=20) as r:
            data = r.read()
        with open(path,'wb') as f:
            f.write(data)
        print(f'  fetched {name}: {len(data)} bytes')
    except Exception as e:
        # placeholder
        img = Image.new('RGB', (400, 400), color)
        d = ImageDraw.Draw(img)
        d.text((10, 10), name, fill=(255,255,255) if sum(color)<400 else (0,0,0))
        img.save(path, 'JPEG', quality=85)
        print(f'  placeholder {name} (fetch failed: {e})')
PYEOF

echo ""
echo "=== Indexing fashion items ==="
build_doc() {
  local id=$1 title=$2 file=$3
  local b64
  # `base64 -w0` is GNU-only; strip newlines with tr so BSD/macOS base64 also works.
  b64=$(base64 < "${SCRIPT_DIR}/sample-images/${file}.jpg" | tr -d '\n')
  printf '{"id":"%s","title":"%s","image":"%s"}\n' "$id" "$title" "$b64"
}
{
  build_doc 1 "Red cocktail dress"        red-dress
  build_doc 2 "Blue denim jeans"          blue-jeans
  build_doc 3 "White leather sneakers"    white-sneakers
  build_doc 4 "Brown leather handbag"     leather-handbag
  build_doc 5 "Knit wool sweater"         wool-sweater
} | curl -s "${TYPESENSE_HOST}/collections/fashion-items/documents/import?action=create" \
    -X POST \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -H "Content-Type: text/plain" \
    --data-binary @-

echo ""

### Text>image search ##########################################################

echo "=== Text>image search: 'a red dress' ==="
curl -s "${TYPESENSE_HOST}/multi_search" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
    "searches": [
      {
        "collection": "fashion-items",
        "q": "a red dress",
        "query_by": "embedding",
        "prefix": false,
        "include_fields": "id,title",
        "per_page": 3
      }
    ]
  }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'

echo ""
echo "=== Text>image search: 'comfortable shoes' ==="
curl -s "${TYPESENSE_HOST}/multi_search" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
    "searches": [
      {
        "collection": "fashion-items",
        "q": "comfortable shoes",
        "query_by": "embedding",
        "prefix": false,
        "include_fields": "id,title",
        "per_page": 3
      }
    ]
  }' | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'

### Image>image search #########################################################

echo ""
echo "=== Image>image search using red-dress.jpg ==="
# `base64 -w0` is GNU-only; strip newlines with tr so BSD/macOS base64 also works.
QUERY_IMG_B64=$(base64 < "${SCRIPT_DIR}/sample-images/red-dress.jpg" | tr -d '\n')
curl -s "${TYPESENSE_HOST}/multi_search" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg img "$QUERY_IMG_B64" '{
    searches: [{
      collection: "fashion-items",
      q: "*",
      vector_query: ("embedding:([], queries: [\"" + $img + "\"], k: 3)"),
      include_fields: "id,title"
    }]
  }')" | jq '.results[0].hits[] | {id: .document.id, title: .document.title, distance: .vector_distance}'

### Container logs #############################################################

echo ""
echo "=== Typesense Logs (last 40 lines) ==="
docker logs --tail 40 ${CONTAINER_NAME}

echo ""
echo "=== VERIFICATION ==="
echo "Expect text query 'a red dress' to put id=1 (Red cocktail dress) first."
echo "Expect text query 'comfortable shoes' to put id=3 (White leather sneakers) first."
echo "Expect image>image query with red-dress.jpg to put id=1 first."
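
build_doc in the script formats each JSON line with printf, which works for the fixed titles used here but would produce invalid JSON for a title containing a double quote or backslash. A minimal Python sketch of the same JSONL line construction, letting `json.dumps` handle escaping; `build_doc_line` is a hypothetical helper, and the field names mirror the collection schema created by the script.

```python
import base64
import json


def build_doc_line(doc_id, title, image_bytes):
    """Return one JSONL line with the id, title, and base64-encoded image."""
    doc = {
        "id": str(doc_id),
        "title": title,
        # Standard base64 without line breaks, matching what the script sends.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    # json.dumps escapes quotes, backslashes, and control characters in the title.
    return json.dumps(doc)


# Placeholder bytes stand in for a real JPEG file's contents.
print(build_doc_line(1, 'Red "cocktail" dress', b"\xff\xd8\xff"))
```

Joining several such lines with newlines gives a body of the kind the script pipes to the documents/import endpoint.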

Attach reproducer + verifier scripts as proof bd847c96

Remove _reproducer; scripts will live in PR description instead a094735b

kishorenc changed pull request status to merged May 21
