# 📸 MobileCLIP-B Zero-Shot Image Classifier — HF Inference Endpoint
This repository packages Apple’s **MobileCLIP-B** model as a production-ready
Hugging Face Inference Endpoint.
* **One-shot image → class probabilities**
⚡ < 30 ms on an A10G / T4 once the image arrives.
* **Branch-fused / FP16** MobileCLIP for fast GPU inference.
* **Pre-computed text embeddings** for your custom label set
(`items.json`) — every request encodes **only** the image.
* Built with vanilla **`open-clip-torch`** (no forks) and a
60-line local helper (`reparam.py`) to fuse MobileOne blocks.
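The fusion that `reparam.py` performs rests on linearity: parallel branches whose outputs are summed collapse into one layer by adding their weights. A toy 1-D sketch of the idea (scalar weights only — real MobileOne blocks fuse conv kernels and folded batch-norm, and this is not Apple's code):

```python
def linear(x, w, b):
    return [w * xi + b for xi in x]

# Training-time: two parallel branches whose outputs are summed.
def two_branch(x, w1, b1, w2, b2):
    y1 = linear(x, w1, b1)
    y2 = linear(x, w2, b2)
    return [a + b for a, b in zip(y1, y2)]

# Inference-time: one fused layer with summed weights — identical function,
# half the work.
def fused(x, w1, b1, w2, b2):
    return linear(x, w1 + w2, b1 + b2)

x = [1, 2, 3]
print(two_branch(x, 2, 1, 3, 2))  # [8, 13, 18]
print(fused(x, 2, 1, 3, 2))       # [8, 13, 18] — same outputs
```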
---
## ✨ What’s inside
| File | Purpose |
|------|---------|
| `handler.py` | Hugging Face entry-point — loads weights, caches text features, serves requests |
| `reparam.py` | Stand-alone copy of `reparameterize_model` from Apple’s repo (removes heavy upstream dependency) |
| `requirements.txt` | Minimal, conflict-free dependency set (`torch`, `torchvision`, `open-clip-torch`) |
| `items.json` | Your label spec — each element must have `id`, `name`, and `prompt` fields |
| `README.md` | You are here |
---
## 🔧 Quick start (local smoke-test)
```bash
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python - <<'PY'
from pathlib import Path
import base64

# Load a demo image and base64-encode it
img_path = Path("tests/cat.jpg")
payload = {"image": base64.b64encode(img_path.read_bytes()).decode()}

# Local simulation — call the handler directly, exactly as the HF container will
import handler
app = handler.EndpointHandler()
print(app({"inputs": payload})[:5])  # top-5 classes
PY
```

## 🚀 Calling the deployed endpoint

```bash
export ENDPOINT_URL="https://<your-endpoint>.aws.endpoints.huggingface.cloud"
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxx"
IMG="cat.jpg"
python - "$IMG" <<'PY'
import base64, json, requests, sys, os
url = os.environ["ENDPOINT_URL"]
token = os.environ["HF_TOKEN"]
img = sys.argv[1]
payload = {
"inputs": {
"image": base64.b64encode(open(img, "rb").read()).decode()
}
}
resp = requests.post(
url,
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "application/json",
"Accept": "application/json",
},
json=payload,
timeout=60,
)
resp.raise_for_status()
print(json.dumps(resp.json()[:5], indent=2))  # top-5
PY
```

Sample response:

```json
[
  { "id": 23, "label": "cat", "score": 0.92 },
  { "id": 11, "label": "tiger cat", "score": 0.05 },
  { "id": 48, "label": "siamese cat", "score": 0.02 },
  …
]
```
## 🏗️ How the handler works (high-level)
**Startup**

- Downloads / loads the `datacompdr` MobileCLIP-B checkpoint.
- Runs `reparameterize_model` to fuse MobileOne branches.
- Reads `items.json`, tokenises all prompts, and caches the resulting text embeddings (`[n_classes, 512]`).

**Per request**

- Decodes the incoming base-64 JPEG/PNG.
- Applies the exact OpenCLIP preprocessing (224 × 224 center-crop, mean/std normalisation).
- Encodes the image, L2-normalises, and performs one `softmax(cosine)` against the cached text matrix.
- Returns a sorted JSON list `[{"id", "label", "score"}, …]`.
This design keeps bandwidth low (compressed image over the wire) and latency low (no per-request text encoding).
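Stripped of tensors, the per-request scoring step is small. A minimal pure-Python sketch of `softmax(cosine)` against a cached, L2-normalised text matrix (the `100.0` logit scale mirrors CLIP's usual default; all names here are illustrative, not the handler's actual code):

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Cosine similarity of the image embedding against each cached
    (already normalised) text embedding, softmax-ed into probabilities."""
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, t)) for t in text_embs]
    probs = softmax([logit_scale * s for s in sims])
    ranked = sorted(zip(labels, probs), key=lambda p: -p[1])
    return [{"label": lbl, "score": round(p, 4)} for lbl, p in ranked]

# Toy 2-D embeddings: the image points almost exactly at "cat".
texts = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
result = classify([0.9, 0.1], texts, ["cat", "dog"])
print(result[0]["label"])  # "cat" ranks first
```

Because the text matrix is fixed at startup, each request costs one image encode plus a single matrix-vector product.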
## 📝 Updating the label set

Edit `items.json`, rebuild the endpoint, done.

```json
[
{ "id": 0, "name": "cat", "prompt": "a photo of a cat" },
{ "id": 1, "name": "dog", "prompt": "a photo of a dog" },
…
]
```

- `id` is your internal numeric key (stays stable).
- `name` is the human-readable label returned to clients.
- `prompt` is what the model actually “sees” — tweak wording to improve accuracy.
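Before rebuilding, it can be worth sanity-checking the file. A hypothetical helper (not part of this repo) that enforces the three required fields and unique ids:

```python
import json

REQUIRED = {"id", "name", "prompt"}

def validate_items(items):
    """Raise ValueError if any entry is missing a field or reuses an id."""
    seen = set()
    for i, item in enumerate(items):
        missing = REQUIRED - item.keys()
        if missing:
            raise ValueError(f"entry {i} is missing {sorted(missing)}")
        if item["id"] in seen:
            raise ValueError(f"duplicate id {item['id']} at entry {i}")
        seen.add(item["id"])
    return len(items)

items = json.loads(
    '[{"id": 0, "name": "cat", "prompt": "a photo of a cat"},'
    ' {"id": 1, "name": "dog", "prompt": "a photo of a dog"}]'
)
print(validate_items(items))  # 2
```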
## ⚖️ Licence

- Weights: Apple AMLR (see `LICENSE_weights_data`).
- Code in this repo: MIT.
Maintained with ❤️ by Your Team — August 2025