---
title: VidEmbed
emoji: 💻
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.0
app_file: app.py
pinned: false
license: cc
short_description: generate embeddings of youtube video text and image data
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Hugging Face Space for multimodal embeddings

Files:

- app.py
- requirements.txt
- finetuned_multimodal.pt <-- upload your ~3GB checkpoint here

How to deploy:

1. Create a new Space on Hugging Face (https://huggingface.co/spaces).
   - SDK: Gradio
   - Hardware: if you need GPU inference, switch to a GPU runtime (note: GPU access may require a paid plan).
2. Upload these files to the Space repository. For a large checkpoint (~3GB), use Git LFS or the web UI file upload (the web UI supports large uploads).
3. Wait for the Space to build. The app serves:
   - the web UI at `/` (Gradio)
   - the API endpoint at `/api/get_embedding`
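Step 2 can also be done from the command line. A sketch using Git LFS, where `your-username` and the local checkpoint path are placeholders you should substitute:

```shell
# Track .pt files with Git LFS so the ~3GB checkpoint uploads correctly.
git lfs install
git clone https://huggingface.co/spaces/your-username/VidEmbed
cd VidEmbed
git lfs track "*.pt"
cp /path/to/finetuned_multimodal.pt .
git add .gitattributes app.py requirements.txt finetuned_multimodal.pt
git commit -m "Add app, requirements, and checkpoint"
git push
```
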
API example:

POST JSON to `/api/get_embedding`:

```json
{
  "title": "My video",
  "description": "Some description",
  "tags": "tag1,tag2",
  "thumbnail_url": "https://..."
}
```

Response:

```json
{
  "embedding": [0.123, -0.456, ...]
}
```

The `embedding` field is the fused vector used in training.
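A minimal client sketch for the request above, using only the standard library; the Space URL in the usage comment is a placeholder, not a real deployment:

```python
import json
import urllib.request

def build_payload(title, description, tags, thumbnail_url):
    """Assemble the JSON body expected by /api/get_embedding."""
    return {
        "title": title,
        "description": description,
        "tags": tags,
        "thumbnail_url": thumbnail_url,
    }

def get_embedding(space_url, **fields):
    """POST video metadata to the Space and return the fused embedding list."""
    body = json.dumps(build_payload(**fields)).encode("utf-8")
    req = urllib.request.Request(
        f"{space_url}/api/get_embedding",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["embedding"]

# Usage (requires a deployed Space; the URL is a placeholder):
# emb = get_embedding("https://your-username-vidembed.hf.space",
#                     title="My video", description="Some description",
#                     tags="tag1,tag2", thumbnail_url="https://example.com/t.jpg")
```
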
Notes:

- The app replicates the same fused embedding pipeline used in your notebook:

      text -> text encoder -> text_proj
      thumbnail -> image encoder -> img_proj
      fused = MultiheadAttention(query=text_proj, key=img_proj, value=img_proj)

  `fused` is returned as the embedding vector.

- If your checkpoint uses different key names, the loader prints a warning; update the loader accordingly.
- If your model requires different shapes (text_dim/img_dim/proj_dim), adjust the MultimodalRegressor init params.
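The fusion step in the first note can be sketched in PyTorch. The dimensions (768/512/256) and `num_heads=4` below are illustrative assumptions; match them to your MultimodalRegressor init params:

```python
import torch
import torch.nn as nn

# Illustrative dimensions -- substitute your model's actual
# text_dim / img_dim / proj_dim values.
text_dim, img_dim, proj_dim = 768, 512, 256

text_proj = nn.Linear(text_dim, proj_dim)  # text encoder output -> shared space
img_proj = nn.Linear(img_dim, proj_dim)    # image encoder output -> shared space
fusion = nn.MultiheadAttention(embed_dim=proj_dim, num_heads=4, batch_first=True)

# Dummy encoder outputs: one (batch, seq_len=1, dim) token per modality.
text_feat = torch.randn(1, 1, text_dim)
img_feat = torch.randn(1, 1, img_dim)

q = text_proj(text_feat)
kv = img_proj(img_feat)
fused, _ = fusion(query=q, key=kv, value=kv)  # text attends over image features
embedding = fused.squeeze(0).squeeze(0)       # the vector the API returns
print(embedding.shape)
```
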
|