---
title: VidEmbed
emoji: 💻
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.0
app_file: app.py
pinned: false
license: cc
short_description: Generate embeddings of YouTube video text and image data
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Hugging Face Space for multimodal embeddings

Files:
- `app.py`
- `requirements.txt`
- `finetuned_multimodal.pt`  <-- upload your ~3 GB checkpoint here

How to deploy:
1. Create a new Space on Hugging Face (https://huggingface.co/spaces).
   - SDK: Gradio
   - Hardware: If you need GPU inference, switch to a GPU runtime (note: GPU access may require a paid plan).
2. Upload these files to the Space repository. For the large checkpoint (~3 GB), use Git LFS or the web UI file upload, which also handles large files.
3. Wait for the Space to build. The app serves:
   - Web UI at `/` (Gradio)
   - API endpoint at `/api/get_embedding`

API example:

POST JSON to `/api/get_embedding`:

```json
{
  "title": "My video",
  "description": "Some description",
  "tags": "tag1,tag2",
  "thumbnail_url": "https://..."
}
```

Response:

```json
{
  "embedding": [0.123, -0.456, ...]
}
```

The `embedding` field is the fused vector used in training.
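A minimal client for the endpoint above, using only the standard library. The Space URL is a placeholder you must replace; the payload fields and response shape follow the request/response documented here:

```python
import json
import urllib.request

SPACE_URL = "https://YOUR-SPACE.hf.space"  # hypothetical; replace with your Space's URL


def build_payload(title, description, tags, thumbnail_url):
    """Assemble the JSON body expected by /api/get_embedding."""
    return {
        "title": title,
        "description": description,
        "tags": tags,
        "thumbnail_url": thumbnail_url,
    }


def get_embedding(title, description, tags, thumbnail_url):
    """POST video metadata to the Space and return the fused embedding vector."""
    data = json.dumps(build_payload(title, description, tags, thumbnail_url)).encode()
    req = urllib.request.Request(
        f"{SPACE_URL}/api/get_embedding",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["embedding"]
```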

Notes:
- The app replicates the same fused embedding pipeline used in your notebook:

  ```
  text -> text encoder -> text_proj
  thumbnail -> image encoder -> img_proj
  fused = MultiheadAttention(query=text_proj, key=img_proj, value=img_proj)
  ```

  The fused output is returned as the embedding vector.
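A minimal PyTorch sketch of that fusion step. The dimensions (`text_dim`, `img_dim`, `proj_dim`, `num_heads`) are hypothetical defaults, not the checkpoint's actual values; match them to your model:

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Text/image projection followed by cross-attention fusion (sketch)."""

    def __init__(self, text_dim=384, img_dim=512, proj_dim=256, num_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.img_proj = nn.Linear(img_dim, proj_dim)
        self.attn = nn.MultiheadAttention(proj_dim, num_heads, batch_first=True)

    def forward(self, text_feat, img_feat):
        # text_feat: (batch, text_dim); img_feat: (batch, img_dim)
        q = self.text_proj(text_feat).unsqueeze(1)   # (batch, 1, proj_dim)
        kv = self.img_proj(img_feat).unsqueeze(1)    # (batch, 1, proj_dim)
        fused, _ = self.attn(q, kv, kv)              # query=text, key=value=image
        return fused.squeeze(1)                      # (batch, proj_dim)
```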
- If your checkpoint uses different key names, the loader prints a warning; update the loader mapping accordingly.
- If your model uses different shapes (`text_dim`, `img_dim`, `proj_dim`), adjust the `MultimodalRegressor` init params to match.
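A tolerant checkpoint loader along those lines, as a sketch: it loads with `strict=False` and prints any mismatched keys instead of crashing. The filename matches the one listed above; the `"state_dict"` unwrapping is an assumption about how the checkpoint may have been saved:

```python
import torch


def load_checkpoint(model, path="finetuned_multimodal.pt"):
    """Load a state dict into model, warning about key mismatches instead of failing."""
    state = torch.load(path, map_location="cpu")
    # Some checkpoints wrap the weights under a "state_dict" key (assumption).
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    missing, unexpected = model.load_state_dict(state, strict=False)
    if missing:
        print("Warning: missing keys:", missing)
    if unexpected:
        print("Warning: unexpected keys:", unexpected)
    return model
```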