---
title: VidEmbed
emoji: 💻
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.0
app_file: app.py
pinned: false
license: cc
short_description: generate embeddings of youtube video text and image data
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Hugging Face Space for multimodal embeddings

Files:

- app.py
- requirements.txt
- finetuned_multimodal.pt  <-- upload your 3GB checkpoint here

How to deploy:

1. Create a new Space on Hugging Face (https://huggingface.co/spaces).
   - SDK: Gradio
   - Hardware: if you need GPU inference, switch to a GPU runtime (note: GPU access may require a paid plan).
2. Upload these files to the Space repository. For a large checkpoint (~3GB), use Git LFS or the web UI file upload (the web UI supports large uploads).
3. Wait for the Space to build.

The app serves:

- a web UI at `/` (Gradio)
- an API endpoint at `/api/get_embedding`

API example: POST JSON to `/api/get_embedding`:

    {
      "title": "My video",
      "description": "Some description",
      "tags": "tag1,tag2",
      "thumbnail_url": "https://..."
    }

Response (the `embedding` field is the fused vector used in training):

    {
      "embedding": [0.123, -0.456, ...]
    }

Notes:

- The app replicates the fused embedding pipeline used in your notebook:
  - text -> text encoder -> text_proj
  - thumbnail -> image encoder -> img_proj
  - fused = MultiheadAttention(query=text_proj, key=img_proj, value=img_proj)
  - fused is returned as the embedding vector.
- If your checkpoint uses different key names, the loader prints a warning; update the loader accordingly.
- If your model requires different shapes (text_dim/img_dim/proj_dim), adjust the MultimodalRegressor init params.
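The API contract above can be exercised from Python. This is a minimal sketch: the endpoint path comes from this README, but the Space URL is a placeholder you must replace, and the actual POST is left commented out since it requires a live Space and the `requests` package.

```python
import json

# Payload matching the /api/get_embedding contract described above.
payload = {
    "title": "My video",
    "description": "Some description",
    "tags": "tag1,tag2",
    "thumbnail_url": "https://...",
}

# Hypothetical Space URL -- substitute your own username/space name.
url = "https://your-username-vidembed.hf.space/api/get_embedding"

# Uncomment to send the request against a live Space:
# import requests
# resp = requests.post(url, json=payload, timeout=60)
# embedding = resp.json()["embedding"]

print(json.dumps(payload, indent=2))
```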
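The fusion pipeline in the notes (text_proj, img_proj, cross-attention) can be sketched in PyTorch. The dimensions and head count below are assumptions for illustration; match them to your MultimodalRegressor's text_dim/img_dim/proj_dim.

```python
import torch
import torch.nn as nn

# Assumed dims for illustration -- adjust to your checkpoint.
text_dim, img_dim, proj_dim, num_heads = 768, 512, 256, 4

text_proj = nn.Linear(text_dim, proj_dim)  # text encoder output -> shared space
img_proj = nn.Linear(img_dim, proj_dim)    # image encoder output -> shared space
attn = nn.MultiheadAttention(proj_dim, num_heads, batch_first=True)

# Dummy encoder outputs: one text vector and one thumbnail vector per sample.
text_feat = torch.randn(1, 1, text_dim)
img_feat = torch.randn(1, 1, img_dim)

q = text_proj(text_feat)
kv = img_proj(img_feat)
fused, _ = attn(query=q, key=kv, value=kv)  # text attends to image features
embedding = fused.squeeze()                 # the vector returned by the API

print(embedding.shape)  # torch.Size([256])
```

The choice of text as the query means the fused vector stays aligned with the text representation while mixing in visual context.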
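The note about mismatched checkpoint key names can be handled with a non-strict state-dict load, which reports mismatches instead of raising. A minimal sketch with a stand-in module (in the Space the model would be MultimodalRegressor):

```python
import torch
import torch.nn as nn

# Stand-in module for illustration; the real model is MultimodalRegressor.
model = nn.Linear(4, 2)

# Simulate a checkpoint whose key names differ from the model's.
ckpt = {"fc.weight": torch.randn(2, 4), "fc.bias": torch.randn(2)}

# strict=False surfaces mismatches as lists instead of raising,
# mirroring the "loader prints a warning" behaviour noted above.
result = model.load_state_dict(ckpt, strict=False)
if result.missing_keys or result.unexpected_keys:
    print("warning: checkpoint key mismatch")
    print("  missing:", result.missing_keys)        # keys the model expects
    print("  unexpected:", result.unexpected_keys)  # keys the checkpoint has
```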