---
license: mit
library_name: onnxruntime
tags:
- onnx
- multimodal
- clip
- clap
- audio
- image
- text
- embeddings
- feature-extraction
- antfly
- termite
pipeline_tag: feature-extraction
datasets:
- OpenSound/AudioCaps
---
|
|
|
|
|
# CLIPCLAP: Unified Text + Image + Audio Embeddings
|
|
|
|
|
CLIPCLAP is a unified multimodal embedding model that maps **text**, **images**, and **audio** into a shared 512-dimensional vector space. It combines OpenAI's [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) (text + image) with LAION's [CLAP](https://huggingface.co/laion/larger_clap_music_and_speech) (audio) through a trained linear projection. |
|
|
|
|
|
Built by [antflydb](https://github.com/antflydb) for use with [Termite](https://github.com/antflydb/antfly/tree/main/termite), a standalone ML inference service for embeddings, chunking, and reranking. |
|
|
|
|
|
## Architecture |
|
|
|
|
|
```
Text  ──▶ CLIP text encoder   ──▶ text_projection   ──▶ 512-dim (CLIP space)
Image ──▶ CLIP visual encoder ──▶ visual_projection ──▶ 512-dim (CLIP space)
Audio ──▶ CLAP audio encoder  ──▶ audio_projection  ──▶ 512-dim (CLIP space)
```
|
|
|
|
|
- **Text & Image**: Standard CLIP ViT-B/32 encoders and projections (unchanged from `openai/clip-vit-base-patch32`). |
|
|
- **Audio**: CLAP HTSAT audio encoder from `laion/larger_clap_music_and_speech`. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space (see the sketch below).
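Conceptually, the audio branch chains two projections and then normalizes. A minimal numpy sketch of that flow; the weight matrices are placeholders rather than the released parameters, and CLAP's internal projection is simplified here to a single affine map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights standing in for the real parameters:
# CLAP's native audio projection (1024 -> 512), simplified to one affine map,
# and the trained CLAP-space -> CLIP-space alignment layer (512 -> 512).
W_clap, b_clap = 0.02 * rng.standard_normal((512, 1024)), np.zeros(512)
W_align, b_align = 0.02 * rng.standard_normal((512, 512)), np.zeros(512)

def embed_audio(clap_features: np.ndarray) -> np.ndarray:
    """Map a 1024-dim CLAP audio-encoder feature into CLIP space (512-dim, unit norm)."""
    z = W_clap @ clap_features + b_clap   # CLAP audio space, 512-dim
    z = W_align @ z + b_align             # aligned to CLIP space, 512-dim
    return z / np.linalg.norm(z)          # L2 normalize

print(embed_audio(rng.standard_normal(1024)).shape)  # (512,)
```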
|
|
|
|
|
All three modalities produce **512-dimensional L2-normalized embeddings** that are directly comparable via cosine similarity. |
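Because the embeddings are unit-length, cosine similarity reduces to a dot product. A small illustrative check, with placeholder vectors standing in for real text, image, and audio outputs:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Placeholder unit vectors standing in for real text / image / audio embeddings.
rng = np.random.default_rng(1)
text_emb, image_emb, audio_emb = (normalize(rng.standard_normal(512)) for _ in range(3))

# Cosine similarity of L2-normalized vectors is just their dot product.
print("text vs image:", float(text_emb @ image_emb))
print("text vs audio:", float(text_emb @ audio_emb))
print("image vs audio:", float(image_emb @ audio_emb))
```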
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
- Multimodal search (text ↔ image ↔ audio)
|
|
- Building unified media indexes with [Antfly](https://github.com/antflydb/antfly) |
|
|
- Cross-modal retrieval (find images from audio queries, audio from text, etc.) |
|
|
- Audio-visual content discovery |
|
|
|
|
|
## How to Use with Termite |
|
|
|
|
|
```bash
# Pull and run the model
termite pull clipclap
termite run

# Embed text, an image, and an audio clip in one request
curl -X POST http://localhost:8082/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clipclap",
    "input": [
      {"type": "text", "text": "a cat sitting on a windowsill"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
    ]
  }'
```
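The same request can be issued from Python. A short sketch using `requests`, posting the payload shown above; the shape of the JSON response isn't documented here, so it is simply printed:

```python
import requests

payload = {
    "model": "clipclap",
    "input": [
        {"type": "text", "text": "a cat sitting on a windowsill"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}},
    ],
}

resp = requests.post("http://localhost:8082/embed", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # inspect the returned embeddings; field names depend on Termite's API
```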
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Audio Projection |
|
|
|
|
|
The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure: |
|
|
|
|
|
1. Load audio-caption pairs from [OpenSound/AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps) |
|
|
2. Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
|
|
3. Encode captions through CLIP: text encoder → text_projection → L2 normalize
|
|
4. Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)
|
|
|
|
|
The contrastive loss pushes matching audio-text pairs together while pushing non-matching pairs apart within each batch, preserving content discrimination. |
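A minimal sketch of this projection training, assuming PyTorch and precomputed, L2-normalized CLAP audio and CLIP caption embeddings; placeholder tensors stand in for the AudioCaps data, and the data loading, 90/10 split, and checkpointing are omitted:

```python
import torch
import torch.nn.functional as F

# Placeholder data standing in for precomputed embeddings of matched pairs:
# clap_audio[i] and clip_text[i] come from the same AudioCaps example.
N = 5000
clap_audio = F.normalize(torch.randn(N, 512), dim=-1)
clip_text = F.normalize(torch.randn(N, 512), dim=-1)

proj = torch.nn.Linear(512, 512)                      # CLAP audio space -> CLIP space
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
temperature = 0.07

for epoch in range(20):
    perm = torch.randperm(N)
    for start in range(0, N, 256):
        idx = perm[start:start + 256]
        a = F.normalize(proj(clap_audio[idx]), dim=-1)  # projected audio embeddings
        t = clip_text[idx]                              # caption embeddings

        logits = a @ t.T / temperature                  # (B, B) similarity matrix
        labels = torch.arange(len(idx))                 # diagonal = matching pairs

        # Symmetric InfoNCE: audio->text and text->audio cross-entropy.
        loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

        opt.zero_grad()
        loss.backward()
        opt.step()
```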
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Training dataset | OpenSound/AudioCaps |
| Samples | 5000 audio-caption pairs |
| Epochs | 20 |
| Batch size | 256 |
| Learning rate | 1e-3 |
| Optimizer | Adam |
| Loss | Symmetric InfoNCE (temperature = 0.07) |
| Train/val split | 90/10 |
|
|
|
|
|
### Source Models |
|
|
|
|
|
| Component | Model |
|-----------|-------|
| CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| CLAP | [laion/larger_clap_music_and_speech](https://huggingface.co/laion/larger_clap_music_and_speech) |
|
|
|
|
|
## ONNX Files |
|
|
|
|
|
| File | Description | Size |
|------|-------------|------|
| `text_model.onnx` | CLIP text encoder | ~254 MB |
| `visual_model.onnx` | CLIP visual encoder | ~330 MB |
| `text_projection.onnx` | CLIP text projection (512→512) | ~4 KB |
| `visual_projection.onnx` | CLIP visual projection (768→512) | ~6 KB |
| `audio_model.onnx` | CLAP HTSAT audio encoder | ~590 MB |
| `audio_projection.onnx` | Combined CLAP→CLIP projection (1024→512) | ~8 KB |
|
|
|
|
|
Additional files: `clip_config.json`, `tokenizer.json`, `preprocessor_config.json`, `projection_training_metadata.json`. |
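To use the ONNX files directly with ONNX Runtime (outside Termite), a sensible first step is inspecting each graph's inputs and outputs, since the exact tensor names and shapes are not listed in this card. A short sketch, assuming the files sit in the current directory:

```python
import onnxruntime as ort

files = [
    "text_model.onnx", "visual_model.onnx", "text_projection.onnx",
    "visual_projection.onnx", "audio_model.onnx", "audio_projection.onnx",
]

for path in files:
    # Load each graph on CPU and print its input/output signatures.
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    inputs = [(i.name, i.shape, i.type) for i in sess.get_inputs()]
    outputs = [(o.name, o.shape, o.type) for o in sess.get_outputs()]
    print(path, "inputs:", inputs, "outputs:", outputs)
```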
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Audio duration**: Audio is truncated to ~10 seconds (inherited from CLAP) |
|
|
- **Language**: Primarily English text support |
|
|
- **Audio-visual alignment**: The projection is trained via caption similarity (audio ↔ text ↔ image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
|
|
- **CLIP limitations**: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts |
|
|
- **Training data**: The audio projection is trained on AudioCaps, which covers common environmental sounds; it may underperform on niche audio domains
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use CLIPCLAP, please cite the underlying models: |
|
|
|
|
|
```bibtex
@inproceedings{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={ICML},
  year={2021}
}

@inproceedings{wu2023clap,
  title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
  booktitle={ICASSP},
  year={2023}
}
```
|
|
|