---
license: mit
library_name: onnxruntime
tags:
- onnx
- multimodal
- clip
- clap
- audio
- image
- text
- embeddings
- feature-extraction
- antfly
- termite
pipeline_tag: feature-extraction
datasets:
- OpenSound/AudioCaps
---
# CLIPCLAP — Unified Text + Image + Audio Embeddings
CLIPCLAP is a unified multimodal embedding model that maps text, images, and audio into a shared 512-dimensional vector space. It combines OpenAI's CLIP (text + image) with LAION's CLAP (audio) through a trained linear projection.
Built by antflydb for use with Termite, a standalone ML inference service for embeddings, chunking, and reranking.
## Architecture

```
Text  ──→ CLIP text encoder    ──→ text_projection    ──→ 512-dim (CLIP space)
Image ──→ CLIP visual encoder  ──→ visual_projection  ──→ 512-dim (CLIP space)
Audio ──→ CLAP audio encoder   ──→ audio_projection   ──→ 512-dim (CLIP space)
```
- Text & Image: Standard CLIP ViT-B/32 encoders and projections (unchanged from openai/clip-vit-base-patch32).
- Audio: CLAP HTSAT audio encoder from laion/larger_clap_music_and_speech. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.
All three modalities produce 512-dimensional L2-normalized embeddings that are directly comparable via cosine similarity.
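As a minimal sketch of what "directly comparable" means in practice (random vectors stand in for real embeddings here; the actual values come from the encoders above):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, as the model does for every modality."""
    return v / np.linalg.norm(v)

# Hypothetical 512-dim embeddings, one per modality.
rng = np.random.default_rng(0)
text_emb = l2_normalize(rng.standard_normal(512))
image_emb = l2_normalize(rng.standard_normal(512))
audio_emb = l2_normalize(rng.standard_normal(512))

# Because all vectors are unit-length, cosine similarity is just a dot product.
print("text vs image:", float(text_emb @ image_emb))
print("text vs audio:", float(text_emb @ audio_emb))
```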
## Intended Uses
- Multimodal search (text↔image↔audio)
- Building unified media indexes with Antfly
- Cross-modal retrieval (find images from audio queries, audio from text, etc.)
- Audio-visual content discovery
## How to Use with Termite

```bash
# Pull and run the model
termite pull clipclap
termite run

# Embed text, an image, and audio in one request
curl -X POST http://localhost:8082/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clipclap",
    "input": [
      {"type": "text", "text": "a cat sitting on a windowsill"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
    ]
  }'
```
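The same request can be issued from Python with only the standard library. The endpoint and payload shape mirror the curl example above; the response format is not shown here:

```python
import json
import urllib.error
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "clipclap",
    "input": [
        {"type": "text", "text": "a cat sitting on a windowsill"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}},
    ],
}

req = urllib.request.Request(
    "http://localhost:8082/embed",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp))
except urllib.error.URLError as exc:
    # No Termite server reachable on localhost:8082.
    print("request not sent:", exc)
```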
## Training Details

### Audio Projection
The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure:
- Load audio-caption pairs from OpenSound/AudioCaps
- Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
- Encode captions through CLIP: text encoder → text_projection → L2 normalize
- Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)
The contrastive loss pushes matching audio-text pairs together while pushing non-matching pairs apart within each batch, preserving content discrimination.
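A toy NumPy implementation of this symmetric InfoNCE objective (an illustration of the loss computation only; batching, optimizer, and the projection itself are omitted):

```python
import numpy as np

def symmetric_info_nce(audio: np.ndarray, text: np.ndarray,
                       temperature: float = 0.07) -> float:
    """CLIP-style contrastive loss over a batch of L2-normalized embeddings.

    audio, text: (batch, dim) arrays where row i of each is a matching pair.
    """
    logits = (audio @ text.T) / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matching pair sits on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Perfectly aligned pairs give a near-zero loss.
rng = np.random.default_rng(1)
batch = rng.standard_normal((8, 512))
batch /= np.linalg.norm(batch, axis=1, keepdims=True)
print("aligned loss:", symmetric_info_nce(batch, batch))
```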
### Hyperparameters
| Parameter | Value |
|---|---|
| Training dataset | OpenSound/AudioCaps |
| Samples | 5000 audio-caption pairs |
| Epochs | 20 |
| Batch size | 256 |
| Learning rate | 1e-3 |
| Optimizer | Adam |
| Loss | Symmetric InfoNCE (temperature=0.07) |
| Train/val split | 90/10 |
## Source Models
| Component | Model |
|---|---|
| CLIP | openai/clip-vit-base-patch32 |
| CLAP | laion/larger_clap_music_and_speech |
## ONNX Files

| File | Description | Size |
|---|---|---|
| text_model.onnx | CLIP text encoder | ~254 MB |
| visual_model.onnx | CLIP visual encoder | ~330 MB |
| text_projection.onnx | CLIP text projection (512→512) | ~4 KB |
| visual_projection.onnx | CLIP visual projection (768→512) | ~6 KB |
| audio_model.onnx | CLAP HTSAT audio encoder | ~590 MB |
| audio_projection.onnx | Combined CLAP→CLIP projection (1024→512) | ~8 KB |
Additional files: clip_config.json, tokenizer.json, preprocessor_config.json, projection_training_metadata.json.
## Limitations
- Audio duration: Audio is truncated to ~10 seconds (inherited from CLAP)
- Language: Primarily English text support
- Audio-visual alignment: The projection is trained via caption similarity (audio↔text↔image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
- CLIP limitations: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts
- Training data: The audio projection was trained on AudioCaps, which covers common environmental sounds; it may underperform on niche audio domains
## Citation
If you use CLIPCLAP, please cite the underlying models:
```bibtex
@inproceedings{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={ICML},
  year={2021}
}

@inproceedings{wu2023clap,
  title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
  booktitle={ICASSP},
  year={2023}
}
```