---
license: mit
library_name: onnxruntime
tags:
- onnx
- multimodal
- clip
- clap
- audio
- image
- text
- embeddings
- feature-extraction
- antfly
- termite
pipeline_tag: feature-extraction
datasets:
- OpenSound/AudioCaps
---
# CLIPCLAP — Unified Text + Image + Audio Embeddings
CLIPCLAP is a unified multimodal embedding model that maps **text**, **images**, and **audio** into a shared 512-dimensional vector space. It combines OpenAI's [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) (text + image) with LAION's [CLAP](https://huggingface.co/laion/larger_clap_music_and_speech) (audio) through a trained linear projection.
Built by [antflydb](https://github.com/antflydb) for use with [Termite](https://github.com/antflydb/antfly/tree/main/termite), a standalone ML inference service for embeddings, chunking, and reranking.
## Architecture
```
Text ──→ CLIP text encoder ──→ text_projection ──→ 512-dim (CLIP space)
Image ──→ CLIP visual encoder ──→ visual_projection ──→ 512-dim (CLIP space)
Audio ──→ CLAP audio encoder ──→ audio_projection ──→ 512-dim (CLIP space)
```
- **Text & Image**: Standard CLIP ViT-B/32 encoders and projections (unchanged from `openai/clip-vit-base-patch32`).
- **Audio**: CLAP HTSAT audio encoder from `laion/larger_clap_music_and_speech`. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.
All three modalities produce **512-dimensional L2-normalized embeddings** that are directly comparable via cosine similarity.
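Because every embedding is unit-length, cosine similarity between any two modalities reduces to a dot product. A minimal NumPy sketch (the vectors below are random placeholders, not real model output):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, as the model does for every embedding."""
    return v / np.linalg.norm(v)

# Placeholder 512-dim embeddings standing in for real encoder output.
rng = np.random.default_rng(0)
text_emb = l2_normalize(rng.standard_normal(512))
image_emb = l2_normalize(rng.standard_normal(512))

# For unit vectors, cosine similarity is just the dot product.
similarity = float(text_emb @ image_emb)
assert -1.0 <= similarity <= 1.0
```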
## Intended Uses
- Multimodal search (text↔image↔audio)
- Building unified media indexes with [Antfly](https://github.com/antflydb/antfly)
- Cross-modal retrieval (find images from audio queries, audio from text, etc.)
- Audio-visual content discovery
## How to Use with Termite
```bash
# Pull and run the model
termite pull clipclap
termite run
# Embed text
curl -X POST http://localhost:8082/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clipclap",
    "input": [
      {"type": "text", "text": "a cat sitting on a windowsill"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
    ]
  }'
```
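Once the embedding vectors are extracted from the response (the exact JSON schema is not reproduced here), cross-modal ranking is a dot product over the already-normalized vectors. A hypothetical helper, with toy 4-dimensional vectors standing in for real 512-dimensional model output:

```python
import numpy as np

def rank_by_similarity(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to the query.

    Embeddings are assumed already L2-normalized (as CLIPCLAP outputs),
    so the dot product equals cosine similarity.
    """
    q = np.asarray(query_emb)
    c = np.asarray(candidate_embs)
    scores = c @ q
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors: a text query against an image and an audio embedding.
query = [1.0, 0.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0, 0.0],   # orthogonal -> similarity 0.0
              [1.0, 0.0, 0.0, 0.0]]   # identical  -> similarity 1.0
ranked = rank_by_similarity(query, candidates)  # candidate 1 ranks first
```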
## Training Details
### Audio Projection
The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure:
1. Load audio-caption pairs from [OpenSound/AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps)
2. Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
3. Encode captions through CLIP: text encoder → text_projection → L2 normalize
4. Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)
The contrastive loss pulls matching audio-text pairs together while pushing non-matching pairs apart within each batch, so the projection preserves the content distinctions CLAP already encodes.
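The loss in step 4 can be sketched as follows. This is an illustrative NumPy re-implementation of symmetric InfoNCE, not the actual training code:

```python
import numpy as np

def symmetric_info_nce(audio_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of L2-normalized embedding pairs.

    Row i of audio_emb and row i of text_emb form a matching pair; every
    other row in the batch serves as a negative.
    """
    logits = (audio_emb @ text_emb.T) / temperature  # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)      # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()     # diagonal = matching pairs

    # Average the audio->text and text->audio directions.
    return float((cross_entropy(logits) + cross_entropy(logits.T)) / 2)
```

Matching pairs sit on the diagonal of the logits matrix; the loss falls as diagonal similarities grow relative to off-diagonal ones.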
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Training dataset | OpenSound/AudioCaps |
| Samples | 5000 audio-caption pairs |
| Epochs | 20 |
| Batch size | 256 |
| Learning rate | 1e-3 |
| Optimizer | Adam |
| Loss | Symmetric InfoNCE (temperature=0.07) |
| Train/val split | 90/10 |
### Source Models
| Component | Model |
|-----------|-------|
| CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| CLAP | [laion/larger_clap_music_and_speech](https://huggingface.co/laion/larger_clap_music_and_speech) |
## ONNX Files
| File | Description | Size |
|------|-------------|------|
| `text_model.onnx` | CLIP text encoder | ~254 MB |
| `visual_model.onnx` | CLIP visual encoder | ~330 MB |
| `text_projection.onnx` | CLIP text projection (512→512) | ~4 KB |
| `visual_projection.onnx` | CLIP visual projection (768→512) | ~6 KB |
| `audio_model.onnx` | CLAP HTSAT audio encoder | ~590 MB |
| `audio_projection.onnx` | Combined CLAP→CLIP projection (1024→512) | ~8 KB |
Additional files: `clip_config.json`, `tokenizer.json`, `preprocessor_config.json`, `projection_training_metadata.json`.
## Limitations
- **Audio duration**: Audio is truncated to ~10 seconds (inherited from CLAP)
- **Language**: Text support is primarily English
- **Audio-visual alignment**: The projection is trained via caption similarity (audio↔text↔image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
- **CLIP limitations**: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts
- **Training data**: The audio projection was trained on AudioCaps, which covers common environmental sounds; it may underperform on niche audio domains
## Citation
If you use CLIPCLAP, please cite the underlying models:
```bibtex
@inproceedings{radford2021clip,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
booktitle={ICML},
year={2021}
}
@inproceedings{wu2023clap,
title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
booktitle={ICASSP},
year={2023}
}
```