---
license: mit
library_name: onnxruntime
tags:
- onnx
- multimodal
- clip
- clap
- audio
- image
- text
- embeddings
- feature-extraction
- antfly
- termite
pipeline_tag: feature-extraction
datasets:
- OpenSound/AudioCaps
---

# CLIPCLAP — Unified Text + Image + Audio Embeddings

CLIPCLAP is a unified multimodal embedding model that maps **text**, **images**, and **audio** into a shared 512-dimensional vector space. It combines OpenAI's [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) (text + image) with LAION's [CLAP](https://huggingface.co/laion/larger_clap_music_and_speech) (audio) through a trained linear projection.

Built by [antflydb](https://github.com/antflydb) for use with [Termite](https://github.com/antflydb/antfly/tree/main/termite), a standalone ML inference service for embeddings, chunking, and reranking.

## Architecture

```
Text  ──→ CLIP text encoder   ──→ text_projection   ──→ 512-dim (CLIP space)
Image ──→ CLIP visual encoder ──→ visual_projection ──→ 512-dim (CLIP space)
Audio ──→ CLAP audio encoder  ──→ audio_projection  ──→ 512-dim (CLIP space)
```

- **Text & Image**: Standard CLIP ViT-B/32 encoders and projections (unchanged from `openai/clip-vit-base-patch32`).
- **Audio**: CLAP HTSAT audio encoder from `laion/larger_clap_music_and_speech`. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.

All three modalities produce **512-dimensional L2-normalized embeddings** that are directly comparable via cosine similarity.

## Intended Uses

- Multimodal search (text↔image↔audio)
- Building unified media indexes with [Antfly](https://github.com/antflydb/antfly)
- Cross-modal retrieval (find images from audio queries, audio from text, etc.)
- Audio-visual content discovery

## How to Use with Termite

```bash
# Pull and run the model
termite pull clipclap
termite run

# Embed text, an image, and an audio clip in one request
curl -X POST http://localhost:8082/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clipclap",
    "input": [
      {"type": "text", "text": "a cat sitting on a windowsill"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
    ]
  }'
```

## Training Details

### Audio Projection

The audio projection layer bridges the CLAP and CLIP embedding spaces. Training procedure:

1. Load audio-caption pairs from [OpenSound/AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps)
2. Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
3. Encode captions through CLIP: text encoder → text_projection → L2 normalize
4. Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)

The contrastive loss pulls matching audio-text pairs together while pushing non-matching pairs apart within each batch, preserving content discrimination.
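For concreteness, here is a minimal PyTorch sketch of this training setup, using the hyperparameters listed below. It is not the actual training script: the precomputed embedding tensors, file names, and the epoch loop details (shuffling, and the omitted 90/10 validation split) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed inputs: precomputed, L2-normalized embeddings for matched
# audio-caption pairs (CLAP audio and CLIP text), each of shape [N, 512].
# The file names here are hypothetical.
audio_emb = torch.load("clap_audio_embeddings.pt")
text_emb = torch.load("clip_text_embeddings.pt")

proj = torch.nn.Linear(512, 512)  # the trained CLAP-audio-to-CLIP map
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
temperature = 0.07
batch_size = 256

for epoch in range(20):
    perm = torch.randperm(len(audio_emb))  # reshuffle pairs each epoch
    for i in range(0, len(perm), batch_size):
        idx = perm[i:i + batch_size]
        a = F.normalize(proj(audio_emb[idx]), dim=-1)
        t = text_emb[idx]  # already L2-normalized

        # Cosine-similarity logits for every audio/text pairing in the
        # batch; the matching pairs sit on the diagonal.
        logits = a @ t.T / temperature
        labels = torch.arange(len(idx))

        # Symmetric InfoNCE: average of the audio-to-text and
        # text-to-audio cross-entropy losses.
        loss = (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.T, labels)) / 2

        opt.zero_grad()
        loss.backward()
        opt.step()
```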
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Training dataset | OpenSound/AudioCaps |
| Samples | 5000 audio-caption pairs |
| Epochs | 20 |
| Batch size | 256 |
| Learning rate | 1e-3 |
| Optimizer | Adam |
| Loss | Symmetric InfoNCE (temperature = 0.07) |
| Train/val split | 90/10 |

### Source Models

| Component | Model |
|-----------|-------|
| CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| CLAP | [laion/larger_clap_music_and_speech](https://huggingface.co/laion/larger_clap_music_and_speech) |

## ONNX Files

| File | Description | Size |
|------|-------------|------|
| `text_model.onnx` | CLIP text encoder | ~254 MB |
| `visual_model.onnx` | CLIP visual encoder | ~330 MB |
| `text_projection.onnx` | CLIP text projection (512→512) | ~4 KB |
| `visual_projection.onnx` | CLIP visual projection (768→512) | ~6 KB |
| `audio_model.onnx` | CLAP HTSAT audio encoder | ~590 MB |
| `audio_projection.onnx` | Combined CLAP→CLIP projection (1024→512) | ~8 KB |

Additional files: `clip_config.json`, `tokenizer.json`, `preprocessor_config.json`, `projection_training_metadata.json`.

## Limitations

- **Audio duration**: Audio is truncated to ~10 seconds (inherited from CLAP)
- **Language**: Text support is primarily English
- **Audio-visual alignment**: The projection is trained via caption similarity (audio↔text↔image), not on direct audio-image pairs, so audio-to-image retrieval may be less precise than text-to-image
- **CLIP limitations**: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts
- **Training data**: The audio projection was trained on AudioCaps, which covers common environmental sounds; it may underperform on niche audio domains

## Citation

If you use CLIPCLAP, please cite the underlying models:

```bibtex
@inproceedings{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={ICML},
  year={2021}
}

@inproceedings{wu2023clap,
  title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
  booktitle={ICASSP},
  year={2023}
}
```
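## Example: Composing the ONNX Files Directly

Outside of Termite, the exported files can be chained by hand: encoder model, then the matching projection, then L2 normalization. Below is a minimal text-embedding sketch using `onnxruntime` and the `tokenizers` library. The tensor names and output ordering are assumptions about the export, not a documented interface; verify them with `session.get_inputs()` / `session.get_outputs()` before relying on this.

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Tensor names ("input_ids", "attention_mask", "input") are assumptions;
# inspect the real ones on each session before use.
text_model = ort.InferenceSession("text_model.onnx")
text_proj = ort.InferenceSession("text_projection.onnx")

# tokenizer.json ships alongside the ONNX files. Depending on the export,
# padding/truncation to CLIP's 77-token context may also be required.
tokenizer = Tokenizer.from_file("tokenizer.json")
enc = tokenizer.encode("a cat sitting on a windowsill")
input_ids = np.array([enc.ids], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# Assumed: the first encoder output is the pooled [batch, 512] text feature.
pooled = text_model.run(None, {"input_ids": input_ids,
                               "attention_mask": attention_mask})[0]

# Project into the shared space and L2-normalize, as the model card describes.
embedding = text_proj.run(None, {"input": pooled.astype(np.float32)})[0]
embedding /= np.linalg.norm(embedding, axis=-1, keepdims=True)

# Embeddings from any modality are now comparable via cosine similarity,
# e.g. float(embedding @ other_embedding.T).
```

The image and audio paths compose the same way, substituting `visual_model.onnx` + `visual_projection.onnx` or `audio_model.onnx` + `audio_projection.onnx` with the preprocessing described in `preprocessor_config.json`.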