	---
	license: mit
	library_name: onnxruntime
	tags:
	- onnx
	- multimodal
	- clip
	- clap
	- audio
	- image
	- text
	- embeddings
	- feature-extraction
	- antfly
	- termite
	pipeline_tag: feature-extraction
	datasets:
	- OpenSound/AudioCaps
	---

	# CLIPCLAP — Unified Text + Image + Audio Embeddings

	CLIPCLAP is a unified multimodal embedding model that maps text, images, and audio into a shared 512-dimensional vector space. It combines OpenAI's [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) (text + image) with LAION's [CLAP](https://huggingface.co/laion/larger_clap_music_and_speech) (audio) through a trained linear projection.

	Built by [antflydb](https://github.com/antflydb) for use with [Termite](https://github.com/antflydb/antfly/tree/main/termite), a standalone ML inference service for embeddings, chunking, and reranking.

	## Architecture

	```
	Text ──→ CLIP text encoder ──→ text_projection ──→ 512-dim (CLIP space)
	Image ──→ CLIP visual encoder ──→ visual_projection ──→ 512-dim (CLIP space)
	Audio ──→ CLAP audio encoder ──→ audio_projection ──→ 512-dim (CLIP space)
	```

	- Text & Image: Standard CLIP ViT-B/32 encoders and projections (unchanged from `openai/clip-vit-base-patch32`).
	- Audio: CLAP HTSAT audio encoder from `laion/larger_clap_music_and_speech`. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.

	All three modalities produce 512-dimensional L2-normalized embeddings that are directly comparable via cosine similarity.
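
Because the embeddings are unit length, cosine similarity reduces to a plain dot product. The following is a minimal sketch of cross-modal ranking under that assumption. The 4-dimensional vectors are made-up stand-ins for real 512-dimensional embeddings, and the helper names (`l2_normalize`, `cosine_similarity`, `rank`) are illustrative, not part of Termite or Antfly.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def rank(query, candidates):
    """Return candidate names sorted by descending similarity to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )

# Toy vectors standing in for 512-dim embeddings (values are invented).
text_emb = l2_normalize([0.9, 0.1, 0.0, 0.2])
image_emb = l2_normalize([0.8, 0.2, 0.1, 0.1])
audio_emb = l2_normalize([0.0, 0.1, 0.9, 0.3])

ranking = rank(text_emb, {"image": image_emb, "audio": audio_emb})
```

The same `rank` call works for any query modality, since all three encoders target the same space.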

	## Intended Uses

	- Multimodal search (text↔image↔audio)
	- Building unified media indexes with [Antfly](https://github.com/antflydb/antfly)
	- Cross-modal retrieval (find images from audio queries, audio from text, etc.)
	- Audio-visual content discovery

	## How to Use with Termite

	```bash
	# Pull and run the model
	termite pull clipclap
	termite run

	# Embed text, an image, and an audio clip in a single request
	curl -X POST http://localhost:8082/embed \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "clipclap",
	    "input": [
	      {"type": "text", "text": "a cat sitting on a windowsill"},
	      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
	      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
	    ]
	  }'
	```
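
The same request can be issued from Python. This is a minimal sketch using only the standard library: the URL and payload shape are copied from the curl example above, while the helper names `build_embed_request` and `send` are hypothetical. The response schema is not documented here, so `send` simply returns the parsed JSON as-is.

```python
import json
import urllib.request

TERMITE_URL = "http://localhost:8082/embed"  # endpoint taken from the curl example

def build_embed_request(inputs, model="clipclap", url=TERMITE_URL):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request; requires a running Termite instance."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)

inputs = [
    {"type": "text", "text": "a cat sitting on a windowsill"},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}},
]
request = build_embed_request(inputs)
# result = send(request)  # uncomment once `termite run` is serving on port 8082
```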

	## Training Details

	### Audio Projection

	The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure:

	1. Load audio-caption pairs from [OpenSound/AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps)
	2. Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
	3. Encode captions through CLIP: text encoder → text_projection → L2 normalize
	4. Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)

	The contrastive loss pulls matching audio-text pairs together and pushes non-matching pairs within each batch apart, which preserves content discrimination.
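
To make the objective concrete, here is a minimal pure-Python sketch of a symmetric InfoNCE loss over one batch. It assumes the audio embeddings have already been passed through the trainable projection and L2-normalized, and that row `i` of `audio` matches row `i` of `text`. The function names are illustrative and this is not the actual training code.

```python
import math

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(audio, text, temperature=0.07):
    """Average of audio-to-text and text-to-audio contrastive losses."""
    logits = [
        [sum(a * t for a, t in zip(av, tv)) / temperature for tv in text]
        for av in audio
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy unit vectors: aligned pairs give a near-zero loss, swapped pairs a large one.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_info_nce(audio, [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch acts as a negative, which is why the batch size directly controls how many non-matching pairs each sample is contrasted against.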

	### Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Training dataset | OpenSound/AudioCaps |
	| Samples | 5000 audio-caption pairs |
	| Epochs | 20 |
	| Batch size | 256 |
	| Learning rate | 1e-3 |
	| Optimizer | Adam |
	| Loss | Symmetric InfoNCE (temperature=0.07) |
	| Train/val split | 90/10 |
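
As a rough sanity check on scale, the values in the table imply a short training run. The arithmetic below assumes the 90/10 split is applied to the 5000 pairs and that the final partial batch of each epoch is kept. Neither detail is stated above, so treat the derived numbers as estimates.

```python
import math

samples = 5000
train_percent = 90
batch_size = 256
epochs = 20

train_samples = samples * train_percent // 100          # pairs used for training
val_samples = samples - train_samples                   # pairs held out for validation
steps_per_epoch = math.ceil(train_samples / batch_size) # assumes the last partial batch is kept
total_steps = steps_per_epoch * epochs
negatives_per_sample = batch_size - 1                   # in-batch negatives in a full batch
```

Under these assumptions the projection sees a few hundred optimizer steps in total, which is plausible for a single 512→512 linear layer.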

	### Source Models

	| Component | Model |
	|-----------|-------|
	| CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
	| CLAP | [laion/larger_clap_music_and_speech](https://huggingface.co/laion/larger_clap_music_and_speech) |

	## ONNX Files

	| File | Description | Size |
	|------|-------------|------|
	| `text_model.onnx` | CLIP text encoder | ~254 MB |
	| `visual_model.onnx` | CLIP visual encoder | ~330 MB |
	| `text_projection.onnx` | CLIP text projection (512→512) | ~4 KB |
	| `visual_projection.onnx` | CLIP visual projection (768→512) | ~6 KB |
	| `audio_model.onnx` | CLAP HTSAT audio encoder | ~590 MB |
	| `audio_projection.onnx` | Combined CLAP→CLIP projection (1024→512) | ~8 KB |

	Additional files: `clip_config.json`, `tokenizer.json`, `preprocessor_config.json`, `projection_training_metadata.json`.

	## Limitations

	- Audio duration: Audio is truncated to ~10 seconds (inherited from CLAP).
	- Language: Primarily English text support.
	- Audio-visual alignment: The projection is trained via caption similarity (audio↔text↔image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
	- CLIP limitations: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts.
	- Training data: The audio projection is trained on AudioCaps, which covers common environmental sounds, so it may underperform on niche audio domains.
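
One possible way to work around the duration limit is to split long recordings into windows and embed each window separately. The sketch below is an illustration, not something the model or Termite does for you. It assumes a 48 kHz sample rate (CLAP's usual input rate, taken here as an assumption), and both `truncate_audio` and `split_into_windows` are hypothetical helpers. Which part of a long clip the model actually keeps is not stated above.

```python
def truncate_audio(samples, sample_rate=48_000, max_seconds=10.0):
    """Keep only the first max_seconds of audio (assumed truncation behaviour)."""
    return samples[: int(sample_rate * max_seconds)]

def split_into_windows(samples, sample_rate=48_000, max_seconds=10.0):
    """Split audio into consecutive windows of at most max_seconds each."""
    window = int(sample_rate * max_seconds)
    return [samples[i : i + window] for i in range(0, len(samples), window)]

# Toy example: 25 "samples" at 1 Hz behave like a 25-second clip.
clip = list(range(25))
head = truncate_audio(clip, sample_rate=1)
windows = split_into_windows(clip, sample_rate=1)
window_lengths = [len(w) for w in windows]
```

Each window could then be embedded on its own and the results indexed as separate entries, so that later parts of a long recording remain searchable.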

	## Citation

	If you use CLIPCLAP, please cite the underlying models:

	```bibtex
	@inproceedings{radford2021clip,
	title={Learning Transferable Visual Models From Natural Language Supervision},
	author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
	booktitle={ICML},
	year={2021}
	}

	@inproceedings{wu2023clap,
	title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
	author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
	booktitle={ICASSP},
	year={2023}
	}
	```