CarpenterAnt91 committed on
Commit 2530c24 · verified · 1 Parent(s): a319dc7

Create README.md

Files changed (1)
  1. README.md +139 -0
README.md ADDED
@@ -0,0 +1,139 @@
+ ---
+ license: mit
+ library_name: onnxruntime
+ tags:
+ - onnx
+ - multimodal
+ - clip
+ - clap
+ - audio
+ - image
+ - text
+ - embeddings
+ - feature-extraction
+ - antfly
+ - termite
+ pipeline_tag: feature-extraction
+ datasets:
+ - OpenSound/AudioCaps
+ ---
+
+ # CLIPCLAP: Unified Text + Image + Audio Embeddings
+
+ CLIPCLAP is a unified multimodal embedding model that maps **text**, **images**, and **audio** into a shared 512-dimensional vector space. It combines OpenAI's [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) (text + image) with LAION's [CLAP](https://huggingface.co/laion/larger_clap_music_and_speech) (audio) through a trained linear projection.
+
+ Built by [antflydb](https://github.com/antflydb) for use with [Termite](https://github.com/antflydb/antfly/tree/main/termite), a standalone ML inference service for embeddings, chunking, and reranking.
+
+ ## Architecture
+
+ ```
+ Text  ──→ CLIP text encoder   ──→ text_projection   ──→ 512-dim (CLIP space)
+ Image ──→ CLIP visual encoder ──→ visual_projection ──→ 512-dim (CLIP space)
+ Audio ──→ CLAP audio encoder  ──→ audio_projection  ──→ 512-dim (CLIP space)
+ ```
+
+ - **Text & Image**: Standard CLIP ViT-B/32 encoders and projections (unchanged from `openai/clip-vit-base-patch32`).
+ - **Audio**: CLAP HTSAT audio encoder from `laion/larger_clap_music_and_speech`. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.
+
+ All three modalities produce **512-dimensional L2-normalized embeddings** that are directly comparable via cosine similarity.
+
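Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. The sketch below illustrates this with NumPy on random stand-in vectors (not real model output); the helper names `l2_normalize` and `rank_by_cosine` are hypothetical and not part of Termite or this repository.

```python
import numpy as np

EMBED_DIM = 512  # embedding size stated in this README


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit length."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def rank_by_cosine(query: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Return row indices of `index`, most similar to `query` first."""
    # After normalization, the dot product equals cosine similarity.
    scores = l2_normalize(index) @ l2_normalize(query)
    return np.argsort(-scores)


# Random stand-ins for embeddings of mixed modalities (hypothetical data).
rng = np.random.default_rng(0)
index = rng.normal(size=(4, EMBED_DIM))  # e.g. two images and two audio clips
query = index[2] + 0.01 * rng.normal(size=EMBED_DIM)  # near-duplicate of row 2

order = rank_by_cosine(query, index)
print(order[0])  # index of the closest item
```

The same ranking function works regardless of which modality produced the query or the indexed rows, since every vector lives in the same space.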
+ ## Intended Uses
+
+ - Multimodal search (text↔image↔audio)
+ - Building unified media indexes with [Antfly](https://github.com/antflydb/antfly)
+ - Cross-modal retrieval (find images from audio queries, audio from text, etc.)
+ - Audio-visual content discovery
+
+ ## How to Use with Termite
+
+ ```bash
+ # Pull and run the model
+ termite pull clipclap
+ termite run
+
+ # Embed text, an image, and an audio clip in one request
+ curl -X POST http://localhost:8082/embed \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "clipclap",
+     "input": [
+       {"type": "text", "text": "a cat sitting on a windowsill"},
+       {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
+       {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
+     ]
+   }'
+ ```
+
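The same request can be sent from Python. This is a minimal sketch using only the standard library; it mirrors the payload from the curl example and returns the JSON response untouched, because the response schema is not described in this README. The function names `build_embed_request` and `embed` are hypothetical helpers, not a Termite client API.

```python
import json
import urllib.request

TERMITE_URL = "http://localhost:8082/embed"  # endpoint from the curl example


def build_embed_request(items, model="clipclap"):
    """Build a POST request carrying the same JSON body as the curl call."""
    body = json.dumps({"model": model, "input": items}).encode("utf-8")
    return urllib.request.Request(
        TERMITE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(items, model="clipclap", timeout=30.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_embed_request(items, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


items = [
    {"type": "text", "text": "a cat sitting on a windowsill"},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}},
]
request = build_embed_request(items)
# result = embed(items)  # needs a running Termite instance on localhost:8082
```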
+ ## Training Details
+
+ ### Audio Projection
+
+ The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure:
+
+ 1. Load audio-caption pairs from [OpenSound/AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps)
+ 2. Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
+ 3. Encode captions through CLIP: text encoder → text_projection → L2 normalize
+ 4. Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)
+
+ The contrastive loss pulls matching audio-text pairs together while pushing non-matching pairs apart within each batch, preserving content discrimination.
+
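To make the loss concrete, here is a NumPy sketch of a symmetric InfoNCE objective over one batch. It is an illustration written for this README, not the training script: it computes only the loss on already-projected, L2-normalized embeddings and leaves out the trainable 512→512 layer and the optimizer. The function names are hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Inputs are (batch, dim) and L2-normalized; row i of each is a matching pair."""
    logits = audio_emb @ text_emb.T / temperature  # pairwise cosine / temperature
    diag = np.arange(logits.shape[0])  # positions of the matching pairs
    audio_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    text_to_audio = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (audio_to_text + text_to_audio)


# Random unit vectors as stand-ins for caption embeddings (hypothetical data).
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 512))
text /= np.linalg.norm(text, axis=1, keepdims=True)

aligned_loss = symmetric_info_nce(text, text)  # perfectly matched pairs
shuffled_loss = symmetric_info_nce(np.roll(text, 1, axis=0), text)  # mismatched pairs
print(aligned_loss < shuffled_loss)
```

Every other item in the batch acts as a negative for a given pair, which is why the loss is low when row i of the audio batch matches row i of the text batch and high when the rows are misaligned.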
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Training dataset | OpenSound/AudioCaps |
+ | Samples | 5000 audio-caption pairs |
+ | Epochs | 20 |
+ | Batch size | 256 |
+ | Learning rate | 1e-3 |
+ | Optimizer | Adam |
+ | Loss | Symmetric InfoNCE (temperature=0.07) |
+ | Train/val split | 90/10 |
+
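For a sense of scale, the table implies a short training run. The arithmetic below assumes the 90/10 split is applied to the 5000 pairs and that the final partial batch of each epoch is kept; the README states neither, so treat the step counts as estimates.

```python
import math

samples = 5000
batch_size = 256
epochs = 20

train_size = samples * 90 // 100  # 90% of the pairs
val_size = samples - train_size  # remaining 10%

# Assumption: the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
last_batch_size = train_size - (steps_per_epoch - 1) * batch_size

# In-batch contrastive loss: every other item in a full batch is a negative.
negatives_per_pair = batch_size - 1

print(train_size, val_size, steps_per_epoch, total_steps)  # 4500 500 18 360
```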
+ ### Source Models
+
+ | Component | Model |
+ |-----------|-------|
+ | CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
+ | CLAP | [laion/larger_clap_music_and_speech](https://huggingface.co/laion/larger_clap_music_and_speech) |
+
+ ## ONNX Files
+
+ | File | Description | Size |
+ |------|-------------|------|
+ | `text_model.onnx` | CLIP text encoder | ~254 MB |
+ | `visual_model.onnx` | CLIP visual encoder | ~330 MB |
+ | `text_projection.onnx` | CLIP text projection (512→512) | ~4 KB |
+ | `visual_projection.onnx` | CLIP visual projection (768→512) | ~6 KB |
+ | `audio_model.onnx` | CLAP HTSAT audio encoder | ~590 MB |
+ | `audio_projection.onnx` | Combined CLAP→CLIP projection (1024→512) | ~8 KB |
+
+ Additional files: `clip_config.json`, `tokenizer.json`, `preprocessor_config.json`, `projection_training_metadata.json`.
+
+ ## Limitations
+
+ - **Audio duration**: Audio is truncated to ~10 seconds (inherited from CLAP)
+ - **Language**: Primarily English text support
+ - **Audio-visual alignment**: The projection is trained via caption similarity (audio↔text↔image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
+ - **CLIP limitations**: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts
+ - **Training data**: The audio projection is trained on AudioCaps, which covers common environmental sounds, so it may underperform on niche audio domains
+
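The audio duration limit means that anything past roughly the first 10 seconds does not influence the embedding. The sketch below shows the effect on a raw waveform. Two details are assumptions, not taken from this README: the 48 kHz sample rate, and that the start of the clip is the part that is kept (the actual preprocessing may select a different window). `truncate_audio` is a hypothetical helper.

```python
import numpy as np

SAMPLE_RATE_HZ = 48_000  # assumption: not stated in this README
MAX_SECONDS = 10.0  # approximate limit from the list above


def truncate_audio(waveform, sample_rate=SAMPLE_RATE_HZ, max_seconds=MAX_SECONDS):
    """Keep at most `max_seconds` of audio from the start of the clip."""
    max_samples = int(sample_rate * max_seconds)
    return waveform[..., :max_samples]


long_clip = np.zeros(SAMPLE_RATE_HZ * 25, dtype=np.float32)  # 25 s of silence
kept = truncate_audio(long_clip)
discarded_seconds = (long_clip.shape[-1] - kept.shape[-1]) / SAMPLE_RATE_HZ

short_clip = np.zeros(SAMPLE_RATE_HZ * 3, dtype=np.float32)  # 3 s clip
unchanged = truncate_audio(short_clip)
```

For longer recordings, callers who care about later content would need to split the audio themselves before embedding.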
+ ## Citation
+
+ If you use CLIPCLAP, please cite the underlying models:
+
+ ```bibtex
+ @inproceedings{radford2021clip,
+   title={Learning Transferable Visual Models From Natural Language Supervision},
+   author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
+   booktitle={ICML},
+   year={2021}
+ }
+
+ @inproceedings{wu2023clap,
+   title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
+   author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
+   booktitle={ICASSP},
+   year={2023}
+ }
+ ```