HuaminChen committed on
Commit b4136e7 · verified · 1 Parent(s): e6b2249

Fix Whisper-tiny encoder param count (8M, not 39M)
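The corrected figure is easy to sanity-check: OpenAI reports ~39M parameters for the *full* Whisper-tiny model, but the encoder alone is much smaller. Below is a back-of-envelope count from the published Whisper-tiny hyperparameters (d_model=384, 4 encoder blocks, 6 heads, FFN width 4·d). It is a rough sketch, not a checkpoint inspection: sinusoidal position embeddings and minor buffers are not counted, so the exact number may differ slightly from `model.encoder.parameters()`.

```python
# Back-of-envelope parameter count for the Whisper-tiny encoder,
# showing why ~8M (not 39M, the full model size) is the right figure.
d = 384        # model width (Whisper-tiny)
layers = 4     # encoder blocks
ffn = 4 * d    # feed-forward width (1536)
n_mels = 80    # mel-spectrogram input channels
kernel = 3     # stem conv kernel size

# Convolutional stem: two 1-D convs (mels -> d, then d -> d), with biases.
stem = (n_mels * d * kernel + d) + (d * d * kernel + d)

# Per encoder block: self-attention (q, k, v, out projections; the k
# projection has no bias in Whisper), a 2-layer MLP, and two layer norms.
attn = 4 * d * d + 3 * d
mlp = (d * ffn + ffn) + (ffn * d + d)
norms = 2 * (2 * d)
block = attn + mlp + norms

total = stem + layers * block + 2 * d  # + final layer norm
print(f"~{total / 1e6:.1f}M encoder params")
```

The result lands between 7M and 8M; adding the 1500×384 position-embedding table (counted as parameters in some implementations) brings it to roughly 8.2M, so "8M" is the honest round number either way.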

Files changed (1):
  1. README.md +2 -2
README.md CHANGED
@@ -58,7 +58,7 @@ A compact multimodal embedding model that unifies text, image, and audio represe
 
 - **Text encoding** via MiniLM-L6-v2 (22M params)
 - **Image encoding** via SigLIP-base-patch16-512 (86M params)
-- **Audio encoding** via Whisper-tiny encoder (39M params)
+- **Audio encoding** via Whisper-tiny encoder (8M params)
 - **Cross-modal fusion** via 2-layer transformer attention
 - **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
 - **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
@@ -196,7 +196,7 @@ emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1) # 6x faster retrieval
 ├────────────────────────────────────────────────────────────┤
 │ Text Encoder:  MiniLM-L6-v2 (22M params, 6 layers)         │
 │ Image Encoder: SigLIP-base-patch16-512 (86M params)        │
-│ Audio Encoder: Whisper-tiny encoder (39M params, 4 layers) │
+│ Audio Encoder: Whisper-tiny encoder (8M params, 4 layers)  │
 │ Fusion: 2-layer Transformer                                │
 ├────────────────────────────────────────────────────────────┤
 │ Output: 384-dim normalized embeddings                      │
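The `emb_64 = F.normalize(full_emb[:, :64], ...)` context line in the second hunk is the README's Matryoshka (MRL) usage pattern. A minimal self-contained sketch of that step, using random data in place of real model output (the batch size and contents here are illustrative only):

```python
import torch
import torch.nn.functional as F

# Matryoshka-style truncation: keep the first 64 of 384 dimensions,
# then re-normalize so cosine similarity remains meaningful at the
# smaller size. full_emb stands in for the model's output embeddings.
full_emb = F.normalize(torch.randn(8, 384), p=2, dim=-1)

emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
print(emb_64.shape)  # torch.Size([8, 64])
```

Re-normalizing after the slice is the important part: the leading 64 dimensions of a unit vector are not themselves unit-length, so skipping the second `F.normalize` would skew cosine scores.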