Initial upload: Gemma 4 audio encoder (304.8M USM-style Conformer)
- .gitattributes +1 -0
- README.md +33 -0
- gemma4_speaker_similarity.png +0 -0
- gemma4_tsne_speakers.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+gemma4_tsne_speakers.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -165,6 +165,39 @@ Gemma 4's E2B is a MatFormer sub-model nested inside E4B. The MatFormer architec
 - **Causal chunked attention:** The encoder uses right_context=0, meaning it cannot look ahead. This limits its use in offline/non-streaming settings compared to bidirectional encoders.
 - **Multi-layer fusion doesn't help:** Unlike wav2vec2/W2v-BERT where combining multiple hidden layers improves downstream performance, this encoder's Macaron half-step residuals and causal attention mean only the final layer output is useful.
 - **Subsampling frontend uses ReLU + LayerNorm** (not SiLU + GroupNorm as in some USM descriptions).
+- **Not a speaker encoder:** While embeddings show some speaker separation (cosine similarity gap of ~0.03), this model was not trained for speaker verification. Dedicated speaker models will significantly outperform it on speaker tasks.
+
+## Benchmark Results (frozen 1024-dim embeddings, linear probe)
+
+### Speech Commands Classification (35 classes)
+
+| Metric | Value |
+|---|---|
+| Linear probe accuracy | **72.0%** |
+| Random baseline | 2.9% |
+| Improvement over chance | **25×** |
+| Dataset | Google Speech Commands v0.02 (validation) |
+| Probe | Logistic regression on L2-normalized mean-pooled embeddings |
+
+The encoder captures rich phonetic and semantic content — strong on acoustically distinct words (seven: 0.93 F1, house/stop/eight: 0.89 F1) and weaker on similar-sounding pairs (three/tree).
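The probe recipe from the table (mean-pool over time, L2-normalize, fit a logistic regression) can be sketched as follows. The data here is a random stand-in for real encoder outputs, and all names and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def clip_embedding(frames: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame features over time, then L2-normalize.

    `frames` stands in for the frozen encoder's (num_frames, 1024)
    output for one clip; real features would come from the model.
    """
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy stand-in data: 200 clips, 35 keyword classes.
X = np.stack([clip_embedding(rng.normal(size=(50, 1024))) for _ in range(200)])
y = rng.integers(0, 35, size=200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)  # accuracy on toy data, not the reported 72.0%
```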
+
+### Speaker Similarity (LibriSpeech test-clean)
+
+| Metric | Value |
+|---|---|
+| Same-speaker cosine similarity | 0.656 ± 0.147 |
+| Different-speaker cosine similarity | 0.622 ± 0.132 |
+| Separation gap | 0.034 |
+
+Modest speaker separation — expected since this is an ASR-oriented encoder, not a speaker verification model.
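The same-/different-speaker comparison can be reproduced roughly as below, with random placeholder vectors standing in for real clip embeddings. The exhaustive pairing scheme is an assumption; the actual evaluation may have sampled pairs differently.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-in: 5 speakers with 4 clip embeddings each.
rng = np.random.default_rng(0)
emb = {spk: [rng.normal(size=1024) for _ in range(4)] for spk in range(5)}

same, diff = [], []
speakers = list(emb)
for s in speakers:  # all within-speaker clip pairs
    clips = emb[s]
    for i in range(len(clips)):
        for j in range(i + 1, len(clips)):
            same.append(cosine(clips[i], clips[j]))
for a in range(len(speakers)):  # one cross-speaker pair per speaker pair
    for b in range(a + 1, len(speakers)):
        diff.append(cosine(emb[speakers[a]][0], emb[speakers[b]][0]))

gap = float(np.mean(same) - np.mean(diff))  # the "separation gap"
```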
+
+![Same- vs. different-speaker cosine similarity](gemma4_speaker_similarity.png)
+
+### t-SNE Speaker Clustering
+
+![t-SNE of clip embeddings colored by speaker](gemma4_tsne_speakers.png)
+
+Embeddings show partial speaker clustering — the encoder captures speaker characteristics as a byproduct of ASR training, but is not optimized for speaker discrimination.
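A projection like the one in gemma4_tsne_speakers.png can be generated with scikit-learn's t-SNE. The data below is a random stand-in for real clip embeddings, and the perplexity choice is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Illustrative stand-in: 60 clips from 6 speakers, 1024-dim embeddings.
X = rng.normal(size=(60, 1024))
speaker_ids = np.repeat(np.arange(6), 10)

# Perplexity must be smaller than the number of points; 15 suits 60 clips.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
# Scatter xy[:, 0] vs. xy[:, 1] colored by speaker_ids to get the plot.
```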
 
 ## Extraction Details
 

gemma4_speaker_similarity.png ADDED
gemma4_tsne_speakers.png ADDED