Initial upload: Gemma 4 audio encoder (304.8M USM-style Conformer)
- .gitattributes +1 -0
- README.md +33 -0
- gemma4_speaker_similarity.png +0 -0
- gemma4_tsne_speakers.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+gemma4_tsne_speakers.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -165,6 +165,39 @@ Gemma 4's E2B is a MatFormer sub-model nested inside E4B. The MatFormer architec
 - **Causal chunked attention:** The encoder uses right_context=0, meaning it cannot look ahead. This limits its use in offline/non-streaming settings compared to bidirectional encoders.
 - **Multi-layer fusion doesn't help:** Unlike wav2vec2/W2v-BERT where combining multiple hidden layers improves downstream performance, this encoder's Macaron half-step residuals and causal attention mean only the final layer output is useful.
 - **Subsampling frontend uses ReLU + LayerNorm** (not SiLU + GroupNorm as in some USM descriptions).
+- **Not a speaker encoder:** While embeddings show some speaker separation (cosine similarity gap of ~0.03), this model was not trained for speaker verification. Dedicated speaker models will significantly outperform it on speaker tasks.
+
+## Benchmark Results (frozen 1024-dim embeddings, linear probe)
+
+### Speech Commands Classification (35 classes)
+
+| Metric | Value |
+|---|---|
+| Linear probe accuracy | **72.0%** |
+| Random baseline | 2.9% |
+| Improvement over chance | **25×** |
+| Dataset | Google Speech Commands v0.02 (validation) |
+| Probe | Logistic regression on L2-normalized mean-pooled embeddings |
+
+The encoder captures rich phonetic and semantic content — strong on acoustically distinct words (seven: 0.93 F1, house/stop/eight: 0.89 F1) and weaker on similar-sounding pairs (three/tree).
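The probe recipe from the table (mean-pool over time, L2-normalize, fit a logistic regression) can be sketched as follows. The data here is a random stand-in for real encoder outputs, and all names and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def clip_embedding(frames: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame features over time, then L2-normalize.

    `frames` stands in for the frozen encoder's (num_frames, 1024)
    output for one clip; real features would come from the model.
    """
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy stand-in data: 200 clips, 35 keyword classes.
X = np.stack([clip_embedding(rng.normal(size=(50, 1024))) for _ in range(200)])
y = rng.integers(0, 35, size=200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)  # accuracy on toy data, not the reported 72.0%
```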
+
+### Speaker Similarity (LibriSpeech test-clean)
+
+| Metric | Value |
+|---|---|
+| Same-speaker cosine similarity | 0.656 ± 0.147 |
+| Different-speaker cosine similarity | 0.622 ± 0.132 |
+| Separation gap | 0.034 |
+
+Modest speaker separation — expected since this is an ASR-oriented encoder, not a speaker verification model.
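The same-/different-speaker comparison can be reproduced roughly as below, with random placeholder vectors standing in for real clip embeddings. The exhaustive pairing scheme is an assumption; the actual evaluation may have sampled pairs differently.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-in: 5 speakers with 4 clip embeddings each.
rng = np.random.default_rng(0)
emb = {spk: [rng.normal(size=1024) for _ in range(4)] for spk in range(5)}

same, diff = [], []
speakers = list(emb)
for s in speakers:  # all within-speaker clip pairs
    clips = emb[s]
    for i in range(len(clips)):
        for j in range(i + 1, len(clips)):
            same.append(cosine(clips[i], clips[j]))
for a in range(len(speakers)):  # one cross-speaker pair per speaker pair
    for b in range(a + 1, len(speakers)):
        diff.append(cosine(emb[speakers[a]][0], emb[speakers[b]][0]))

gap = float(np.mean(same) - np.mean(diff))  # the "separation gap"
```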
+
+![Same- vs. different-speaker cosine similarity](gemma4_speaker_similarity.png)
+
+### t-SNE Speaker Clustering
+
+![t-SNE of clip embeddings colored by speaker](gemma4_tsne_speakers.png)
+
+Embeddings show partial speaker clustering — the encoder captures speaker characteristics as a byproduct of ASR training, but is not optimized for speaker discrimination.
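A projection like the one in gemma4_tsne_speakers.png can be generated with scikit-learn's t-SNE. The data below is a random stand-in for real clip embeddings, and the perplexity choice is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Illustrative stand-in: 60 clips from 6 speakers, 1024-dim embeddings.
X = rng.normal(size=(60, 1024))
speaker_ids = np.repeat(np.arange(6), 10)

# Perplexity must be smaller than the number of points; 15 suits 60 clips.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
# Scatter xy[:, 0] vs. xy[:, 1] colored by speaker_ids to get the plot.
```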
 
 ## Extraction Details
 

gemma4_speaker_similarity.png ADDED
gemma4_tsne_speakers.png ADDED