rnagabh committed
Commit 44b066b · verified · 1 Parent(s): ee814f0

Initial upload: Gemma 4 audio encoder (304.8M USM-style Conformer)
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ gemma4_tsne_speakers.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -165,6 +165,39 @@ Gemma 4's E2B is a MatFormer sub-model nested inside E4B. The MatFormer architec
  - **Causal chunked attention:** The encoder uses right_context=0, meaning it cannot look ahead. This limits its use in offline/non-streaming settings compared to bidirectional encoders.
  - **Multi-layer fusion doesn't help:** Unlike wav2vec2/W2v-BERT, where combining multiple hidden layers improves downstream performance, this encoder's Macaron half-step residuals and causal attention mean only the final layer output is useful.
  - **Subsampling frontend uses ReLU + LayerNorm** (not SiLU + GroupNorm as in some USM descriptions).
+ - **Not a speaker encoder:** While the embeddings show some speaker separation (a cosine-similarity gap of ~0.03), this model was not trained for speaker verification. Dedicated speaker models will significantly outperform it on speaker tasks.
+
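The "causal chunked attention" limitation above can be illustrated with a simple attention mask. This is a hypothetical sketch of one chunk-local causal scheme with right_context=0, not the encoder's exact attention pattern (real USM-style encoders may also allow some left context across chunks):

```python
import numpy as np

def chunked_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Boolean mask where True means 'query frame may attend to key frame'.

    Illustrative only: each frame attends within its own chunk, and never
    to later frames (right_context=0, i.e. no lookahead).
    """
    frame = np.arange(num_frames)
    chunk = frame // chunk_size
    same_chunk = chunk[:, None] == chunk[None, :]   # restrict to current chunk
    no_lookahead = frame[:, None] >= frame[None, :] # never attend forward
    return same_chunk & no_lookahead

mask = chunked_causal_mask(num_frames=6, chunk_size=3)
```

With chunk_size=3, frame 4 can attend to frames 3 and 4 but not to frame 5 (no lookahead) and, in this simplified sketch, not to frames 0-2 in the previous chunk.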
+ ## Benchmark Results (frozen 1024-dim embeddings, linear probe)
+
+ ### Speech Commands Classification (35 classes)
+
+ | Metric | Value |
+ |---|---|
+ | Linear probe accuracy | **72.0%** |
+ | Random baseline | 2.9% |
+ | Improvement over chance | **25×** |
+ | Dataset | Google Speech Commands v0.02 (validation) |
+ | Probe | Logistic regression on L2-normalized mean-pooled embeddings |
+
+ The encoder captures rich phonetic and semantic content — strong on acoustically distinct words (seven: 0.93 F1; house/stop/eight: 0.89 F1) and weaker on similar-sounding pairs (three/tree).
+
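The probe recipe in the table (logistic regression on L2-normalized, mean-pooled frame embeddings) can be sketched as follows. The random arrays are stand-ins for real frozen encoder outputs and Speech Commands labels; all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs: (clips, frames, 1024) frame
# embeddings plus a 35-class keyword label per clip.
num_clips, num_frames, dim, num_classes = 200, 50, 1024, 35
frame_emb = rng.normal(size=(num_clips, num_frames, dim))
labels = rng.integers(0, num_classes, size=num_clips)

# Mean-pool over time, then L2-normalize each clip embedding.
clip_emb = frame_emb.mean(axis=1)
clip_emb /= np.linalg.norm(clip_emb, axis=1, keepdims=True)

# Linear probe: multinomial logistic regression on the frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(clip_emb, labels)
accuracy = probe.score(clip_emb, labels)
```

With the real encoder, `frame_emb` would come from a forward pass over each clip; only the probe's weights are trained.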
+ ### Speaker Similarity (LibriSpeech test-clean)
+
+ | Metric | Value |
+ |---|---|
+ | Same-speaker cosine similarity | 0.656 ± 0.147 |
+ | Different-speaker cosine similarity | 0.622 ± 0.132 |
+ | Separation gap | 0.034 |
+
+ Modest speaker separation — expected, since this is an ASR-oriented encoder, not a speaker verification model.
+
+ ![Speaker Similarity Distribution](gemma4_speaker_similarity.png)
+
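The same-speaker vs. different-speaker comparison above boils down to plain cosine similarity between utterance embeddings. A minimal sketch, with synthetic embeddings built around hypothetical per-speaker centroids standing in for real LibriSpeech utterances:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)

# Toy stand-ins: two utterances per "speaker", built around shared
# speaker centroids so same-speaker pairs land closer together.
centroids = rng.normal(size=(10, 1024))
utt_a = centroids + 0.5 * rng.normal(size=(10, 1024))
utt_b = centroids + 0.5 * rng.normal(size=(10, 1024))

same = [cosine_similarity(utt_a[i], utt_b[i]) for i in range(10)]
diff = [cosine_similarity(utt_a[i], utt_b[(i + 1) % 10]) for i in range(10)]

# Analogous to the 0.034 separation gap reported in the table above.
gap = float(np.mean(same) - np.mean(diff))
```

On real data the gap is small (0.034) because ASR training only incidentally preserves speaker identity.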
+ ### t-SNE Speaker Clustering
+
+ ![t-SNE Speaker Embeddings](gemma4_tsne_speakers.png)
+
+ Embeddings show partial speaker clustering — the encoder captures speaker characteristics as a byproduct of ASR training, but is not optimized for speaker discrimination.
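The clustering view above comes from a standard 2-D t-SNE projection of the utterance embeddings. A minimal sketch, with synthetic per-speaker clusters (hypothetical sizes and noise scale) standing in for real encoder outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)

# Synthetic stand-in: 8 "speakers" x 20 utterances of 1024-dim embeddings,
# clustered around per-speaker centroids.
speakers, utts, dim = 8, 20, 1024
centroids = rng.normal(size=(speakers, 1, dim))
embeddings = (centroids + 0.3 * rng.normal(size=(speakers, utts, dim))).reshape(-1, dim)
speaker_ids = np.repeat(np.arange(speakers), utts)

# perplexity must be smaller than the number of samples (160 here).
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(embeddings)  # (160, 2) points, colored by speaker_ids when plotted
```

Scatter-plotting `coords` colored by `speaker_ids` reproduces the kind of partial clustering shown in the figure.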
 
  ## Extraction Details
 
gemma4_speaker_similarity.png ADDED
gemma4_tsne_speakers.png ADDED

Git LFS Details

  • SHA256: 95f60ef8f3570e9a0a27039c0a5d8851393ca053b1748f8cfe8ee6f86340a5d6
  • Pointer size: 131 Bytes
  • Size of remote file: 193 kB