Commit ee814f0 (verified) by rnagabh · 1 parent: f0da7f6

Initial upload: Gemma 4 audio encoder (304.8M USM-style Conformer)

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -98,12 +98,15 @@ with torch.no_grad():
 
  # Option 2: Pure audio embeddings (1024-dim) — conformer output before projection
  # Recommended for downstream audio tasks (classification, verification, etc.)
- mel = inputs["input_features"].to(dtype=torch.bfloat16, device="cuda")
- hidden = mel
- hidden = audio_tower.subsample_conv_projection(hidden)
- for layer in audio_tower.layers:
-     hidden = layer(hidden)
- audio_embeddings = hidden # (1, 100, 1024)
+ # Use a forward hook to capture the 1024-dim input to output_proj
+ pre_proj_features = {}
+ def hook_fn(module, input, output):
+     pre_proj_features["hidden"] = input[0]
+
+ handle = audio_tower.output_proj.register_forward_hook(hook_fn)
+ _ = audio_tower(inputs["input_features"].to(dtype=torch.bfloat16, device="cuda"))
+ handle.remove()
+ audio_embeddings = pre_proj_features["hidden"] # (1, 100, 1024)
  ```
 
  > **Which to use?** For audio-only tasks (classification, speaker verification, deepfake detection),
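
The hook pattern introduced by this change can be tried in isolation on a toy module. The following is a minimal sketch under stated assumptions, not the model's actual architecture: `ToyTower`, its layer sizes (other than the 1024-dim hidden width mentioned in the README), and the helper `capture_pre_proj` are hypothetical names made up for illustration. Only `register_forward_hook` and the returned handle's `remove()` are real PyTorch APIs.

```python
import torch
from torch import nn


class ToyTower(nn.Module):
    """Hypothetical stand-in for an encoder that ends in an `output_proj` layer."""

    def __init__(self, in_dim=128, hidden_dim=1024, out_dim=256):
        super().__init__()
        # `body` stands in for everything that runs before the projection.
        self.body = nn.Linear(in_dim, hidden_dim)
        # out_dim is an arbitrary illustrative size, not the real model's.
        self.output_proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        return self.output_proj(self.body(x))


def capture_pre_proj(tower, features):
    """Run `tower` once and return (input to output_proj, final output)."""
    captured = {}

    def hook_fn(module, args, output):
        # `args` is the tuple of positional inputs passed to output_proj.
        captured["hidden"] = args[0].detach()

    handle = tower.output_proj.register_forward_hook(hook_fn)
    try:
        with torch.no_grad():
            projected = tower(features)
    finally:
        # Always detach the hook so later forward passes are unaffected.
        handle.remove()
    return captured["hidden"], projected


tower = ToyTower().eval()
features = torch.randn(1, 100, 128)  # (batch, frames, feature_dim), toy sizes
hidden, projected = capture_pre_proj(tower, features)
print(tuple(hidden.shape), tuple(projected.shape))
```

Wrapping the forward pass in `try`/`finally` is a design choice of this sketch: the hook is removed even if the forward call raises, so a failed run does not leave a stale hook attached to the module.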