Initial upload: Gemma 4 audio encoder (304.8M USM-style Conformer)
README.md
CHANGED
@@ -98,12 +98,15 @@ with torch.no_grad():
 
 # Option 2: Pure audio embeddings (1024-dim) — conformer output before projection
 # Recommended for downstream audio tasks (classification, verification, etc.)
-
-
-
-
-
-
+# Use a forward hook to capture the 1024-dim input to output_proj
+pre_proj_features = {}
+def hook_fn(module, input, output):
+    pre_proj_features["hidden"] = input[0]
+
+handle = audio_tower.output_proj.register_forward_hook(hook_fn)
+_ = audio_tower(inputs["input_features"].to(dtype=torch.bfloat16, device="cuda"))
+handle.remove()
+audio_embeddings = pre_proj_features["hidden"]  # (1, 100, 1024)
 ```
 
 > **Which to use?** For audio-only tasks (classification, speaker verification, deepfake detection),
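Note on the hook pattern in this change: the added lines capture the tensor that enters `output_proj` rather than the tensor it returns. The sketch below shows the same idea on a toy module so it can be run without the model weights. `ToyAudioTower`, `capture_pre_proj`, the 128-dim input features, and the 1536-dim projection width are hypothetical stand-ins chosen for illustration; only the 1024-dim hidden size and the `output_proj` attribute name come from the README. It also wraps the forward pass in `try`/`finally` so the hook is removed even if the forward pass raises, which is one possible hardening and not necessarily what the README should adopt.

```python
import torch
from torch import nn


class ToyAudioTower(nn.Module):
    """Hypothetical stand-in for the audio tower: an encoder body, then output_proj."""

    def __init__(self, feat_dim: int = 128, hidden_dim: int = 1024, proj_dim: int = 1536):
        super().__init__()
        self.body = nn.Linear(feat_dim, hidden_dim)          # placeholder for the conformer stack
        self.output_proj = nn.Linear(hidden_dim, proj_dim)   # projection applied after the encoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output_proj(self.body(x))


def capture_pre_proj(tower: nn.Module, features: torch.Tensor) -> torch.Tensor:
    """Run one forward pass and return the tensor that was fed into tower.output_proj."""
    captured = {}

    def hook_fn(module, args, output):
        # args is the tuple of positional inputs to output_proj; args[0] is the hidden state.
        captured["hidden"] = args[0].detach()

    handle = tower.output_proj.register_forward_hook(hook_fn)
    try:
        with torch.no_grad():
            tower(features)
    finally:
        # Always detach the hook so later forward passes are unaffected.
        handle.remove()
    return captured["hidden"]


tower = ToyAudioTower().eval()
dummy_features = torch.randn(1, 100, 128)  # (batch, frames, feature dim), made-up sizes
audio_embeddings = capture_pre_proj(tower, dummy_features)
print(audio_embeddings.shape)  # torch.Size([1, 100, 1024])
```

With the real model, the same helper would presumably be called with `audio_tower` and the processor's `input_features` in place of the toy module and random tensor.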