gemma-4-E4B-it-audio-encoder
Audio encoder extracted from google/gemma-4-E4B-it.
Requires transformers>=5.5.0.
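The version requirement can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `release_tuple`, `meets_minimum`, and `check_transformers` are hypothetical and not part of this repository, and pre-release suffixes such as `.dev0` or `rc1` are simply ignored.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "5.5.0"  # minimum version stated in this model card


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '5.5.0.dev0' -> (5, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Compare numeric release components; pre-release tags are not considered."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '6.0' compares like '6.0.0'
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers() -> str:
    """Describe whether the installed transformers satisfies the requirement."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} satisfies >={MIN_TRANSFORMERS}"
    return f"transformers {installed} is older than {MIN_TRANSFORMERS}"


if __name__ == "__main__":
    print(check_transformers())
```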
How to use
from transformers import AutoFeatureExtractor, AutoModel
import librosa
import torch
model_id = "Aratako/gemma-4-E4B-it-audio-encoder"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id, trust_remote_code=True, dtype=torch.bfloat16).cuda()
y, sr = librosa.load("test.mp3", sr=feature_extractor.sampling_rate)
features = feature_extractor([y], return_tensors="pt", sampling_rate=feature_extractor.sampling_rate)
features["input_features"] = features["input_features"].to(device="cuda", dtype=torch.bfloat16)
features["input_features_mask"] = features["input_features_mask"].cuda()
# With projection to LLM embedding space (2560-dim)
output, mask = encoder(**features, project=True)
print(output.shape) # [B, L, 2560]
# Without projection, raw encoder output (1536-dim)
output, mask = encoder(**features, project=False)
print(output.shape) # [B, L, 1536]
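For tasks that need one vector per clip, the frame-level output can be averaged over time while skipping padded frames. The sketch below uses a toy tensor in place of the encoder output, and `masked_mean_pool` is a hypothetical helper, not part of this repository. It assumes a boolean mask that is True for valid frames; this card does not state the polarity of the `mask` returned by the encoder, so check it first and pass `~mask` if True marks padding.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """Average frame embeddings over time, ignoring masked-out frames.

    hidden:     [B, L, D] frame-level embeddings
    valid_mask: [B, L] boolean, True where the frame holds real audio (assumed polarity)
    """
    weights = valid_mask.unsqueeze(-1).to(torch.float32)      # [B, L, 1]
    summed = (hidden.to(torch.float32) * weights).sum(dim=1)  # [B, D]
    counts = weights.sum(dim=1).clamp(min=1.0)                # avoid division by zero
    return summed / counts


# Toy stand-in for the encoder output: 1 clip, 3 frames, 2 dims, last frame padded
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
valid_mask = torch.tensor([[True, True, False]])

pooled = masked_mean_pool(hidden, valid_mask)
print(pooled.shape)  # torch.Size([1, 2])
```

Accumulating in float32 is a deliberate choice here, since summing many bfloat16 frames can lose precision.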
Output modes
| project | Output dim | Description |
|---|---|---|
| True (default) | 2560 | Projected to Gemma 4 E4B LLM embedding space |
| False | 1536 | Raw audio encoder output |
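When wiring the encoder into a larger pipeline, a small guard can catch a mismatched `project` setting early. This is a sketch only: `check_output_shape` and `EXPECTED_DIM` are hypothetical names, and the dimensions are taken from the table above.

```python
# Last-dimension sizes from the table above, keyed by the `project` flag
EXPECTED_DIM = {True: 2560, False: 1536}


def check_output_shape(shape, project: bool = True) -> tuple:
    """Validate a [B, L, D] shape against the expected D for the given mode."""
    shape = tuple(shape)  # also accepts torch.Size, which is a tuple subclass
    if len(shape) != 3:
        raise ValueError(f"expected a 3-D [B, L, D] shape, got {shape}")
    expected = EXPECTED_DIM[bool(project)]
    if shape[-1] != expected:
        raise ValueError(
            f"expected last dim {expected} for project={project}, got {shape[-1]}"
        )
    return shape


print(check_output_shape((1, 50, 2560), project=True))
print(check_output_shape((1, 50, 1536), project=False))
```

With the usage example above, this could be called as `check_output_shape(output.shape, project=True)`.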
Acknowledgements
Inspired by mesolitica/gemma-3n-e4b-it-audio-encoder.