gemma-4-E4B-it-audio-encoder

Audio encoder extracted from google/gemma-4-E4B-it.

Requires transformers>=5.5.0.
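
A quick way to confirm the version requirement before loading the encoder is sketched below. The helper name `meets_minimum` is hypothetical (not part of transformers), and the sketch assumes the `packaging` library is installed, which is usually the case alongside transformers.

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

MIN_TRANSFORMERS = "5.5.0"  # minimum version stated in this model card


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if `installed` is at least `minimum` (hypothetical helper)."""
    return Version(installed) >= Version(minimum)


try:
    installed = version("transformers")
    if not meets_minimum(installed):
        print(f"transformers {installed} is older than {MIN_TRANSFORMERS}; upgrade first")
except PackageNotFoundError:
    print("transformers is not installed")
```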

How to use

from transformers import AutoFeatureExtractor, AutoModel
import librosa
import torch

model_id = "Aratako/gemma-4-E4B-it-audio-encoder"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
# The encoder class ships with the repository, so trust_remote_code=True is required.
encoder = AutoModel.from_pretrained(model_id, trust_remote_code=True, dtype=torch.bfloat16).cuda()

# Resample to the rate the feature extractor expects while loading.
y, sr = librosa.load("test.mp3", sr=feature_extractor.sampling_rate)
features = feature_extractor([y], return_tensors="pt", sampling_rate=feature_extractor.sampling_rate)
# Move inputs to the GPU; the features must match the encoder's dtype.
features["input_features"] = features["input_features"].to(device="cuda", dtype=torch.bfloat16)
features["input_features_mask"] = features["input_features_mask"].cuda()

# With projection to LLM embedding space (2560-dim)
output, mask = encoder(**features, project=True)
print(output.shape)  # [B, L, 2560]

# Without projection, raw encoder output (1536-dim)
output, mask = encoder(**features, project=False)
print(output.shape)  # [B, L, 1536]
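
If a single vector per clip is needed (for example for retrieval or classification), the frame-level output can be averaged over time while skipping padded frames. The following is a minimal sketch, not part of this repository: `masked_mean_pool` is a hypothetical helper, and it assumes a boolean mask of shape [B, L] that is True for valid frames. The polarity of the `mask` returned by the encoder is not stated here, so check it and invert with `~mask` if True marks padding instead.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """Average frame embeddings over time, ignoring padded frames.

    hidden:     [B, L, D] encoder output
    valid_mask: [B, L] boolean, True where the frame holds real audio (assumed polarity)
    """
    # Accumulate in float32 so bfloat16 inputs do not lose precision in the sum.
    weights = valid_mask.unsqueeze(-1).to(torch.float32)  # [B, L, 1]
    summed = (hidden.to(torch.float32) * weights).sum(dim=1)  # [B, D]
    counts = weights.sum(dim=1).clamp(min=1.0)  # [B, 1], avoids division by zero
    return summed / counts


# Toy tensors standing in for the encoder's `output` and `mask`.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # [1, 3, 2]
valid = torch.tensor([[True, True, False]])  # last frame is padding
pooled = masked_mean_pool(hidden, valid)
print(pooled.shape)  # torch.Size([1, 2])
```

With real encoder output, the call would be `masked_mean_pool(output, mask)` (or `~mask`, depending on the polarity), giving a [B, 2560] or [B, 1536] tensor depending on `project`.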

Output modes

| project | Output dim | Description |
|---|---|---|
| True (default) | 2560 | Projected to Gemma 4 E4B LLM embedding space |
| False | 1536 | Raw audio encoder output |

Acknowledgements

Inspired by mesolitica/gemma-3n-e4b-it-audio-encoder.
