About the downsampling rate
Hi, thanks for the great work!
I have a question about the downsampling rate.
So let's say we have a waveform of T sample points.
The feature_extractor first extracts some kind of feature sequence (mel-spectrograms, I guess?), with a downsampling rate of 160 and a feature dimension of 128. That is to say, the feature sequence has the shape [T/160, 128].
Then, the audio_encoder further encodes the features into audio_encodings, which have the shape [T/160/16, 1536].
However, according to the Gemma 3n developer guide (https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#:~:text=Gemma%203n%20uses%20an%20advanced%20audio%20encoder%20based%20on%20the%20Universal%20Speech%20Model%20(USM).):

> The encoder generates a token for every 160ms of audio (about 6 tokens per second), ...
I first thought I would get a feature sequence of shape [T/160, 1536], but it turns out it is downsampled further.
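For what it's worth, here's the shape arithmetic that reconciles the guide's "160 ms per token" with the extra 16x downsampling. This is just a sketch; the 16 kHz sample rate is my assumption, not something stated above:

```python
# Sketch of the downsampling arithmetic (assumed 16 kHz input).
sample_rate = 16_000        # samples per second (assumption)
hop = 160                   # feature_extractor hop: 160 samples -> 10 ms per frame
encoder_stride = 16         # additional downsampling inside the audio_encoder

T = sample_rate * 2         # e.g. a 2-second waveform
n_frames = T // hop                     # feature sequence length: [T/160, 128]
n_tokens = n_frames // encoder_stride   # encoding length: [T/160/16, 1536]

ms_per_token = 1000 * hop * encoder_stride / sample_rate
tokens_per_second = sample_rate / (hop * encoder_stride)
print(n_frames, n_tokens, ms_per_token, tokens_per_second)
# -> 200 12 160.0 6.25
```

So 160 samples x 16 = 2560 samples per token, i.e. 160 ms at 16 kHz, which matches the "about 6 tokens per second" in the guide.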
I wonder if this is the expected behavior?
Sorry, on second thought, I believe this is the expected behavior. My bad... closing this one.