About the downsampling rate

#1
by unilight - opened

Hi, thanks for the great work!
I have a question about the downsampling rate.

So let's say we have a waveform of T sample points.
The feature_extractor first extracts some kind of feature sequence (I guess mel-spectrograms?), with a downsampling rate of 160 and a feature dimension of 128. That is, the feature sequence has shape [T/160, 128].
Then the audio_encoder further encodes the features into audio_encodings, which have shape [T/160/16, 1536].

However, according to the Gemma 3n guide (https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#:~:text=Gemma%203n%20uses%20an%20advanced%20audio%20encoder%20based%20on%20the%20Universal%20Speech%20Model%20(USM).):

"The encoder generates a token for every 160ms of audio (about 6 tokens per second), ..."

I first thought I would get some feature of shape [T/160, 1536], but it turns out it was further downsampled.

I wonder if this is the expected behavior?

Sorry, after thinking twice, I think it's the expected behavior. My bad... closing this one.
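
For anyone who lands here with the same question, the two numbers can be reconciled with a little arithmetic. The sketch below assumes 16 kHz input audio; the sampling rate is not stated above, so treat it as an assumption. The function name `token_stats` and the use of floor division are illustrative only and are not taken from the model's code. Under that assumption, the 160 in the feature extractor counts samples, while the 160 in the guide counts milliseconds.

```python
# Assumption: 16 kHz input audio (not stated in this thread).
SAMPLE_RATE_HZ = 16_000
FEATURE_HOP_SAMPLES = 160  # feature_extractor downsampling rate, from the post
ENCODER_DOWNSAMPLING = 16  # extra reduction in audio_encoder, from the post


def token_stats(num_samples: int) -> dict:
    """Hypothetical helper: relate waveform length to encoder token count."""
    # Floor division is a simplification; real padding behavior may differ.
    frames = num_samples // FEATURE_HOP_SAMPLES   # roughly T/160
    tokens = frames // ENCODER_DOWNSAMPLING       # roughly T/160/16
    samples_per_token = FEATURE_HOP_SAMPLES * ENCODER_DOWNSAMPLING
    return {
        "frames": frames,
        "tokens": tokens,
        "ms_per_token": 1000 * samples_per_token / SAMPLE_RATE_HZ,
        "tokens_per_second": SAMPLE_RATE_HZ / samples_per_token,
    }


# One second of audio at the assumed sampling rate.
stats = token_stats(SAMPLE_RATE_HZ)
print(stats)
```

With the 16 kHz assumption, each token covers 160 × 16 = 2560 samples, which is 160 ms, or 6.25 tokens per second. That is consistent with the guide's "about 6 tokens per second", so the extra factor of 16 is what produces the 160 ms token rate.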

unilight changed discussion status to closed
