About the downsampling rate
Hi, thanks for the great work!
I have a question about the downsampling rate.
So let's say we have a waveform of T sample points.
The feature_extractor first extracts some kind of feature sequence (mel-spectrograms, I guess?), with a downsampling rate of 160 and a feature dimension of 128. That is to say, the feature sequence has the shape [T/160, 128].
Then, the audio_encoder further encodes the features into audio_encodings, which have the shape [T/160/16, 1536].
However, according to the Gemma 3n developer guide (https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#:~:text=Gemma%203n%20uses%20an%20advanced%20audio%20encoder%20based%20on%20the%20Universal%20Speech%20Model%20(USM).):

> The encoder generates a token for every 160ms of audio (about 6 tokens per second), ...
I first thought I would get a feature sequence of shape [T/160, 1536], but it turns out it is downsampled further.
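For what it's worth, here's the shape arithmetic that reconciles the guide's "160 ms per token" with the extra 16x downsampling. This is just a sketch; the 16 kHz sample rate is my assumption, not something stated above:

```python
# Sketch of the downsampling arithmetic (assumed 16 kHz input).
sample_rate = 16_000        # samples per second (assumption)
hop = 160                   # feature_extractor hop: 160 samples -> 10 ms per frame
encoder_stride = 16         # additional downsampling inside the audio_encoder

T = sample_rate * 2         # e.g. a 2-second waveform
n_frames = T // hop                     # feature sequence length: [T/160, 128]
n_tokens = n_frames // encoder_stride   # encoding length: [T/160/16, 1536]

ms_per_token = 1000 * hop * encoder_stride / sample_rate
tokens_per_second = sample_rate / (hop * encoder_stride)
print(n_frames, n_tokens, ms_per_token, tokens_per_second)
# -> 200 12 160.0 6.25
```

So 160 samples x 16 = 2560 samples per token, i.e. 160 ms at 16 kHz, which matches the "about 6 tokens per second" in the guide.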
I wonder if this is the expected behavior?
Sorry, on second thought, I believe this is the expected behavior. My bad... closing this one.