On TIMIT

#2
by istomin9192 - opened

On TIMIT, I was only able to achieve 29% PER with this model.

Hi @istomin9192 ,

thank you for trying out this model. I did evaluations on TIMIT with this model a few years ago as well, but cannot recall the precise numbers I got. But they were better 🤗
What I do remember, though, is the effort I had to put into matching TIMIT's set of phonemes to my work's set of phones. I assume you already use the reduced set of phonemes for TIMIT (e.g. collapsing closure/burst for stop phonemes); you'd still have to account for diphthongs, for example. This model doesn't recognize diphthongs, as those are primarily language-specific. Instead, it will recognize each individual phone. I know TIMIT quite well, and I assume that the high PER you observed resulted from some mismatch in the ground-truth annotation.

Thank you for your reply!

First, I tried to convert TIMIT to the IPA alphabet.
I merged closure + burst stop phones using this rule:

bcl + b = b
tcl + t = t
...

and so on. I am not sure this is the correct approach - it might be better to simply ignore the closure phones.
I also simplified the following cases:

'eng' → 'ŋ'      # washington:  w aa sh ENG tcl t ax n
'nx'  → 'n'      # winner:      w ih NX axr
'dx'  → 't'      # muddy, dirty: m ah DX iy, dcl d er DX iy

I am also not completely sure about 'q' (IPA: ʔ).
However, I kept it because the model knows this phone, even though it never predicted it.
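The conversion described above could be sketched roughly like this; the function name and the exact contents of the closure set and simplification table are illustrative assumptions, not the actual script used:

```python
# Hypothetical sketch of the TIMIT normalization described above.
CLOSURES = {'bcl', 'dcl', 'gcl', 'pcl', 'tcl', 'kcl'}  # merged into the burst
SIMPLIFY = {'eng': 'ŋ', 'nx': 'n', 'dx': 't'}          # rules listed above

def normalize_timit(phones):
    out = []
    for p in phones:
        if p in CLOSURES:
            continue                     # closure + burst -> burst only
        out.append(SIMPLIFY.get(p, p))   # apply the simplifications
    return out

# "winner": w ih NX axr
print(normalize_timit(['w', 'ih', 'nx', 'axr']))  # ['w', 'ih', 'n', 'axr']
```

Dropping the closure symbol is equivalent to the "bcl + b = b" merge whenever the closure immediately precedes its burst, which is the typical case in TIMIT.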

For Wav2Vec2_CommonPhone, I applied fairly simple transformations, trying to preserve as much of the model's output as possible:

# merge successive symbol pairs (s, s2) in the predicted sequence
out, i = [], 0
while i < len(seq):
    s = seq[i]
    s2 = seq[i + 1] if i + 1 < len(seq) else None

    # affricates
    if s == 't' and s2 == 'ʃ':
        out.append('tʃ'); i += 2
    elif s == 'd' and s2 == 'ʒ':
        out.append('dʒ'); i += 2

    # diphthongs
    elif s == 'e' and s2 == 'ɪ':
        out.append('eɪ'); i += 2
    # .. the same for the rest: 'aɪ', 'aʊ', 'ɔɪ', 'oʊ'

    # r-colored vowels
    elif s == 'ɜː' and s2 == 'r':
        out.append('ɝː'); i += 2
    elif s == 'ə' and s2 == 'r':
        out.append('ɚ'); i += 2
    elif s == 'ɔ':
        out.append('ɔː'); i += 1

    else:
        out.append(s); i += 1

In the end, on TIMIT the model produced only five phones that are not part of American English (i.e., they do not appear in TIMIT): ['a', 'aː', 'œ', 'ɒ', 'ɜː'].
So the issue is probably not related to the mapping.

I tried to understand why this happened. One possible reason is that G2P_MAUS, which you used for phonetic annotation, might actually produce phonemes rather than phones, while TIMIT is annotated with phones.
What do you think?

For example, here is a predicted output from the model and the ground truth from TIMIT:

TXT: She had your dark suit in greasy wash water all year.
PR: ʃiː hæd jʊ  dɑːk  suːt ɪn griːsɪ  wɒʃ  wɔːtə   ɔː  jɪə
GT: ʃiː hæd jɝː dɑːrk suːt ɪŋ griːsiː wɑːʃ wɑːtɝː ʔ ɔːl jiːɚ
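For pairs like the one above, PER is the phone-level edit distance between prediction and ground truth, divided by the reference length. A minimal sketch (the function names are mine, not from the original evaluation script):

```python
def edit_distance(ref, hyp):
    # standard Levenshtein distance over phone sequences, one-row DP
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def per(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

print(per(['ʃ', 'iː'], ['ʃ', 'ɪ']))  # 0.5: one substitution over two phones
```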

Here is the result from MAUS:
[image: MAUS output]

I am not so sure about the conversion of r-colored vowels; the rest looks good to me. Just dropping the closures of stop sounds is completely fine as long as alignment is out of the question. The deviations you observed could indeed come from a phoneme bias in MAUS. I know that the system combines recognized pronunciation with a g2p approach (which is phoneme-level for sure). I once had a discussion with a colleague related to something you also observed here, namely the missing glottal stop. I continued working on recovering such patterns in an unsupervised fashion after working on Common Phone and this model, but had to drop that work due to other obligations.

I have worked a lot with TIMIT, and I know there are a few segments for which I don't fully agree with the provided annotation (but that aside, the annotation is of incredibly high quality).

Got it, thank you for your reply.

  1. By the way, in my experiments on TIMIT, facebook/wav2vec2-xlsr-53-espeak-cv-ft produced very similar results with its original head (PER ≈ 27%).

  2. Instead of using the built-in head in Wav2Vec2_CommonPhone, I trained a new frame-based head on TIMIT with a frozen encoder.
    Since TIMIT has fewer classes, a better result was expected, and indeed the PER dropped to 18%. However, a BiLSTM head on top of the frozen encoder performed significantly better, achieving 13% PER.
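The two heads compared here could look roughly like this in PyTorch; the hidden size (768), class count (40), and LSTM dimension are assumptions for illustration, not the actual configuration used:

```python
import torch
import torch.nn as nn

HIDDEN, N_PHONES = 768, 40  # assumed encoder width and TIMIT class count

# frame-wise classifier: each frame is scored independently, no context
frame_head = nn.Linear(HIDDEN, N_PHONES)

class BiLSTMHead(nn.Module):
    """Recurrent head: per-frame logits conditioned on left/right context."""
    def __init__(self, hidden=HIDDEN, n_classes=N_PHONES, lstm_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(hidden, lstm_dim, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * lstm_dim, n_classes)

    def forward(self, feats):            # feats: (batch, time, hidden)
        out, _ = self.lstm(feats)
        return self.proj(out)            # (batch, time, n_classes)

feats = torch.randn(2, 100, HIDDEN)      # stand-in for frozen encoder output
logits = BiLSTMHead()(feats)             # shape (2, 100, 40)
```

Both heads produce one logit vector per frame; the difference is only whether neighboring frames can influence the decision.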

This suggests that, in this case, the latent representations are not clearly separable at the single-frame level. In other words, some phones in the latent space seem to form overlapping trajectories over time rather than compact clusters. This may be a property inherited from the original model. This was quite an interesting observation for me.

Interesting observation for the 2nd experiment! I also have some idea why you got such results. If I understand you correctly, you optimized w.r.t. time-aligned ground-truth from TIMIT on a frame-wise level. Hence, your per-time-step probability mass is distributed among the phonetic symbols in your inventory. My model (and the original Wav2Vec2) were not fine-tuned in a frame-wise fashion, but rather using CTC. In CTC, a large amount of the probability mass is attributed to the blank token if you look at a full sequence. Traditionally, the number of observable states - in your case phones - is (much) smaller than the number of time-steps, and CTC-optimized models will often predict blank tokens if the current state hasn't changed. This will certainly also influence the way the encoder builds its representations, and could consequently affect your results if you then try to do frame-wise classification.
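The peaky behavior described above can be illustrated with a toy example of greedy CTC decoding (the frame labels and the blank symbol '<b>' are made up for illustration): most frames carry the blank token, and decoding first merges consecutive repeats, then drops blanks.

```python
BLANK = '<b>'

def ctc_collapse(frames):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    out = []
    for sym in frames:
        if out and sym == out[-1]:
            continue              # merge consecutive repeats
        out.append(sym)
    return [s for s in out if s != BLANK]

# 10 frames, but only 3 phones survive: the rest of the
# probability mass sits on the blank token
frames = ['<b>', '<b>', 'k', '<b>', '<b>', 'æ', '<b>', 't', '<b>', '<b>']
print(ctc_collapse(frames))  # ['k', 'æ', 't']
```

A frame-wise classifier trained on time-aligned labels never sees such a blank class, so it has to commit to a phone at every single frame.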

Why does the BiLSTM perform better? Because it "recovers" the label information over time at those frames where the encoder yielded something closer to a blank token. It essentially smooths the representations again, filtering out that default class.

If you are interested in CTC details, I strongly recommend checking out this article from Awni Hannun: Sequence Modeling With CTC

Yes, in my setup I freeze the backbone and train either a linear classifier or an LSTM head using time-aligned ground-truth from TIMIT in a frame-wise manner.
I get your point about CTC's peaky behavior. There's even a paper discussing this: https://arxiv.org/pdf/2105.14849
The CTC head from Wav2Vec2_CommonPhone does typically produce one peak per phone, with the remaining frames assigned to the blank token.
As I understand it, your hypothesis is that this behavior might have propagated into the backbone during fine-tuning.

However, I also repeated the same experiment using purely SSL encoders - facebook/wav2vec2-xls-r-300m and mHuBERT-147.
They were not fine-tuned with CTC at all. The effect persists: a frozen encoder + LSTM head significantly outperforms a frame-wise classifier.

Maybe phones in the latent space are not well-separated frame-wise clusters, but rather form overlapping trajectories over time.
Or maybe it's simply that the context of neighboring phones is taken into account, and the LSTM helps to resolve it.
I'll look for more papers. And the article by Awni Hannun is perfect.

I think what you observe makes perfect sense, and your reasoning behind it appears correct. A single phone is just some definition of a sound we use during speech production. If we go deeper, that sound is not the same from start to end. All audio models I am aware of that were optimized in an SSL fashion usually applied an inventory quite a bit larger than the expected number of symbols in something like a phone(me) recognition task. And with contributions like a codebook diversity loss (in the case of Wav2Vec2), the model will use the full inventory, and not just collapse it down to a representation we humans deem meaningful.

That being said, any RNN can learn how a single phone(me) transitions, for example from onset to offset, or closure to burst. This entire cluster will then be recognized as "the class". A simple frame-wise classifier only has access to this very moment, and consequently has to operate without context. Now let's look at the very moment of closure right before the burst of a stop sound. Looking only at this small window, you will be able to tell whether there is some voiced component, but you will not be able to tell what kind of burst follows in the next frame. Is it a /p/, a /t/, or a /k/, in the case of an unvoiced stop? That's just one example where context is important. And within a language, any RNN will also learn which class transitions are more or less likely.
