Question about dataset

#10
by ppakji - opened

Hi, thank you for your contribution!

I would like to know whether this model was pre-trained only on Korean.
Also, do you have performance numbers for this pre-trained model? -> I found them! But I'm wondering how you improved the performance (1st CER 80% -> 3rd CER 9.35%).
My own model pre-trained on KsponSpeech doesn't work well when fine-tuned, so I wanted to ask about it.
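(For anyone else reading: CER here is the character error rate, i.e. the character-level Levenshtein edit distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of how it's computed, in case it helps compare numbers:)

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    hyp = list(hypothesis)
    # prev[j] = edit distance between the current reference prefix
    # and the first j characters of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)

print(cer("abcd", "abxd"))  # one substitution out of 4 chars -> 0.25
```

(In practice a library such as jiwer is usually used instead of hand-rolling this.)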

Thank you in advance!

Hello! Thank you for your interest in our model.

To answer your question, the model was pre-trained exclusively on Korean speech data provided by AI Hub. While there's a small chance some other languages are mixed in, the amount would be negligible as the data was either filtered or generated from scripted readings.
I can't find the exact performance records from back then, but I do recall that it performed better on Korean tasks compared to models pre-trained on English.

As for improving performance, you could consider using a larger model. Additionally, if the data you're using for fine-tuning is significantly different from the pre-training data (for example, if it involves various dialects), applying Continual Pre-training could be an effective approach.
If you'd like, I can try to find the code we used for training and share it with you.

Thank you for your answer.

I misunderstood the scores: they came from other people's fine-tuned models that were pre-trained on yours.
So do you have any model fine-tuned from this pre-trained model (hubert-base-korean)? If so, could you share the results?

I really appreciate your explanation!
