Question about dataset
Hi, thank you for your contribution!
I would like to know: was this model pre-trained only on Korean?
Also, do you have performance numbers for this pre-trained model? -> I found them! But I'm wondering how you improved the performance (1st CER 80% -> 3rd CER 9.35%).
My own model pre-trained on KsponSpeech doesn't work well when fine-tuned, so I wanted to ask about it.
Thank you in advance!
Hello! Thank you for your interest in our model.
To answer your question, the model was pre-trained exclusively on Korean speech data provided by AI Hub. While there's a small chance some other languages are mixed in, the amount would be negligible as the data was either filtered or generated from scripted readings.
I can't find the exact performance records from back then, but I do recall that it performed better on Korean tasks compared to models pre-trained on English.
As for improving performance, you could consider using a larger model. Additionally, if the data you're using for fine-tuning is significantly different from the pre-training data (for example, if it involves various dialects), applying Continual Pre-training could be an effective approach.
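As a side note, since CER comes up a few times in this thread: CER is just the character-level edit (Levenshtein) distance between the hypothesis and the reference, divided by the reference length, so you can sanity-check your fine-tuned model's scores with a few lines of standard-library Python. A minimal sketch (no external dependencies assumed):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

If the CER on a small held-out set stays near 100% after fine-tuning, the usual suspects are a tokenizer/vocabulary mismatch between pre-training and fine-tuning, or the CTC head not being reinitialized for the new vocabulary.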
If you'd like, I can try to find the code we used for training and share it with you.
Thank you for your answer.
I misunderstood the scores of another person's fine-tuned models, which were initialized from yours.
So, do you have any model fine-tuned from this pre-trained model (hubert-base-korean)? If so, could you share the results?
I really appreciate your explanation!