Question about dataset

#10
by ppakji - opened

Hi, thank you for your contribution!

I would like to know whether this model was pre-trained only on Korean.
Also, do you have performance numbers for this pre-trained model? -> I found them! But I'm wondering how you improved the performance (1st CER 80% -> 3rd CER 9.35%).
My own model pre-trained on KsponSpeech doesn't work well when fine-tuned, so I wanted to ask about it.
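(For anyone else reading: CER here is the character error rate, i.e. the character-level Levenshtein edit distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of how it's computed, in case it helps compare numbers:)

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    hyp = list(hypothesis)
    # prev[j] = edit distance between the current reference prefix
    # and the first j characters of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)

print(cer("abcd", "abxd"))  # one substitution out of 4 chars -> 0.25
```

(In practice a library such as jiwer is usually used instead of hand-rolling this.)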

Thank you in advance!

Hello! Thank you for your interest in our model.

To answer your question, the model was pre-trained exclusively on Korean speech data provided by AI Hub. While there's a small chance some other languages are mixed in, the amount would be negligible as the data was either filtered or generated from scripted readings.
I can't find the exact performance records from back then, but I do recall that it performed better on Korean tasks compared to models pre-trained on English.

As for improving performance, you could consider using a larger model. Additionally, if the data you're using for fine-tuning is significantly different from the pre-training data (for example, if it involves various dialects), applying Continual Pre-training could be an effective approach.
If you'd like, I can try to find the code we used for training and share it with you.

Thank you for your answer.

I misunderstood the scores: they came from other people's fine-tuned models that were pre-trained on yours.
So do you have any model fine-tuned from this pre-trained model (hubert-base-korean)? If so, could you share the results?

I really appreciate your explanation!
