Japanese GPT-2 and low-resource language lessons

#3
by O96a - opened

Interesting to see a dedicated Japanese GPT-2 model still active — the arXiv:2404.01657 paper provides useful insights into Japanese language modeling.

We've been working on low-resource language models (Sudanese Arabic, African languages) and the challenges are similar: character set variations, lack of large-scale pretraining data, and tokenization issues with agglutinative morphology.

Curious: how does this compare to newer multilingual models on Japanese-specific benchmarks? There's been debate in our team about whether language-specific pretraining still provides value over large multilingual models fine-tuned on Japanese data.

Also, any observations on tokenization efficiency? We've found that subword tokenizers trained primarily on English can be highly inefficient for languages with different character sets — leading to longer sequences and slower inference.
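To make that inefficiency concrete, here is a minimal stdlib-only sketch (our own illustration, not from the model card) of the worst case for a byte-level BPE like GPT-2's: when the vocabulary has few merges covering Japanese script, each kanji can fall back to its raw UTF-8 bytes (3 tokens per character), versus one token per character for a tokenizer whose vocabulary covers the script:

```python
def worst_case_byte_tokens(text: str) -> int:
    """Token count if every character falls back to raw UTF-8 bytes,
    the worst case for a byte-level BPE with no merges for this script."""
    return len(text.encode("utf-8"))

def char_level_tokens(text: str) -> int:
    """Token count for a tokenizer whose vocabulary covers each character."""
    return len(text)

jp = "自然言語処理"  # "natural language processing", 6 kanji

print(worst_case_byte_tokens(jp))  # 18: each kanji is 3 UTF-8 bytes
print(char_level_tokens(jp))       # 6: one token per character, 3x shorter
```

Real tokenizers sit between these bounds, but the gap is why sequence length (and thus attention cost) can blow up on scripts the tokenizer wasn't trained on.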
