Japanese GPT-2 and low-resource language lessons

#3
by O96a - opened

Interesting to see a dedicated Japanese GPT-2 model still active — the arXiv:2404.01657 paper provides useful insights into Japanese language modeling.

We've been working on low-resource language models (Sudanese Arabic, African languages) and the challenges are similar: character set variations, lack of large-scale pretraining data, and tokenization issues with agglutinative morphology.

Curious: how does this compare to newer multilingual models on Japanese-specific benchmarks? There's been debate in our team about whether language-specific pretraining still provides value over large multilingual models fine-tuned on Japanese data.

Also, any observations on tokenization efficiency? We've found that subword tokenizers trained primarily on English can be highly inefficient for languages with different character sets — leading to longer sequences and slower inference.
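To make that inefficiency concrete, here is a minimal stdlib-only sketch (our own illustration, not from the model card) of the worst case for a byte-level BPE like GPT-2's: when the vocabulary has few merges covering Japanese script, each kanji can fall back to its raw UTF-8 bytes (3 tokens per character), versus one token per character for a tokenizer whose vocabulary covers the script:

```python
def worst_case_byte_tokens(text: str) -> int:
    """Token count if every character falls back to raw UTF-8 bytes,
    the worst case for a byte-level BPE with no merges for this script."""
    return len(text.encode("utf-8"))

def char_level_tokens(text: str) -> int:
    """Token count for a tokenizer whose vocabulary covers each character."""
    return len(text)

jp = "自然言語処理"  # "natural language processing", 6 kanji

print(worst_case_byte_tokens(jp))  # 18: each kanji is 3 UTF-8 bytes
print(char_level_tokens(jp))       # 6: one token per character, 3x shorter
```

Real tokenizers sit between these bounds, but the gap is why sequence length (and thus attention cost) can blow up on scripts the tokenizer wasn't trained on.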
