
scibert-wechsel-korean

SciBERT (🇺🇸) converted to Korean (🇰🇷) using the WECHSEL technique.

Description

  • SciBERT is trained on papers from the semanticscholar.org corpus. The corpus contains 1.14M papers (3.1B tokens).
  • WECHSEL transfers the embedding layer's subword token embeddings from the source language to the target language.
  • SciBERT, originally trained on English, is converted to Korean using the WECHSEL technique.
  • The Korean tokenizer is taken from the KLUE PLMs, chosen for its similar vocabulary size (32,000) and its performance.
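
The embedding transfer described above can be illustrated with a small toy sketch. It is not the code used to build this model: it assumes the commonly described WECHSEL idea that each target-language token embedding is initialized as a softmax-weighted mix of the model embeddings of its most similar source-language tokens, with similarity measured in a shared static embedding space. The function name `init_target_embeddings`, the parameters `k` and `temperature`, and all vectors below are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def init_target_embeddings(src_static, tgt_static, src_model_emb, k=2, temperature=0.1):
    """Hypothetical helper: build one model embedding per target token.

    src_static / tgt_static: static vectors in a shared (aligned) space.
    src_model_emb: the source model's embedding rows, one per source token.
    """
    dim = len(src_model_emb[0])
    result = []
    for t_vec in tgt_static:
        # Similarity of this target token to every source token.
        sims = [cosine(t_vec, s_vec) for s_vec in src_static]
        # Indices of the k most similar source tokens.
        nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        # Softmax over the k similarities (shifted by the max for stability).
        top = max(sims[i] for i in nearest)
        weights = [math.exp((sims[i] - top) / temperature) for i in nearest]
        total = sum(weights)
        # Weighted average of the selected source model embeddings.
        mixed = [
            sum((w / total) * src_model_emb[i][d] for w, i in zip(weights, nearest))
            for d in range(dim)
        ]
        result.append(mixed)
    return result


# Toy data (assumed): 3 source tokens, 2 target tokens.
src_static = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_static = [[0.9, 0.1], [0.1, 0.9]]
src_model_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

tgt_model_emb = init_target_embeddings(src_static, tgt_static, src_model_emb)
for row in tgt_model_emb:
    print([round(x, 3) for x in row])
```

In this toy setup each target row is a convex combination of source rows, so the first target token ends up closest to the first source token's embedding. The rest of the model (all non-embedding layers) would be kept unchanged and then further trained on target-language text.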

Reference