
# BertJapanese [[bertjapanese]]

## Overview [[overview]]

BERT models trained on Japanese text.

๊ฐ๊ฐ ์„œ๋กœ ๋‹ค๋ฅธ ํ† ํฐํ™” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋Š” ๋‘ ๋ชจ๋ธ:

  • MeCab์™€ WordPiece๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ† ํฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์ถ”๊ฐ€ ์˜์กด์„ฑ fugashi์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. (์ด๋Š” MeCab์˜ ๋ž˜ํผ์ž…๋‹ˆ๋‹ค.)
  • ๋ฌธ์ž ๋‹จ์œ„๋กœ ํ† ํฐํ™”ํ•ฉ๋‹ˆ๋‹ค.

To use MecabTokenizer, you should run `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install from source) to install the dependencies.

์ž์„ธํ•œ ๋‚ด์šฉ์€ cl-tohoku ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์—์„œ ํ™•์ธํ•˜์„ธ์š”.

Example of using a model with MeCab and WordPiece tokenization:

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "ๅพ่ผฉใฏ็Œซใงใ‚ใ‚‹ใ€‚"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] ๅพ่ผฉ ใฏ ็Œซ ใง ใ‚ใ‚‹ ใ€‚ [SEP]

>>> outputs = bertjapanese(**inputs)
```

๋ฌธ์ž ํ† ํฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ ์˜ˆ์‹œ:

```python
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "ๅพ่ผฉใฏ็Œซใงใ‚ใ‚‹ใ€‚"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] ๅพ ่ผฉ ใฏ ็Œซ ใง ใ‚ ใ‚‹ ใ€‚ [SEP]

>>> outputs = bertjapanese(**inputs)
```

์ด๋Š” ํ† ํฐํ™” ๋ฐฉ๋ฒ•์„ ์ œ์™ธํ•˜๊ณ ๋Š” BERT์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค. API ์ฐธ์กฐ ์ •๋ณด๋Š” BERT ๋ฌธ์„œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”. ์ด ๋ชจ๋ธ์€ cl-tohoku๊ป˜์„œ ๊ธฐ์—ฌํ•˜์˜€์Šต๋‹ˆ๋‹ค.

## BertJapaneseTokenizer

[[autodoc]] BertJapaneseTokenizer