---
language:
- ko
license: apache-2.0
tags:
- tokenizer
- korean
- sentencepiece
- unigram
library_name: transformers
pipeline_tag: text-generation
---

# YunMin Korean Tokenizer (96k vocab)

A Korean tokenizer with a 96,000-token vocabulary, optimized for Korean text processing.

## Files Description

- `YunMin-tokenizer-96k.model` - SentencePiece model file (2.0 MB); can also be loaded directly with the `sentencepiece` package (see the sketch under Usage)
- `YunMin-tokenizer-96k.vocab` - Vocabulary file (2.0 MB)
- `tokenizer.json` - Hugging Face tokenizer configuration
- `tokenizer_config.json` - Tokenizer configuration parameters
- `special_tokens_map.json` - Special tokens mapping
- `config.json` - Model configuration

## Usage

### From Hugging Face Hub

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

# Tokenize Korean text
text = "안녕하세요, 한국어 토크나이저입니다."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded: {decoded_text}")
```
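
### Using the raw SentencePiece model

The `YunMin-tokenizer-96k.model` file can also be used without `transformers`. A minimal sketch, assuming the `sentencepiece` package is installed and the model file has been downloaded from this repository into the working directory:

```python
import sentencepiece as spm

# Load the raw SentencePiece model file (assumes it is in the
# current working directory).
sp = spm.SentencePieceProcessor(model_file="YunMin-tokenizer-96k.model")

text = "안녕하세요, 한국어 토크나이저입니다."

# Encode to subword pieces and to token IDs.
print(sp.encode(text, out_type=str))
print(sp.encode(text, out_type=int))

# Round-trip back to the original string.
print(sp.decode(sp.encode(text)))
```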

## Special Tokens

- `<unk>` - Unknown token
- `<s>` - Beginning of sequence
- `</s>` - End of sequence
- `<pad>` - Padding token
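
A minimal sketch of these tokens in use, assuming the tokenizer has been loaded as shown above, `<pad>` is registered as the padding token (as `special_tokens_map.json` indicates), and PyTorch is available for `return_tensors="pt"`:

```python
# Special tokens and their IDs, as registered in special_tokens_map.json.
print(tokenizer.unk_token, tokenizer.unk_token_id)
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)

# <pad> fills shorter sequences when batching; the attention mask
# marks real tokens (1) versus padding (0).
batch = tokenizer(
    ["안녕하세요.", "한국어 토크나이저입니다."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"])
print(batch["attention_mask"])
```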

## Vocabulary Size

96,000 tokens optimized for Korean language processing.
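
A quick sanity check, assuming the tokenizer has been loaded as shown above:

```python
# Base vocabulary size per the model card; len() additionally counts
# any tokens added on top of the base vocabulary.
print(tokenizer.vocab_size)  # expected: 96000
print(len(tokenizer))
```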

## Model Type

Unigram language model with whitespace pre-tokenization.
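
The serialized `tokenizer.json` records the tokenization algorithm in its `"model"` section. A minimal check, assuming the file has been downloaded from this repository into the working directory:

```python
import json

# The "model" section of tokenizer.json names the algorithm.
with open("tokenizer.json", encoding="utf-8") as f:
    config = json.load(f)

print(config["model"]["type"])  # expected: "Unigram"
```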