---
language:
- ko
license: apache-2.0
tags:
- tokenizer
- korean
- sentencepiece
- unigram
library_name: transformers
pipeline_tag: text-generation
---
# YunMin Korean Tokenizer (96k vocab)
A Korean tokenizer with a 96,000-token vocabulary, built as a SentencePiece Unigram model and optimized for Korean text processing.
## Files
- `YunMin-tokenizer-96k.model` - SentencePiece model file (2.0MB)
- `YunMin-tokenizer-96k.vocab` - Vocabulary file (2.0MB)
- `tokenizer.json` - Hugging Face tokenizer configuration
- `tokenizer_config.json` - Tokenizer configuration parameters
- `special_tokens_map.json` - Special tokens mapping
- `config.json` - Model configuration
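
If you prefer to work with the raw SentencePiece model file directly (for example, to access SentencePiece-specific features), a minimal sketch along these lines should work. It assumes the `sentencepiece` and `huggingface_hub` packages are installed:

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the raw SentencePiece model file from the Hub
model_path = hf_hub_download(
    repo_id="mrcha033/YunMin-tokenizer-96k",
    filename="YunMin-tokenizer-96k.model",
)

# Load it with the sentencepiece library
sp = spm.SentencePieceProcessor(model_file=model_path)

# Encode to subword pieces and to token ids
print(sp.encode("μ•ˆλ…•ν•˜μ„Έμš”", out_type=str))
print(sp.encode("μ•ˆλ…•ν•˜μ„Έμš”"))
```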
## Usage
### From Hugging Face Hub
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

# Tokenize Korean text
text = "μ•ˆλ…•ν•˜μ„Έμš”, ν•œκ΅­μ–΄ ν† ν¬λ‚˜μ΄μ €μž…λ‹ˆλ‹€."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded: {decoded_text}")
```
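
The tokenizer can also be called directly on a list of sentences for batch encoding. A short sketch, assuming the `<pad>` token is configured as listed under Special Tokens below:

```python
texts = ["μ•ˆλ…•ν•˜μ„Έμš”", "ν•œκ΅­μ–΄ ν† ν¬λ‚˜μ΄μ € ν…ŒμŠ€νŠΈμž…λ‹ˆλ‹€."]

# Batch-encode, padding to the longest sequence in the batch
batch = tokenizer(texts, padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```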
## Special Tokens
- `<unk>` - Unknown token
- `<s>` - Beginning of sequence
- `</s>` - End of sequence
- `<pad>` - Padding token
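
After loading with `PreTrainedTokenizerFast`, the special tokens and their ids can be inspected as shown below; the exact values come from `special_tokens_map.json` and `tokenizer_config.json`:

```python
# Inspect the configured special tokens and their ids
print(tokenizer.unk_token, tokenizer.unk_token_id)
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```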
## Vocabulary Size
96,000 tokens optimized for Korean language processing.
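
A quick sanity check of the vocabulary size from either loading path (the counts may differ by a few tokens if extra special tokens were added on top of the SentencePiece vocabulary):

```python
# Via transformers (includes any added special tokens)
print(len(tokenizer))

# Via the raw SentencePiece model
print(sp.get_piece_size())
```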
## Model Type
Unigram language model with whitespace pre-tokenization.
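
Because the underlying model is a Unigram LM, the raw SentencePiece processor also supports subword regularization, i.e. sampling alternative segmentations of the same text, which can be useful for data augmentation during training. A minimal sketch using the `sp` processor loaded in the sketch above:

```python
text = "ν•œκ΅­μ–΄ ν† ν¬λ‚˜μ΄μ €"

# Deterministic (best) segmentation
print(sp.encode(text, out_type=str))

# Sampled segmentations: nbest_size=-1 samples over all hypotheses,
# alpha controls the smoothing of the sampling distribution
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, nbest_size=-1, alpha=0.1))
```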