license: apache-2.0
datasets:
  - VTSNLP/vietnamese_curated_dataset
language:
  - vi
tags:
  - tokenizer
  - vietnamese
  - byte-bpe
  - causal-lm
  - nlp

Vietnamese Tokenizer

This repository contains a ByteLevel BPE tokenizer trained from scratch specifically for the Vietnamese language, designed for decoder-only language model pretraining.
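As a rough illustration of what "byte-level" means for Vietnamese text (this is not the repository's training code, and the helper name `to_byte_symbols` is made up for this sketch): a byte-level BPE starts from the 256 possible byte values, so accented characters such as "ế" or "ệ" are split into several UTF-8 bytes before any merges are applied, and no input can fall outside the base alphabet.

```python
import unicodedata


def to_byte_symbols(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level BPE would start from.

    Hypothetical helper for illustration; a real tokenizer then merges
    frequent byte sequences into larger tokens.
    """
    # Normalize to NFC so each accented letter is one precomposed code point.
    normalized = unicodedata.normalize("NFC", text)
    return list(normalized.encode("utf-8"))


text = unicodedata.normalize("NFC", "Tiếng Việt")
symbols = to_byte_symbols(text)

print(len(text))     # number of characters: 10
print(len(symbols))  # number of UTF-8 bytes: 14 ("ế" and "ệ" take 3 bytes each)

# Every symbol is one of 256 byte values, and the bytes decode back losslessly.
assert all(0 <= b < 256 for b in symbols)
assert bytes(symbols).decode("utf-8") == text
```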


🚀 Usage

Load tokenizer

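from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub.
# use_fast=True requests the Rust-backed "fast" implementation.
tokenizer = AutoTokenizer.from_pretrained(
    "tranhuyHoang/mini_VN_decoder_tokenizer",
    use_fast=True,
)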
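Once the tokenizer is loaded, a quick encode/decode round trip is a reasonable sanity check. The sketch below is a minimal example, not part of the repository: `roundtrip` and `load_tokenizer` are hypothetical helper names, and whether the decoded string matches the input exactly depends on how this tokenizer's normalizer and decoder are configured.

```python
def load_tokenizer():
    """Load the tokenizer from the Hub (needs network access and `transformers`)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(
        "tranhuyHoang/mini_VN_decoder_tokenizer",
        use_fast=True,
    )


def roundtrip(tokenizer, text):
    """Encode `text` to token ids, then decode the ids back to a string.

    Works with any object exposing Hugging Face style `encode` and `decode`.
    """
    # add_special_tokens=False keeps BOS/EOS-style tokens out of the comparison.
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return ids, decoded
```

For example, `ids, decoded = roundtrip(load_tokenizer(), "Xin chào Việt Nam")` returns the token ids together with the reconstructed text, which can then be compared against the input.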