metadata
license: mit
language:
  - dan
  - deu
  - nld
  - swe
tags:
  - tokenizer
  - bpe
  - flexitok
  - fineweb2

Byte-Level BPE Tokenizer: dan_Latn, deu_Latn, nld_Latn, swe_Latn (32K)

A byte-level BPE tokenizer trained on Danish, German, Dutch, and Swedish data (dan_Latn, deu_Latn, nld_Latn, swe_Latn) from Fineweb-2-HQ.

Training Details

| Parameter | Value |
|---|---|
| Algorithm | Byte-Level BPE |
| Languages | dan_Latn, deu_Latn, nld_Latn, swe_Latn |
| Target Vocab Size | 32,000 |
| Final Vocab Size | 32,865 |
| Pre-tokenizer | custom:dan_Latn |
| Number handling | ltr_3digit |
| Contraction handling | True |
| Normalizer | NFC |
| Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
| Training Shards | 8 |
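
The number handling and normalizer settings can be illustrated with a minimal sketch. It assumes that `ltr_3digit` means digit runs are split left to right into groups of at most three digits, which is consistent with the sample encoding where `12345` appears as `123`, `45`. The helper names `split_digits_ltr` and `normalize_nfc` are hypothetical and are not part of the tokenizer's code.

```python
import re
import unicodedata


def split_digits_ltr(text: str, group: int = 3) -> list[str]:
    """Split every run of ASCII digits into chunks of at most `group`
    digits, scanning left to right. Non-digit text is kept as-is.

    Assumption: this is what the `ltr_3digit` setting refers to.
    """
    pieces = []
    for match in re.finditer(r"([0-9]+)|([^0-9]+)", text):
        digits, other = match.group(1), match.group(2)
        if digits is not None:
            # Chunk from the left, so a short remainder ends up last.
            pieces.extend(digits[i:i + group] for i in range(0, len(digits), group))
        else:
            pieces.append(other)
    return pieces


def normalize_nfc(text: str) -> str:
    """Apply Unicode NFC normalization (canonical composition)."""
    return unicodedata.normalize("NFC", text)


print(split_digits_ltr("12345"))            # digit run split from the left
print(normalize_nfc("a\u030a") == "\u00e5")  # 'a' + combining ring composes to 'å'
```

NFC matters for these languages because characters such as `å`, `ä`, `ö`, and `ü` can arrive either precomposed or as a base letter plus a combining mark; normalizing first means both spellings map to the same byte sequence before BPE merges are applied.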

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("flexitok/bpe_script_Germ_32000")
# encode() returns a list of token IDs
tokens = tokenizer.encode("Hello, world!")

Files

  • tokenizer.json — Full HuggingFace tokenizer
  • vocab.json — Vocabulary mapping
  • merges.txt — BPE merge rules

Sample Encoding

| Text | Tokens | Token IDs |
|---|---|---|
| Hello, world! 12345 This is a test. こんにちは | `H`, `ello`, `,`, `Ġworld`, `!`, `Ġ`, `123`, `45`, `ĠThis`, `Ġis`, `Ġa`, `Ġtest`, `.`, `Ġ`, `ãģ`, `ĵ`, `ãĤ`, `ĵ`, `ãģ«`, `ãģ` | 42, 13486, 14, 21745, 3, 223, 19219, 3832, 16775, 516, 270, 5190, 16, 223, 3768, 244, 5986, 244, 30698, 3768 |
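
Token strings such as `Ġworld` and `ãģ` in the sample encoding are not corrupted text: byte-level BPE displays each raw byte as a printable stand-in character. The sketch below assumes this tokenizer uses the same byte-to-character table as GPT-2 (the usual choice for byte-level tokenizers in the Hugging Face ecosystem) and shows how the displayed tokens map back to UTF-8 text. The helper `tokens_to_text` is a hypothetical name introduced for illustration.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build the GPT-2 style table mapping each byte value (0-255)
    to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return dict(zip(byte_values, map(chr, code_points)))


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def tokens_to_text(tokens: list[str]) -> str:
    """Map displayed token strings back to bytes and decode as UTF-8.
    Incomplete byte sequences are rendered as U+FFFD."""
    raw = bytes(CHAR_TO_BYTE[ch] for token in tokens for ch in token)
    return raw.decode("utf-8", errors="replace")


print(tokens_to_text(["H", "ello", ",", "Ġworld", "!"]))
print(tokens_to_text(["ãģ", "ĵ", "ãĤ", "ĵ", "ãģ«"]))
```

Under this assumption, `Ġ` stands for the space byte, and each Japanese character spans three bytes, so a single character can be split across several tokens. A token list that ends partway through a character therefore decodes with a replacement character at the end.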