AraFusion MorphBPE Tokenizer (Fanar FaraTok-dediac 76K)

Type: MorphBPE (Farasa morpheme-constrained BPE)
Vocab size: 76,800
Sequence length: 512
Base: Fanar-FaraTok-dediac_76K_with_special_tokens

Special tokens used by AraFusion

Token                         ID  Purpose
<|begin_of_text|>             0   Begin of sequence (BOS)
<|text_end|>                  1   End of sequence / document separator (EOS)
<|padding|>                   5   Padding
<|dialect_arb|>               22  Modern Standard Arabic dialect condition
<|dialect_ars|>               24  Najdi / Saudi dialect condition
<|dialect_arz|>               26  Egyptian dialect condition
<|reserved_special_token_0|>  62  MDLM diffusion mask token
<|reserved_special_token_1|>  63  CFG unconditional token
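
To make the roles of these IDs concrete, here is a minimal sketch of how a fixed-length, dialect-conditioned input might be assembled from them. The helper `build_sequence` and the layout `[BOS, dialect, text..., EOS, PAD...]` are assumptions for illustration only; the source does not state the order AraFusion actually uses. Only the IDs and the sequence length of 512 come from this card.

```python
# IDs copied from the special-token table in this card.
BOS_ID = 0
EOS_ID = 1
PAD_ID = 5
DIALECT_IDS = {"arb": 22, "ars": 24, "arz": 26}
MAX_LEN = 512  # sequence length stated for this tokenizer


def build_sequence(text_ids, dialect, max_len=MAX_LEN):
    """Hypothetical helper: wrap text IDs with special tokens and pad.

    Assumed layout (not confirmed by the source):
    [BOS, dialect, text..., EOS, PAD...]
    """
    dialect_id = DIALECT_IDS[dialect]
    # Reserve three positions for BOS, the dialect token, and EOS.
    body = list(text_ids)[: max_len - 3]
    seq = [BOS_ID, dialect_id, *body, EOS_ID]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [1] * len(seq) + [0] * (max_len - len(seq))
    seq = seq + [PAD_ID] * (max_len - len(seq))
    return seq, attention_mask


# Placeholder IDs standing in for the tokenizer's "input_ids" output.
example_ids = [1000, 1001, 1002]
input_ids, attention_mask = build_sequence(example_ids, "arz")
```

Keeping the IDs as named constants avoids scattering the bare numbers through training code, and the attention mask lets a model ignore the padding positions.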

Usage

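from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tok = PreTrainedTokenizerFast.from_pretrained("AraFusion/arafusion-morphBPE")
# Encode text only; add_special_tokens=False keeps the tokenizer from adding special tokens itself.
ids = tok("مرحبا بالعالم", add_special_tokens=False)["input_ids"]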
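
Because encoding returns plain IDs, the mask and unconditional tokens from the special-token table have to be applied by the caller. The sketch below shows one common pattern for masked diffusion with classifier-free guidance: text tokens are replaced by the mask ID at some rate, and the dialect token is occasionally swapped for the unconditional ID. The helper `corrupt` and the parameters `mask_rate` and `cfg_drop_prob` are hypothetical; the source does not describe AraFusion's actual training procedure.

```python
import random

MASK_ID = 62    # <|reserved_special_token_0|>, MDLM diffusion mask token
UNCOND_ID = 63  # <|reserved_special_token_1|>, CFG unconditional token
DIALECT_IDS = {22, 24, 26}
# BOS, EOS, and padding are left untouched by the corruption step.
PROTECTED_IDS = {0, 1, 5, MASK_ID, UNCOND_ID}


def corrupt(ids, mask_rate, cfg_drop_prob, rng):
    """Hypothetical helper: mask text tokens and optionally drop the condition."""
    # Decide once per sequence whether the dialect condition is dropped.
    drop_condition = rng.random() < cfg_drop_prob
    out = []
    for tok_id in ids:
        if tok_id in DIALECT_IDS:
            out.append(UNCOND_ID if drop_condition else tok_id)
        elif tok_id in PROTECTED_IDS:
            out.append(tok_id)
        elif rng.random() < mask_rate:
            out.append(MASK_ID)
        else:
            out.append(tok_id)
    return out


# Placeholder sequence: BOS, Egyptian dialect token, three text IDs, EOS.
clean = [0, 26, 1000, 1001, 1002, 1]
noisy = corrupt(clean, mask_rate=0.5, cfg_drop_prob=0.1, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which is convenient when debugging a data pipeline.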