AraFusion MorphBPE Tokenizer (Fanar FaraTok-dediac 76K)
- Type: MorphBPE (Farasa morpheme-constrained BPE)
- Vocab size: 76,800
- Sequence length: 512
- Base: Fanar-FaraTok-dediac_76K_with_special_tokens
Special tokens used by AraFusion
| Token | ID | Purpose |
|---|---|---|
| `<\|begin_of_text\|>` | 0 | Begin of sequence (BOS) |
| `<\|text_end\|>` | 1 | End of sequence / document separator (EOS) |
| `<\|padding\|>` | 5 | Padding |
| `<\|dialect_arb\|>` | 22 | Modern Standard Arabic dialect condition |
| `<\|dialect_ars\|>` | 24 | Najdi / Saudi dialect condition |
| `<\|dialect_arz\|>` | 26 | Egyptian dialect condition |
| `<\|reserved_special_token_0\|>` | 62 | MDLM diffusion mask token |
| `<\|reserved_special_token_1\|>` | 63 | CFG unconditional token |
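To illustrate how these IDs might fit together, here is a minimal sketch that wraps already-tokenized text with BOS, a dialect token, and EOS, then pads to a fixed length. The IDs and the 512-token length come from this card; the layout `[BOS, dialect, text..., EOS, PAD...]`, the helper name `build_sequence`, and the placeholder text IDs are assumptions made for illustration, not a description of how AraFusion actually packs its inputs.

```python
# IDs copied from the special-token table in this card.
BOS_ID, EOS_ID, PAD_ID = 0, 1, 5
DIALECT_IDS = {"arb": 22, "ars": 24, "arz": 26}
MAX_LEN = 512  # sequence length stated in this card


def build_sequence(text_ids, dialect, max_len=MAX_LEN):
    """Hypothetical layout: [BOS, dialect, text..., EOS, PAD...]."""
    prefix = [BOS_ID, DIALECT_IDS[dialect]]
    room = max_len - len(prefix) - 1  # reserve one slot for EOS
    body = list(text_ids)[:room]      # truncate text that does not fit
    seq = prefix + body + [EOS_ID]
    n_pad = max_len - len(seq)
    attention_mask = [1] * len(seq) + [0] * n_pad
    return seq + [PAD_ID] * n_pad, attention_mask


# Placeholder text IDs (1000..1002) stand in for real tokenizer output.
seq, mask = build_sequence([1000, 1001, 1002], "arz", max_len=8)
print(seq)   # [0, 26, 1000, 1001, 1002, 1, 5, 5]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

If the real model expects the dialect token in a different position, only the `prefix` line would need to change.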
Usage
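```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tok = PreTrainedTokenizerFast.from_pretrained("AraFusion/arafusion-morphBPE")
# Encode without letting the tokenizer add special tokens automatically.
ids = tok("مرحبا بالعالم", add_special_tokens=False)["input_ids"]
```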
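Because `add_special_tokens=False` returns only the text IDs, any mask or conditioning tokens have to be applied by the caller. The sketch below shows one way the mask token (62) and the CFG unconditional token (63) could be used on a list of IDs. The function names, the choice to leave special tokens unmasked, the fixed mask ratio, and the placeholder text IDs are all assumptions; the card does not specify AraFusion's actual noising or guidance procedure.

```python
import random

MASK_ID = 62     # <|reserved_special_token_0|>: MDLM mask token per this card
UNCOND_ID = 63   # <|reserved_special_token_1|>: CFG unconditional token per this card
DIALECT_IDS = {22, 24, 26}
# Assumption: BOS, EOS, padding, and condition tokens are never masked.
PROTECTED = {0, 1, 5, UNCOND_ID} | DIALECT_IDS


def mask_tokens(ids, mask_ratio, rng):
    """Replace a random subset of ordinary tokens with MASK_ID."""
    candidates = [i for i, t in enumerate(ids) if t not in PROTECTED]
    k = round(mask_ratio * len(candidates))
    chosen = set(rng.sample(candidates, k))
    return [MASK_ID if i in chosen else t for i, t in enumerate(ids)]


def drop_condition(ids):
    """Swap any dialect token for UNCOND_ID (the unconditional branch)."""
    return [UNCOND_ID if t in DIALECT_IDS else t for t in ids]


rng = random.Random(0)
ids = [0, 26, 1000, 1001, 1002, 1003, 1, 5]  # placeholder text IDs
noisy = mask_tokens(ids, mask_ratio=0.5, rng=rng)
uncond = drop_condition(ids)
print(noisy)   # two of the four text positions become 62
print(uncond)  # [0, 63, 1000, 1001, 1002, 1003, 1, 5]
```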