---
library_name: transformers
tags: []
---

# PersianGemmaTokenizerFast

A Gemma tokenizer fine-tuned on Persian text, optimized to handle the nuances of the Persian language with improved efficiency and accuracy. This tokenizer is available on the Hugging Face Hub as [mshojaei77/PersianGemmaTokenizerFast](https://huggingface.co/mshojaei77/PersianGemmaTokenizerFast).

## Overview

The **PersianGemmaTokenizerFast** leverages the robust architecture of the original Gemma tokenizer and is fine-tuned on Persian data. It is designed to provide faster and more accurate tokenization for various Natural Language Processing (NLP) tasks involving Persian text.

## Features

- **Optimized for Persian:** Tailored tokenization for Persian language constructs.
- **Speed and Efficiency:** Built on fast tokenization libraries for quick processing.
- **Compatibility:** Works seamlessly with the Hugging Face Transformers library.

## Usage

Here is an example of how to use the tokenizer in your Python code:

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Example Persian text ("Hello, how are you?")
text = "سلام، حال شما چطور است؟"

# Tokenize the text
encoded = tokenizer(text)

# Print token IDs and tokens
print("Token IDs:", encoded["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```
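
The same tokenizer also handles batches and round-trip decoding through the standard Transformers API. A short sketch, assuming the checkpoint above is reachable from your environment:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Batch-encode several sentences at once; padding makes the sequences
# uniform in length so they can be stacked into a tensor.
sentences = ["سلام", "امروز هوا خوب است."]
batch = tokenizer(sentences, padding=True)
print("Batch input IDs:", batch["input_ids"])

# Round-trip check: decoding (minus special and padding tokens)
# should recover the original text.
decoded = tokenizer.decode(batch["input_ids"][0], skip_special_tokens=True)
print("Decoded:", decoded)
```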

## Comparing Performance on a Paragraph of Persian Text

The following image compares the output of PersianGemmaTokenizerFast on a paragraph of Persian text with that of other tokenizers; fewer tokens for the same text indicate more efficient tokenization:



## Contributing

Contributions to improve the tokenizer or its documentation are welcome! If you encounter any issues or have suggestions, please feel free to open an issue or submit a pull request.

## License

This project is licensed under the MIT License.