---
library_name: transformers
tags: []
---
# PersianGemmaTokenizerFast
A Gemma tokenizer fine-tuned on Persian text to handle the nuances of the Persian language with improved efficiency and accuracy. It is available on the Hugging Face Hub as [mshojaei77/PersianGemmaTokenizerFast](https://huggingface.co/mshojaei77/PersianGemmaTokenizerFast).
## Overview
**PersianGemmaTokenizerFast** builds on the original Gemma tokenizer and is fine-tuned on Persian data. It is designed to provide faster and more accurate tokenization for Natural Language Processing (NLP) tasks involving Persian text.
## Features
- **Optimized for Persian:** Tailored tokenization for Persian language constructs.
- **Speed and Efficiency:** Built on fast tokenization libraries for quick processing.
- **Compatibility:** Works seamlessly with the Hugging Face Transformers library.
## Usage
Here is an example of how to use the tokenizer in your Python code:
```python
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")
# Example Persian text
text = "سلام، حال شما چطور است؟"
# Tokenize the text
encoded = tokenizer(text)
# Print token IDs and tokens
print("Token IDs:", encoded["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```
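A common follow-up to the example above is to check that encoding and then decoding returns the original text. The helper below is a minimal sketch of that check, not part of this repository: `encode_decode` is a hypothetical name, and it assumes only that the object passed in behaves like a Transformers tokenizer (callable with `add_special_tokens`, and exposing `decode`).

```python
from typing import List, Tuple


def encode_decode(tokenizer, text: str) -> Tuple[List[int], str]:
    """Encode `text` to token IDs, then decode those IDs back to a string.

    `tokenizer` is assumed to follow the Hugging Face tokenizer interface.
    """
    # Skip special tokens (e.g. BOS/EOS) so the IDs cover only the text itself.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Decode the same IDs; skip_special_tokens guards against any that remain.
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded
```

With the `tokenizer` loaded as shown above, `ids, decoded = encode_decode(tokenizer, text)` lets you compare `decoded` with `text`. The two may differ slightly if the tokenizer normalizes whitespace or characters, so treat an exact match as something to verify rather than assume.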
## Comparing Performance on a Paragraph of Persian Text
The following image compares the number of tokens that PersianGemmaTokenizerFast and other tokenizers produce for the same paragraph of Persian text. Fewer tokens for the same text indicate a more compact encoding:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6556b1bb85d43542fa1a8f91/lZJKqsi4BZ8mJiY_I-vhA.png)
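A comparison like the one in the image can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not the script used to produce the chart: the helper names (`compare_token_counts`, `tokens_per_word`, `hub_tokenize_fn`) are hypothetical, and the word and character baselines are stand-ins so that the helpers can be tried without downloading anything.

```python
from typing import Callable, Dict, List


def compare_token_counts(
    text: str, tokenizers: Dict[str, Callable[[str], List]]
) -> Dict[str, int]:
    """Map each tokenizer name to the number of tokens it produces for `text`."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


def tokens_per_word(tokenize: Callable[[str], List], text: str) -> float:
    """Average number of tokens per whitespace-separated word (lower is more compact)."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def hub_tokenize_fn(repo_id: str) -> Callable[[str], List]:
    """Wrap a Hub tokenizer as a plain text -> tokens function (downloads files)."""
    # Imported lazily so the helpers above can be used without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return tokenizer.tokenize


# Offline baselines: whitespace words (lower bound) and characters (upper bound).
sample = "سلام، حال شما چطور است؟"
baselines = {"whitespace words": str.split, "characters": list}
for name, count in compare_token_counts(sample, baselines).items():
    print(f"{name}: {count} tokens")
```

To include this tokenizer in the comparison, add an entry such as `"PersianGemmaTokenizerFast": hub_tokenize_fn("mshojaei77/PersianGemmaTokenizerFast")` to the dictionary, along with any other Hub tokenizers you want to compare. Counts depend on the sample text, so a longer and more varied paragraph gives a fairer picture than a single sentence.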
## Contributing
Contributions that improve the tokenizer or its documentation are welcome! If you encounter any issues or have suggestions, please open an issue or submit a pull request.
## License
This project is licensed under the MIT License.