---
library_name: transformers
tags: []
---

# PersianGemmaTokenizerFast

A Gemma tokenizer fine-tuned on Persian text, optimized to handle the nuances of the Persian language with improved efficiency and accuracy. This tokenizer is available on the Hugging Face Hub as [mshojaei77/PersianGemmaTokenizerFast](https://huggingface.co/mshojaei77/PersianGemmaTokenizerFast).

## Overview

The **PersianGemmaTokenizerFast** leverages the robust architecture of the original Gemma tokenizer and is fine-tuned on Persian data. It is designed to provide faster and more accurate tokenization for various Natural Language Processing (NLP) tasks involving Persian text.

## Features

- **Optimized for Persian:** Tailored tokenization for Persian language constructs.
- **Speed and Efficiency:** Built on fast tokenization libraries for quick processing.
- **Compatibility:** Works seamlessly with the Hugging Face Transformers library.

## Usage

Here is an example of how to use the tokenizer in your Python code:

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Example Persian text ("Hello, how are you?")
text = "سلام، حال شما چطور است؟"

# Tokenize the text
encoded = tokenizer(text)

# Print token IDs and tokens
print("Token IDs:", encoded["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```
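
The same tokenizer also handles batches and round-trip decoding through the standard Transformers API. A short sketch, assuming the checkpoint above is reachable from your environment:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Batch-encode several sentences at once; padding makes the sequences
# uniform in length so they can be stacked into a tensor.
sentences = ["سلام", "امروز هوا خوب است."]
batch = tokenizer(sentences, padding=True)
print("Batch input IDs:", batch["input_ids"])

# Round-trip check: decoding (minus special and padding tokens)
# should recover the original text.
decoded = tokenizer.decode(batch["input_ids"][0], skip_special_tokens=True)
print("Decoded:", decoded)
```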

## Comparing Performance on a Paragraph of Persian Text

The following image compares the output of PersianGemmaTokenizerFast on a paragraph of Persian text with that of other tokenizers; fewer tokens for the same text indicate more efficient tokenization:



## Contributing

Contributions to improve the tokenizer or its documentation are welcome! If you encounter any issues or have suggestions, please feel free to open an issue or submit a pull request.

## License

This project is licensed under the MIT License.