---
language: kn
tags:
- kannada
- tokenizer
- bpe
- nlp
- huggingface
license: mit
datasets:
- Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation
---

# Kannada Tokenizer

This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language on the `translated_output` column of the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for Natural Language Processing (NLP) tasks involving Kannada text.

## Model Description

- **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
- **Language:** Kannada (`kn`)
- **Vocabulary Size:** 32,000
- **Special Tokens:**
  - `[UNK]` (Unknown token)
  - `[PAD]` (Padding token)
  - `[CLS]` (Classifier token)
  - `[SEP]` (Separator token)
  - `[MASK]` (Masking token)
- **License:** MIT License
- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
- **Algorithm:** Byte-Pair Encoding (BPE)

## Intended Use

This tokenizer is intended for NLP applications involving the Kannada language, such as:

- **Language Modeling**
- **Text Generation**
- **Text Classification**
- **Machine Translation**
- **Named Entity Recognition**
- **Question Answering**
- **Summarization**

## How to Use

You can load the tokenizer directly from the Hugging Face Hub:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(encoding)
decoded_text = tokenizer.decode(encoding)

print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
```

**Output:**

```
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
```

## Training Data

The tokenizer was trained on the `translated_output` column of the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains instructions and responses translated into Kannada.

- **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
- **Data Preprocessing:** Text was normalized with Unicode NFKC normalization to standardize characters.

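The effect of the NFKC step described above can be illustrated with Python's standard `unicodedata` module. This is a small sketch, not the author's preprocessing code; the helper name `normalize_nfkc` and the sample strings are assumptions chosen for illustration.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization to a string."""
    return unicodedata.normalize("NFKC", text)


# KA followed by the two-part spelling of vowel sign O
# (VOWEL SIGN E, U+0CC6, then VOWEL SIGN UU, U+0CC2).
two_part = "\u0c95\u0cc6\u0cc2"
# KA followed by the single precomposed VOWEL SIGN O (U+0CCA).
precomposed = "\u0c95\u0cca"

# Both spellings render alike; NFKC maps them to one code point sequence.
print(normalize_nfkc(two_part) == normalize_nfkc(precomposed))

# Compatibility characters such as fullwidth digits are folded as well.
print(normalize_nfkc("\uff11\uff12"))
```

Normalizing before training means that visually identical spellings of a syllable share the same vocabulary entries instead of competing for separate ones.
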
## Training Procedure

- **Normalization:** NFKC normalization (compatibility decomposition followed by canonical composition) was applied so that equivalent character sequences are represented consistently.
- **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for morphologically rich languages like Kannada.
- **Vocabulary Size:** Set to 32,000 to balance coverage and efficiency.
- **Special Tokens:** `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]` were included to support various downstream tasks.
- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.

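The steps above can be sketched with the Hugging Face Tokenizers library roughly as follows. This is a minimal sketch under stated assumptions, not the author's actual training script: the helper name `train_bpe_tokenizer`, the two-sentence in-memory corpus, and the reduced vocabulary size in the usage lines are illustrative only, and the choice of the library's `Whitespace` pre-tokenizer for "whitespace splitting" is a guess.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]


def train_bpe_tokenizer(texts, vocab_size=32000):
    """Train a BPE tokenizer with NFKC normalization (hypothetical helper)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.NFKC()
    # Assumption: the card's "whitespace splitting" maps to this pre-tokenizer.
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Illustrative toy corpus; the real tokenizer was trained on the
# `translated_output` column of the dataset.
corpus = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ."]
toy = train_bpe_tokenizer(corpus, vocab_size=100)
print(toy.encode("ನೀವು ಹೇಗಿದ್ದೀರಿ?").tokens)
```

Listing the special tokens in the trainer reserves the first vocabulary ids for them, so they are never split or merged during training.
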
## Evaluation

The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative evaluation metrics such as tokenization efficiency or perplexity were not computed.

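One common way to quantify tokenization efficiency is fertility, the average number of tokens produced per whitespace-delimited word. The sketch below shows how such a number could be computed; it is not a result reported for this tokenizer. The function names `fertility` and `chunk_tokenize` are hypothetical, and `chunk_tokenize` is only a stand-in so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Return the average number of tokens per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def chunk_tokenize(text: str, size: int = 3) -> List[str]:
    """Stand-in tokenizer: cut each word into chunks of at most `size` characters."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


# "abcdef" -> 2 chunks, "gh" -> 1 chunk: 3 tokens over 2 words.
print(fertility(["abcdef gh"], chunk_tokenize))
```

To measure the real tokenizer, pass `tokenizer.tokenize` from the loaded `PreTrainedTokenizerFast` in place of `chunk_tokenize`. Lower values mean fewer tokens per word.
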
## Limitations

- **Vocabulary Coverage:** Although the tokenizer is trained on a diverse dataset, it may not cover all words or phrases in Kannada, especially rare or domain-specific terms.
- **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- **Out-of-Vocabulary Words:** Words not in the vocabulary are broken into subword tokens, and pieces that cannot be represented may be mapped to the `[UNK]` token, which could affect performance in downstream tasks.

## Ethical Considerations

- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.

## Recommendations

- **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
- **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.

## Acknowledgments

- **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
- **Libraries:**
  - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
  - [Hugging Face Transformers](https://github.com/huggingface/transformers)

## License

This tokenizer is released under the [MIT License](LICENSE).

## Citation

If you use this tokenizer in your research or applications, please consider citing it:

```bibtex
@misc{kannada_tokenizer_2023,
  title={Kannada Tokenizer},
  author={charanhu},
  year={2023},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}
```