@@ -1,10 +1,24 @@
 # Kannada Tokenizer
 [![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
 This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
-## Model Details
 - **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
 - **Language:** Kannada (`kn`)
@@ -15,33 +29,23 @@ This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannad
   - `[CLS]` (Classifier token)
   - `[SEP]` (Separator token)
   - `[MASK]` (Masking token)
-## Training Data
-The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
-- **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
-- **Data Preprocessing:** Text normalization was applied using NFKC normalization to standardize characters.
-## Training Procedure
-- **Normalization:** NFKC normalization was used to handle canonical decomposition and compatibility decomposition, ensuring that characters are represented consistently.
-- **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
-- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
-- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
 ## Intended Use
 This tokenizer is intended for NLP applications involving the Kannada language, such as:
-- Language Modeling
-- Text Classification
-- Machine Translation
-- Named Entity Recognition
-- Question Answering
-- Summarization
-## Usage
 You can load the tokenizer directly from the Hugging Face Hub:
@@ -69,21 +73,42 @@ Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
 Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
 ```
 ## Limitations
-- **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada.
 - **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
-- **OOV Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token.
 ## Recommendations
 - **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
 - **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
-## License
-[MIT License](LICENSE)
 ## Acknowledgments
 - **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
@@ -91,6 +116,10 @@ Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
   - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
   - [Hugging Face Transformers](https://github.com/huggingface/transformers)
 ## Citation
 If you use this tokenizer in your research or applications, please consider citing it:
@@ -100,10 +129,7 @@ If you use this tokenizer in your research or applications, please consider citi
   title={Kannada Tokenizer},
   author={charanhu},
   year={2023},
   howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
 }
 ```
-## Contact Information
-For questions or comments about the tokenizer, please contact [charanhu](https://huggingface.co/charanhu).

+---
+language: kn
+tags:
+  - kannada
+  - tokenizer
+  - bpe
+  - nlp
+  - huggingface
+license: mit
+datasets:
+  - Cognitive-Lab/Kannada-Instruct-dataset
+pipeline_tag: text-generation
+---
 # Kannada Tokenizer
 [![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
 This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
+## Model Description
 - **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
 - **Language:** Kannada (`kn`)
   - `[CLS]` (Classifier token)
   - `[SEP]` (Separator token)
   - `[MASK]` (Masking token)
+- **License:** MIT License
+- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
+- **Algorithm:** Byte-Pair Encoding (BPE)
 ## Intended Use
 This tokenizer is intended for NLP applications involving the Kannada language, such as:
+- **Language Modeling**
+- **Text Generation**
+- **Text Classification**
+- **Machine Translation**
+- **Named Entity Recognition**
+- **Question Answering**
+- **Summarization**
+## How to Use
 You can load the tokenizer directly from the Hugging Face Hub:
 Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
 ```
+## Training Data
+The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
+- **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
+- **Data Preprocessing:** Text was normalized with Unicode NFKC to standardize character representations.
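As a rough illustration of what NFKC does to input text, the sketch below uses Python's standard `unicodedata` module rather than the tokenizer's own normalizer. The helper name `normalize_nfkc` and the sample strings are assumptions made for illustration and are not taken from the training script.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (hypothetical helper, for illustration only)."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters fold to their plain forms:
# fullwidth digits become ASCII digits.
fullwidth_year = "\uff12\uff10\uff12\uff13"
print(normalize_nfkc(fullwidth_year))

# Canonically equivalent sequences collapse to a single representation:
# KANNADA VOWEL SIGN E (U+0CC6) + KANNADA LENGTH MARK (U+0CD5)
# compose into KANNADA VOWEL SIGN EE (U+0CC7).
decomposed = "\u0c95\u0cc6\u0cd5"
composed = "\u0c95\u0cc7"
print(normalize_nfkc(decomposed) == composed)
```

Without this step, two spellings of the same syllable that render identically could end up as different token sequences.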
+## Training Procedure
+- **Normalization:** NFKC normalization (compatibility decomposition followed by canonical composition) was applied so that equivalent character sequences are represented consistently.
+- **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
+- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
+- **Vocabulary Size:** Set to 32,000 to balance coverage and efficiency.
+- **Special Tokens:** Included `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]` to support various downstream tasks.
+- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
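The steps listed above could be wired together roughly as follows with the Hugging Face Tokenizers library. This is a minimal sketch, not the actual training script: the function name `train_bpe_sketch` is hypothetical, and the choice of the `Whitespace` pre-tokenizer (which also separates punctuation) over `WhitespaceSplit` is an assumption, since the card only says "whitespace splitting".

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Special tokens as listed in this card.
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]


def train_bpe_sketch(texts, vocab_size=32000):
    """Train a BPE tokenizer on an iterable of strings (hypothetical helper)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.normalizer = NFKC()
    # Assumption: Whitespace splits on whitespace and isolates punctuation.
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

In practice `texts` would be the `translated_output` column of the dataset; the resulting tokenizer can be written to disk with `tokenizer.save("tokenizer.json")`.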
+## Evaluation
+The tokenizer was evaluated qualitatively on a set of Kannada sentences to check that it produces reasonable tokenizations. No quantitative metrics, such as tokenization efficiency (tokens per word) or the perplexity of a downstream language model, were computed.
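One simple way to quantify tokenization efficiency is the average number of tokens per whitespace-separated word, sometimes called fertility. The sketch below is an illustration only: the function name `tokens_per_word` and the idea of passing the tokenizer as a plain callable are assumptions, and no such measurement has been reported for this tokenizer.

```python
from typing import Callable, Iterable, List


def tokens_per_word(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words
```

A value close to 1.0 means most words map to a single token; higher values mean words are being split into more pieces. Any function that maps a string to a list of tokens can be passed as `tokenize`.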
 ## Limitations
+- **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
 - **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
+- **Out-of-Vocabulary Words:** Words not in the vocabulary are split into subword tokens, or mapped to the `[UNK]` token when no subword covers them, which can hurt performance in downstream tasks.
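To get a feel for how often the `[UNK]` fallback is hit on a particular corpus, the share of unknown tokens can be measured directly. The helper below is a hypothetical sketch that works on token lists that have already been produced; `unk_rate` is not part of any library.

```python
from typing import Iterable, List


def unk_rate(token_lists: Iterable[List[str]], unk_token: str = "[UNK]") -> float:
    """Fraction of all tokens that are the unknown token."""
    total = 0
    unknown = 0
    for tokens in token_lists:
        total += len(tokens)
        unknown += sum(1 for token in tokens if token == unk_token)
    # An empty input has no unknown tokens by definition.
    return unknown / total if total else 0.0
```

A noticeably non-zero rate on domain-specific text suggests the vocabulary does not cover that domain well.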
+## Ethical Considerations
+- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
+- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.
 ## Recommendations
 - **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
 - **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
 ## Acknowledgments
 - **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
   - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
   - [Hugging Face Transformers](https://github.com/huggingface/transformers)
+## License
+This tokenizer is released under the [MIT License](LICENSE).
 ## Citation
 If you use this tokenizer in your research or applications, please consider citing it:
   title={Kannada Tokenizer},
   author={charanhu},
   year={2023},
+  publisher={Hugging Face},
   howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
 }
 ```