DS4H-ICTU/yat-bert-tokenizer

pt4c committed on Oct 9, 2024

Commit f720cce (verified) · 1 Parent(s): 74d11c9

Upload README.md with huggingface_hub

README.md +56 -58

@@ -1,58 +1,56 @@
-    # Yambeta Tokenizer for NLP tasks
-    ## Model Description
-    This tokenizer was developed for Yambeta, a Bantu language from Cameroon. The tokenizer is based on the WordPiece model architecture and has been fine-tuned to handle the unique phonetic and diacritical features of the Yambeta language.
-    - **Developed by**: DS4H-ICTU Research Group
-    - **Language(s)**: Yambeta (Bantu language from Cameroon)
-    - **License**: Apache 2.0 (or specify if different)
-    - **Model Type**: Tokenizer (WordPiece)
-    ## Model Sources
-    - **Repository**: [Your repository URL]
-    - **Paper**: [Link to related paper if available]
-    - **Demo**: [Optional: link to demo]
-    ## Uses
-    - **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Yambeta language.
-    - **Downstream Use**: Can be used as a foundation for models processing Yambeta text.
-    ## Bias, Risks, and Limitations
-    - **Biases**: The tokenizer might not perfectly capture linguistic nuances due to the limited size of the Yambeta corpus.
-    - **Out-of-Scope Use**: The tokenizer may not perform well for non-Yambeta languages.
-    ## Training Details
-    - **Training Data**: Extracted from Yambeta Bible text corpus (final_dataset.xlsx).
-    - **Training Procedure**: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
-    - **Training Hyperparameters**:
-      - Vocabulary Size: 25,000
-      - Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
-    ## Evaluation
-    - **OOV Rate**: 0.36%
-    - **Tokenization Efficiency**: Average tokens per sentence: 23.25
-    - **Special Character Handling**: Successfully handles diacritics and tone markers in Yambeta.
-    ## Environmental Impact
-    - **Hardware Type**: Google Colab GPU
-    - **Hours Used**: 4 hours (training time)
-    - **Cloud Provider**: Google Cloud
-    - **Carbon Emitted**: Estimated using [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator
-    ## Citation
-    If you use this tokenizer in your work, please cite it using the following format:
-    ```
-    @misc{yambeta_tokenizer,
-      title = {Yambeta Tokenizer},
-      author = {Dr.-Ing. Philippe Tamla},
-      year = {2024},
-      publisher = {Hugging Face},
-      url = {https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer}
-    }
-    ```
-    ## Contact Information
-    For more information, contact the developers at: philiptamla@gmail.com

+# Yambeta Tokenizer for NLP tasks
+## Model Description
+This tokenizer was developed for Yambeta, a Bantu language from Cameroon. The tokenizer is based on the WordPiece model architecture and has been fine-tuned to handle the unique phonetic and diacritical features of the Yambeta language.
+- **Developed by**: DS4H-ICTU Research Group
+- **Language(s)**: Yambeta (Bantu language from Cameroon)
+- **License**: Apache 2.0 (or specify if different)
+- **Model Type**: Tokenizer (WordPiece)
+## Model Sources
+- **Repository**: [Your repository URL]
+- **Paper**: [Link to related paper if available]
+- **Demo**: [Optional: link to demo]
+## Uses
+- **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Yambeta language.
+- **Downstream Use**: Can be used as a foundation for models processing Yambeta text.
+## Bias, Risks, and Limitations
+- **Biases**: The tokenizer might not perfectly capture linguistic nuances due to the limited size of the Yambeta corpus.
+- **Out-of-Scope Use**: The tokenizer may not perform well for non-Yambeta languages.
+## Training Details
+- **Training Data**: Extracted from Yambeta Bible text corpus (final_dataset.xlsx).
+- **Training Procedure**: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
+- **Training Hyperparameters**:
+  - Vocabulary Size: 25,000
+  - Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
+## Evaluation
+- **OOV Rate**: 0.36%
+- **Tokenization Efficiency**: Average tokens per sentence: 23.25
+- **Special Character Handling**: Successfully handles diacritics and tone markers in Yambeta.
+## Environmental Impact
+- **Hardware Type**: Google Colab GPU
+- **Hours Used**: 4 hours (training time)
+- **Cloud Provider**: Google Cloud
+- **Carbon Emitted**: Estimated using [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator
+## Citation
+If you use this tokenizer in your work, please cite it using the following format:
+```
+@misc{yambeta_tokenizer,
+  title = {Yambeta Tokenizer},
+  author = {Dr.-Ing. Philippe Tamla},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer}
+}
+```
+## Contact Information
+For more information, contact the developers at: philiptamla@gmail.com
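
The README's Evaluation section reports an OOV rate and an average number of tokens per sentence but does not say how they were computed. The following is a minimal sketch of one plausible way to obtain such numbers, not the authors' actual evaluation script. It assumes that "OOV rate" means the fraction of produced tokens that are `[UNK]`, uses a plain greedy longest-match WordPiece split, and runs on a hypothetical toy vocabulary (`VOCAB`) and toy sentences (`SENTENCES`) that are ASCII placeholders, not real Yambeta data.

```python
from typing import Dict, List, Set

UNK = "[UNK]"  # unknown-token symbol listed among the README's special tokens


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first WordPiece split of a single word.

    Non-initial pieces carry the conventional "##" prefix. If any part of
    the word cannot be matched, the whole word maps to [UNK].
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]
        pieces.append(match)
        start = end
    return pieces


def evaluate(sentences: List[str], vocab: Set[str]) -> Dict[str, float]:
    """Return the [UNK] share of all tokens and the mean tokens per sentence."""
    total_tokens = 0
    unk_tokens = 0
    for sentence in sentences:
        # Whitespace pre-tokenization is an assumption of this sketch.
        tokens = [p for w in sentence.split() for p in wordpiece(w, vocab)]
        total_tokens += len(tokens)
        unk_tokens += tokens.count(UNK)
    return {
        "oov_rate": unk_tokens / total_tokens if total_tokens else 0.0,
        "avg_tokens_per_sentence": (
            total_tokens / len(sentences) if sentences else 0.0
        ),
    }


# Hypothetical placeholder data, chosen only to exercise the code paths.
VOCAB = {"ab", "##c", "##d", "e"}
SENTENCES = ["abc e", "abd xyz"]

metrics = evaluate(SENTENCES, VOCAB)
print(metrics)
```

To reproduce the reported figures one would replace the toy split with the published tokenizer. If the repository follows the usual `transformers` layout, `AutoTokenizer.from_pretrained("DS4H-ICTU/yat-bert-tokenizer")` would presumably load it, though the README itself does not confirm this.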