FredMike23/tokenizer-Bafia


FredMike23 committed on May 11, 2025

Commit 13f4358 (verified)

Parent: 1e714d1

Upload README.md with huggingface_hub

Files changed (1)

README.md +56 -0

README.md ADDED

	@@ -0,0 +1,56 @@

+# Bafia Tokenizer for NLP Tasks
+## Model Description
+This tokenizer was developed for Bafia (ISO 639-3: ksf), a Bantu language spoken in Cameroon. It is based on the WordPiece model and was trained to handle the phonetic and diacritical features of Bafia.
+- **Developed by**: DS4H-ICTU Research Group in Cooperation with the
+- **Language(s)**: Bafia (ISO 639-3: ksf), spoken in Cameroon
+- **License**: Apache 2.0
+- **Model Type**: Tokenizer (WordPiece)
+## Model Sources
+- **Repository**: [Your repository URL]
+- **Paper**: [Link to related paper if available]
+- **Demo**: [Optional: link to demo]
+## Uses
+- **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Bafia language.
+- **Downstream Use**: It can serve as the tokenization layer for models that process Bafia text.
+## Bias, Risks, and Limitations
+- **Biases**: The tokenizer might not fully capture linguistic nuances because the Bafia training corpus is small.
+- **Out-of-Scope Use**: The tokenizer may not perform well on languages other than Bafia.
+## Training Details
+- **Training Data**: Extracted from a Bafia Bible text corpus (`bafia_DATASET.xlsx`).
+- **Training Procedure**: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
+- **Training Hyperparameters**:
+  - Vocabulary Size: 19076
+  - Special Tokens: "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"
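The training procedure above names three stages: diacritic normalization, WordPiece tokenization, and special-token post-processing. The following is a minimal pure-Python sketch of how such a pipeline could behave at inference time. It is not the authors' code. The toy vocabulary, the choice of Unicode NFC as the normalization form, and the BERT-style `[CLS] ... [SEP]` template are all assumptions made for illustration; the card does not specify them.

```python
import unicodedata

# Hypothetical toy vocabulary for illustration only; the real tokenizer
# reports 19076 entries learned from the training corpus.
TOY_VOCAB = {"[UNK]", "[CLS]", "[SEP]", "token", "##izer", "##s", "b\u00e9"}


def normalize(text):
    """Compose base letters and combining marks into single code points.

    NFC is an assumption; the card only says diacritics were normalized.
    """
    return unicodedata.normalize("NFC", text)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Normalize, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in normalize(text).split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens
```

With this toy vocabulary, `encode("be\u0301 tokenizers xyz", TOY_VOCAB)` composes the combining acute accent into `é`, splits `tokenizers` into `token`, `##izer`, `##s`, and maps `xyz` to `[UNK]`.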
+## Evaluation
+- **OOV Rate**: 0.00%
+- **Tokenization Efficiency**: Average tokens per sentence: 27.59
+- **Special Character Handling**: Successfully handles diacritics and tone markers in Bafia.
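The card does not say how the OOV rate and the tokens-per-sentence average were computed. One common way to obtain such figures, sketched here under that assumption, is to count `[UNK]` tokens as a share of all tokens and to divide the total token count by the number of sentences. The function names and sample data are hypothetical.

```python
def oov_rate(tokenized_sentences, unk="[UNK]"):
    """Percentage of tokens that are the unknown token."""
    total = sum(len(sentence) for sentence in tokenized_sentences)
    if total == 0:
        return 0.0
    unknown = sum(
        1 for sentence in tokenized_sentences for tok in sentence if tok == unk
    )
    return 100.0 * unknown / total


def average_tokens_per_sentence(tokenized_sentences):
    """Mean number of tokens per sentence."""
    if not tokenized_sentences:
        return 0.0
    total = sum(len(sentence) for sentence in tokenized_sentences)
    return total / len(tokenized_sentences)


# Hypothetical sample: two already-tokenized sentences.
sample = [["a", "b", "[UNK]", "c"], ["d", "e"]]
print(f"OOV rate: {oov_rate(sample):.2f}%")
print(f"Average tokens per sentence: {average_tokens_per_sentence(sample):.2f}")
```

An OOV rate measured on the same text the vocabulary was trained on tends to be close to zero, so a held-out sample gives a more informative figure.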
+## Environmental Impact
+- **Hardware Type**: Google Colab GPU
+- **Hours Used**: 4 hours (training time)
+- **Cloud Provider**: Google Cloud
+- **Carbon Emitted**: Can be estimated using the calculator from [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700)
+## Citation
+If you use this tokenizer in your work, please cite it using the following format:
+```
+@misc{bafia_tokenizer,
+  title = {Bafia Tokenizer},
+  author = {Ing. Zingui Fred Mike},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/FredMike23/tokenizer-Bafia}
+}
+```
+## Contact Information
+For more information, contact the developers at: philiptamla@gmail.com