FredMike23/tokenizer-Fulfulde_adamaoua

FredMike23 committed on May 11, 2025

Commit 4a1a12f (verified)

1 Parent(s): ae3ce33

Upload README.md with huggingface_hub

Files changed (1)

README.md +56 -0

README.md ADDED

	@@ -0,0 +1,56 @@

+# Fulfulde Tokenizer for NLP tasks
+## Model Description
+This tokenizer was developed for Fulfulde, a language of the Fula[ni] family spoken in Cameroon. It is based on the WordPiece model architecture and has been fine-tuned to handle the phonetic and diacritical features of Fulfulde.
+- **Developed by**: DS4H-ICTU Research Group in Cooperation with the
+- **Language(s)**: Fulfulde Adamaoua (Fula[ni] language from Cameroon)
+- **License**: Apache 2.0 (or specify if different)
+- **Model Type**: Tokenizer (WordPiece)
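To illustrate what a WordPiece tokenizer does at encoding time, here is a minimal sketch of greedy longest-match-first segmentation for a single word. The function name, the `##` continuation prefix, and the toy vocabulary are assumptions chosen for illustration; they are not taken from the released tokenizer files.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece units by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the released 25,000-entry one.
toy_vocab = {"jam", "##ɓe", "mi", "##na", "[UNK]"}

print(wordpiece_tokenize("jamɓe", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, the first call splits the word into a leading piece and a `##`-prefixed continuation, and the second falls back to `[UNK]` because no piece matches.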
+## Model Sources
+- **Repository**: [Your repository URL]
+- **Paper**: [Link to related paper if available]
+- **Demo**: [Optional: link to demo]
+## Uses
+- **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Fulfulde Adamaoua language.
+- **Downstream Use**: Can be used as a foundation for models processing Fulfulde Adamaoua text.
+## Bias, Risks, and Limitations
+- **Biases**: The tokenizer might not perfectly capture linguistic nuances due to the limited size of the Fulfulde corpus.
+- **Out-of-Scope Use**: The tokenizer may not perform well for non-Fulfulde languages.
+## Training Details
+- **Training Data**: Extracted from the YambeFulfulde Adamaoua Bible text corpus (final_dataset.xlsx).
+- **Training Procedure**: Text was preprocessed by normalizing diacritics, tokenized with WordPiece, and post-processed to handle special tokens.
+- **Training Hyperparameters**:
+  - Vocabulary Size: 25,000
+  - Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
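The diacritic normalization step mentioned in the training procedure could look like the sketch below. The README does not say which Unicode normalization form was used, so NFC is an assumption here, and `normalize_text` is a hypothetical helper name.

```python
import unicodedata


def normalize_text(text):
    """Compose base letters and combining marks into single code points.

    NFC is an assumption; the original preprocessing form is not documented.
    """
    return unicodedata.normalize("NFC", text)


# 'e' followed by a combining acute accent (two code points).
decomposed = "e\u0301"
composed = normalize_text(decomposed)
print(len(decomposed), len(composed))

# Hooked and special letters used in Fulfulde orthography have no canonical
# decomposition, so NFC leaves them unchanged.
hooked = "ɓɗŋƴ"
print(normalize_text(hooked) == hooked)
```

Normalizing before training and before encoding keeps visually identical strings from being mapped to different vocabulary entries.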
+## Evaluation
+- **OOV Rate**: 0.00%
+- **Tokenization Efficiency**: Average tokens per sentence: 26.88
+- **Special Character Handling**: Successfully handles diacritics and tone markers in Fulfulde Adamaoua.
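The README does not define how the two figures above were computed. One plausible reading, sketched here as an assumption, treats the OOV rate as the percentage of output tokens equal to `[UNK]` and the efficiency figure as the mean number of tokens per sentence. The `evaluate` function and the sample data are hypothetical.

```python
def evaluate(tokenized_sentences, unk_token="[UNK]"):
    """Return (OOV rate in percent, average tokens per sentence)."""
    total_tokens = sum(len(sentence) for sentence in tokenized_sentences)
    unk_tokens = sum(sentence.count(unk_token) for sentence in tokenized_sentences)
    # Guard against empty input to avoid division by zero.
    oov_rate = 100.0 * unk_tokens / total_tokens if total_tokens else 0.0
    avg_tokens = total_tokens / len(tokenized_sentences) if tokenized_sentences else 0.0
    return oov_rate, avg_tokens


# Hypothetical tokenizer output for two sentences.
sample = [
    ["a", "b", "c"],
    ["d", "[UNK]", "e", "f", "g"],
]

oov_rate, avg_tokens = evaluate(sample)
print(f"OOV rate: {oov_rate:.2f}%  avg tokens/sentence: {avg_tokens:.2f}")
```

An OOV rate measured on the training corpus itself will tend to be close to zero, so held-out text gives a more informative figure.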
+## Environmental Impact
+- **Hardware Type**: Google Colab GPU
+- **Hours Used**: 4 hours (training time)
+- **Cloud Provider**: Google Cloud
+- **Carbon Emitted**: Estimated using the [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator
+## Citation
+If you use this tokenizer in your work, please cite it using the following format:
+```
+@misc{fulfulde_tokenizer,
+  title = {Fulfulde Adamaoua Tokenizer},
+  author = {Ing. Zingui Fred Mike},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/FredMike23/tokenizer-Fulfulde_adamaoua}
+}
+```
+## Contact Information
+For more information, contact the developers at: mikezingui@yahoo.com