Bafia Tokenizer for NLP Tasks
Model Description
This tokenizer was developed for Bafia (ISO 639-3: ksf), a language spoken in Cameroon. It is based on the WordPiece model and was trained to handle the phonetic and diacritical features of the Bafia language.
- Developed by: DS4H-ICTU Research Group in Cooperation with the
- Language(s): Bafia (ISO 639-3: ksf), spoken in Cameroon
- License: Apache 2.0
- Model Type: Tokenizer (WordPiece)
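For readers unfamiliar with WordPiece, the following is a minimal sketch of how a WordPiece vocabulary is typically applied to a single word: greedy longest-match-first, with continuation pieces marked by a `##` prefix. It is an illustration of the general technique, not this tokenizer's actual implementation. The function name `wordpiece_tokenize`, the `##` prefix convention, and the toy vocabulary are assumptions made for the example and are not taken from the released vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT taken from the real 19076-entry vocabulary.
toy_vocab = {"tok", "##en", "##ize", "##r", "[UNK]"}
print(wordpiece_tokenize("tokenizer", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A larger vocabulary lets more words match as a single piece, which is one reason vocabulary size affects the average number of tokens per sentence.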
Model Sources
- Repository: https://huggingface.co/FredMike23/tokenizer-Bafia
- Paper: [Link to related paper if available]
- Demo: [Optional: link to demo]
Uses
- Direct Use: This tokenizer is designed for NLP tasks on Bafia text, such as Named Entity Recognition (NER), translation, and text generation.
- Downstream Use: It can serve as the tokenization layer for models that process Bafia text.
Bias, Risks, and Limitations
- Biases: The tokenizer may miss some linguistic nuances because the Bafia training corpus is small.
- Out-of-Scope Use: The tokenizer is not expected to perform well on languages other than Bafia.
Training Details
- Training Data: Extracted from a Bafia Bible text corpus (bafia_DATASET.xlsx).
- Training Procedure: The text was preprocessed by normalizing diacritics, tokenized with WordPiece, and post-processed to handle special tokens.
- Training Hyperparameters:
  - Vocabulary Size: 19076
  - Special Tokens: "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"
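The preprocessing and post-processing steps described above could look roughly like the sketch below. The card does not say which normalization form was used, so Unicode NFC (composing base letters and combining marks while keeping tone information) is an assumption here. The BERT-style `[CLS] ... [SEP]` layout with `[PAD]` padding is likewise an assumption, and the helper names `normalize_diacritics` and `add_special_tokens` are hypothetical.

```python
import unicodedata

# Special tokens as listed in this card.
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"]


def normalize_diacritics(text):
    """Compose base letters and combining marks into canonical NFC form.

    Assumption: NFC is used so that visually identical strings share one
    encoding. Diacritics and tone marks are kept, not stripped.
    """
    return unicodedata.normalize("NFC", text)


def add_special_tokens(tokens, max_length):
    """Wrap tokens as [CLS] ... [SEP], then truncate or pad to max_length.

    Assumption: a BERT-style layout. max_length must be at least 2.
    """
    body = tokens[: max_length - 2]  # leave room for [CLS] and [SEP]
    wrapped = ["[CLS]"] + body + ["[SEP]"]
    return wrapped + ["[PAD]"] * (max_length - len(wrapped))


decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed = normalize_diacritics(decomposed)
print(len(decomposed), len(composed))
print(add_special_tokens(["a", "##b"], max_length=6))
```

Normalizing before training and before inference keeps the same word from being split differently depending on how its accents were typed.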
Evaluation
- OOV Rate: 0.00%
- Tokenization Efficiency: Average of 27.59 tokens per sentence
- Special Character Handling: Successfully handles diacritics and tone markers in Bafia.
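The card does not state how these figures were computed. One plausible reading, sketched below, treats the OOV rate as the percentage of produced tokens that equal `[UNK]` and the efficiency figure as the mean token count per sentence. Both definitions, the function names, and the sample data are assumptions for illustration only.

```python
def oov_rate(tokenized_sentences, unk_token="[UNK]"):
    """Percentage of all tokens that are the unknown token (assumed definition)."""
    total = sum(len(sentence) for sentence in tokenized_sentences)
    if total == 0:
        return 0.0
    unknown = sum(sentence.count(unk_token) for sentence in tokenized_sentences)
    return 100.0 * unknown / total


def average_tokens_per_sentence(tokenized_sentences):
    """Mean number of tokens per sentence."""
    if not tokenized_sentences:
        return 0.0
    total = sum(len(sentence) for sentence in tokenized_sentences)
    return total / len(tokenized_sentences)


# Hypothetical tokenizer output for two sentences, not real Bafia data.
sample = [["a", "##b", "[UNK]"], ["c"]]
print(f"OOV rate: {oov_rate(sample):.2f}%")
print(f"Average tokens per sentence: {average_tokens_per_sentence(sample):.2f}")
```

An OOV rate measured on the training corpus itself will tend to be optimistic, so a held-out set of Bafia sentences would give a more informative number.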
Environmental Impact
- Hardware Type: Google Colab GPU
- Hours Used: 4 hours (training time)
- Cloud Provider: Google Cloud
- Carbon Emitted: Estimated using Lacoste et al. (2019) calculator
Citation
If you use this tokenizer in your work, please cite it using the following format:
@misc{bafia_tokenizer,
title = {bafia Tokenizer},
author = {Ing. Zingui Fred Mike},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/FredMike23/tokenizer-Bafia}
}
Contact Information
For more information, contact the developers at: philiptamla@gmail.com