Yambeta Tokenizer for NLP tasks

Model Description

This tokenizer was developed for Yambeta, a Bantu language spoken in Cameroon. It is based on the WordPiece algorithm and was built to handle the phonetic and diacritical features of Yambeta.

Developed by: DS4H-ICTU Research Group in Cooperation with the
Language(s): Yambeta (Bantu language from Cameroon)
License: Apache 2.0
Model Type: Tokenizer (WordPiece)
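
To make the WordPiece model type concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece tokenizers commonly apply to each word. The function name `wordpiece_tokenize`, the `##` continuation prefix, and the toy English vocabulary are assumptions chosen for illustration; none of them is taken from this tokenizer's actual files.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str],
                       unk_token: str = "[UNK]",
                       prefix: str = "##") -> List[str]:
    """Split one word into sub-word pieces by greedy longest match.

    Illustrative only: the vocabulary and prefix are hypothetical.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical, not the Yambeta vocabulary).
toy_vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

In this sketch, a word that cannot be covered by vocabulary pieces collapses to a single `[UNK]` token rather than being partially split.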

Model Sources

Repository: https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer

Uses

Direct Use: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Yambeta language.
Downstream Use: Can be used as a foundation for models processing Yambeta text.

Bias, Risks, and Limitations

Biases: The training corpus is small and drawn from Bible text, so the tokenizer may not capture the full range of linguistic nuance in Yambeta.
Out-of-Scope Use: The tokenizer is not intended for languages other than Yambeta and may perform poorly on them.

Training Details

Training Data: Extracted from a Yambeta Bible text corpus (final_dataset.xlsx).
Training Procedure: The text was preprocessed by normalizing diacritics, tokenized with WordPiece, and post-processed to handle special tokens.
Training Hyperparameters:
- Vocabulary Size: 25,000
- Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
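
The card does not say how diacritics were normalized during preprocessing. One common approach, shown here as a sketch under that assumption, is to put every string into a single Unicode normalization form so that a base letter plus a combining tone mark and its precomposed equivalent map to the same sequence. The helper name `normalize_diacritics` and the choice of NFC are hypothetical, not the authors' confirmed procedure.

```python
import unicodedata


def normalize_diacritics(text: str, form: str = "NFC") -> str:
    """Apply one Unicode normalization form and collapse whitespace.

    Assumption: NFC is used; the original pipeline may differ.
    """
    normalized = unicodedata.normalize(form, text)
    # Collapse runs of whitespace so spacing differences do not
    # produce different inputs for the tokenizer.
    return " ".join(normalized.split())


decomposed = "e\u0301"  # 'e' followed by a combining acute accent
composed = "\u00e9"     # precomposed 'e with acute'

# The two spellings differ as raw strings but agree after normalization.
print(decomposed == composed)
print(normalize_diacritics(decomposed) == normalize_diacritics(composed))
```

Normalizing before training and before inference keeps visually identical words from being split into different token sequences.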

Evaluation

OOV Rate: 0.36%
Tokenization Efficiency: 23.25 tokens per sentence on average
Special Character Handling: Successfully handles diacritics and tone markers in Yambeta.
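
The card does not define how these figures were computed. As a sketch, assuming the OOV rate is the share of produced tokens equal to `[UNK]` and the efficiency figure is total tokens divided by the number of sentences, the two metrics could be calculated as follows. The function `evaluate_tokenizer`, the stand-in `toy_tokenize`, and the sample sentences are hypothetical and serve only to make the example runnable.

```python
from typing import Callable, Dict, Iterable, List


def evaluate_tokenizer(sentences: Iterable[str],
                       tokenize: Callable[[str], List[str]],
                       unk_token: str = "[UNK]") -> Dict[str, float]:
    """Return the OOV rate (percent) and average tokens per sentence."""
    total_tokens = 0
    unk_tokens = 0
    num_sentences = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        total_tokens += len(tokens)
        unk_tokens += sum(1 for tok in tokens if tok == unk_token)
        num_sentences += 1
    if num_sentences == 0 or total_tokens == 0:
        raise ValueError("need at least one non-empty sentence")
    return {
        "oov_rate_percent": 100.0 * unk_tokens / total_tokens,
        "avg_tokens_per_sentence": total_tokens / num_sentences,
    }


# Stand-in tokenizer (hypothetical): whitespace split with a tiny vocabulary.
TOY_VOCAB = {"a", "b", "c"}


def toy_tokenize(sentence: str) -> List[str]:
    return [w if w in TOY_VOCAB else "[UNK]" for w in sentence.split()]


metrics = evaluate_tokenizer(["a b c", "a x"], toy_tokenize)
print(metrics)
```

To evaluate a real tokenizer, the `tokenize` argument would be replaced by that tokenizer's own tokenization function, ideally on held-out text rather than the training corpus.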

Environmental Impact

Hardware Type: Google Colab GPU
Hours Used: 4 hours (training time)
Cloud Provider: Google Cloud
Carbon Emitted: No figure reported; emissions can be estimated with the calculator presented in Lacoste et al. (2019).

Citation

If you use this tokenizer in your work, please cite it using the following format:

@misc{yambeta_tokenizer,
  title = {Yambeta Tokenizer},
  author = {Dr.-Ing. Philippe Tamla},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer}
}

Contact Information

For more information, contact the developers at: philiptamla@gmail.com

References

Lacoste et al. (2019). Quantifying the Carbon Emissions of Machine Learning. arXiv:1910.09700, published Oct 21, 2019.