	# Bafia Tokenizer for NLP Tasks

	## Model Description
	This tokenizer was developed for Bafia (ISO 639-3: ksf), a Bantu language spoken in Cameroon. It is based on the WordPiece model and was trained to handle the phonetic and diacritical features of the Bafia language.

	- Developed by: DS4H-ICTU Research Group in Cooperation with the
	- Language(s): Bafia (ISO 639-3: ksf), spoken in Cameroon
	- License: Apache 2.0
	- Model Type: Tokenizer (WordPiece)

	## Model Sources
	- Repository: https://huggingface.co/FredMike23/tokenizer-Bafia
	- Paper: [Link to related paper if available]
	- Demo: [Optional: link to demo]

	## Uses
	- Direct Use: This tokenizer is designed for NLP tasks in the Bafia language, such as Named Entity Recognition (NER), translation, and text generation.
	- Downstream Use: It can serve as the tokenization layer for models that process Bafia text.
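A minimal loading sketch for the uses listed above. It assumes the repository ships files that `transformers.AutoTokenizer` can load, which this card does not confirm. The helper names `load_tokenizer` and `tokenize_text` are hypothetical, and the `transformers` package must be installed separately.

```python
REPO_ID = "FredMike23/tokenizer-Bafia"  # repository named in the Citation section


def load_tokenizer(repo_id: str = REPO_ID):
    """Download and return the tokenizer (assumes an AutoTokenizer-compatible repo)."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def tokenize_text(text: str, repo_id: str = REPO_ID):
    """Split a Bafia sentence into subword tokens using the downloaded tokenizer."""
    tokenizer = load_tokenizer(repo_id)
    return tokenizer.tokenize(text)
```

Calling `tokenize_text("...")` with a Bafia sentence would return its list of subword tokens, provided the assumption about the repository contents holds.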

	## Bias, Risks, and Limitations
	- Biases: The tokenizer might not capture all linguistic nuances because the Bafia training corpus is small.
	- Out-of-Scope Use: The tokenizer may not perform well on languages other than Bafia.

	## Training Details
	- Training Data: Extracted from a Bafia Bible text corpus (bafia_DATASET.xlsx).
	- Training Procedure: The text was preprocessed by normalizing diacritics, tokenized with WordPiece, and post-processed to handle special tokens.
	- Training Hyperparameters:
	  - Vocabulary Size: 19076
	  - Special Tokens: "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"
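The normalization and WordPiece steps above can be illustrated with a small standard-library sketch. It assumes Unicode NFC normalization for diacritics and the usual greedy longest-match-first WordPiece lookup. The toy vocabulary (not real Bafia entries) and the helper names `normalize` and `wordpiece` are hypothetical and are not taken from the actual training code.

```python
import unicodedata

# Hypothetical toy vocabulary; the real tokenizer has 19076 entries.
TOY_VOCAB = {"[UNK]", "ki", "##ta", "kita", "##b"}


def normalize(text: str) -> str:
    """Compose base letters and combining marks (NFC) so each accented
    character has one canonical form. NFC is an assumed choice."""
    return unicodedata.normalize("NFC", text)


def wordpiece(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


print(wordpiece(normalize("kitab"), TOY_VOCAB))  # ['kita', '##b']
```

The longest match wins at each position, so `kita` is preferred over `ki` followed by `##ta`.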

	## Evaluation
	- OOV Rate: 0.00%
	- Tokenization Efficiency: Average of 27.59 tokens per sentence
	- Special Character Handling: Successfully handles diacritics and tone markers in Bafia.
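The card does not state how these figures were computed. One plausible reading, sketched below, treats the OOV rate as the percentage of produced tokens equal to `[UNK]` and the efficiency as the mean token count per sentence. The function names and the sample data are hypothetical.

```python
def oov_rate(token_lists, unk="[UNK]"):
    """Percentage of all tokens that are the unknown token (assumed definition)."""
    total = sum(len(tokens) for tokens in token_lists)
    unknown = sum(tok == unk for tokens in token_lists for tok in tokens)
    return 100.0 * unknown / total if total else 0.0


def avg_tokens_per_sentence(token_lists):
    """Mean number of tokens per tokenized sentence."""
    if not token_lists:
        return 0.0
    return sum(len(tokens) for tokens in token_lists) / len(token_lists)


# Hypothetical tokenized sentences, for illustration only.
sample = [["kita", "##b", "[UNK]"], ["ki"]]
print(oov_rate(sample))                 # 25.0
print(avg_tokens_per_sentence(sample))  # 2.0
```

Under this definition, an OOV rate of 0.00% means no token in the evaluated text fell back to `[UNK]`.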

	## Environmental Impact
	- Hardware Type: Google Colab GPU
	- Hours Used: 4 hours (training time)
	- Cloud Provider: Google Cloud
	- Carbon Emitted: Estimated using [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator

	## Citation
	If you use this tokenizer in your work, please cite it using the following format:

	```
	@misc{bafia_tokenizer,
	  title = {Bafia Tokenizer},
	  author = {Ing. Zingui Fred Mike},
	  year = {2024},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/FredMike23/tokenizer-Bafia}
	}
	```

	## Contact Information
	For more information, contact the developers at: philiptamla@gmail.com