
Fulfulde Tokenizer for NLP tasks

Model Description

This tokenizer was developed for Fulfulde Adamaoua, a variety of Fulfulde, a language of the Fula (Fulani) family spoken in Cameroon. The tokenizer is based on the WordPiece model and was trained to handle the distinctive phonetic and diacritical features of the language.
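As a rough illustration of the WordPiece scheme mentioned above, the sketch below implements greedy longest-match-first segmentation over a tiny toy vocabulary. The vocabulary and the example word are illustrative stand-ins, not entries from this tokenizer's real 25,000-item vocabulary.

```python
# Minimal sketch of WordPiece greedy longest-match tokenization.
# The vocabulary below is a tiny illustrative stand-in, NOT the
# tokenizer's real vocabulary.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces,
    left to right. Continuation pieces carry the '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word is unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary using Fulfulde-style letters (implosive ɓ, ɗ).
vocab = {"ɓ", "ɓii", "##ɗo", "jam", "##na"}
print(wordpiece_tokenize("ɓiiɗo", vocab))  # ['ɓii', '##ɗo']
print(wordpiece_tokenize("xyz", vocab))    # ['[UNK]']
```

In a real WordPiece tokenizer the same greedy matching runs over every whitespace-split word, which is why a sufficiently rich vocabulary can drive the [UNK] rate to zero.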

  • Developed by: DS4H-ICTU Research Group, in cooperation with [partner institution]
  • Language(s): Fulfulde Adamaoua (a Fula/Fulani language of Cameroon)
  • License: Apache 2.0
  • Model Type: Tokenizer (WordPiece)

Model Sources

  • Repository: [Your repository URL]
  • Paper: [Link to related paper if available]
  • Demo: [Optional: link to demo]

Uses

  • Direct Use: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Fulfulde Adamaoua language.
  • Downstream Use: Can be used as a foundation for models processing Fulfulde Adamaoua text.
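To make the downstream-use bullet concrete, here is a hedged sketch of how a model built on this tokenizer would typically consume its output: wrap each token sequence in [CLS]/[SEP], map tokens to ids, and pad the batch with [PAD]. The vocabulary, ids, and helper names are illustrative, not part of the published tokenizer's API.

```python
# Sketch of how a downstream model typically consumes tokenizer output.
# The vocabulary and id assignments here are illustrative only.

SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(tokens):
    """Assign integer ids, reserving the lowest ids for special tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode_batch(sentences, vocab):
    """sentences: list of token lists -> equal-width padded id matrix."""
    wrapped = [["[CLS]"] + s + ["[SEP]"] for s in sentences]
    width = max(len(s) for s in wrapped)
    pad, unk = vocab["[PAD]"], vocab["[UNK]"]
    return [[vocab.get(t, unk) for t in s] + [pad] * (width - len(s))
            for s in wrapped]

vocab = build_vocab(["jam", "##na", "waali"])
batch = encode_batch([["jam", "##na"], ["waali"]], vocab)
# each row starts with the [CLS] id, ends with [SEP], then [PAD] fill
```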

Bias, Risks, and Limitations

  • Biases: The tokenizer might not perfectly capture linguistic nuances due to the limited size of the Fulfulde corpus.
  • Out-of-Scope Use: The tokenizer may not perform well for non-Fulfulde languages.

Training Details

  • Training Data: Extracted from YambeFulfulde Adamaoua Bible text corpus (final_dataset.xlsx).
  • Training Procedure: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
  • Training Hyperparameters:
    • Vocabulary Size: 25,000
    • Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
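The diacritic-normalization step in the training procedure can be sketched with Unicode normalization, under which a base letter plus a combining mark collapses into a single precomposed code point. NFC is an assumption here; the card does not state which normalization form was used.

```python
# Sketch of diacritic normalization via Unicode NFC. (NFC is an
# assumption; the card does not specify the normalization form.)
import unicodedata

def normalize_text(text: str) -> str:
    return unicodedata.normalize("NFC", text)

decomposed = "n" + "\u0303" + "a"   # 'n' + combining tilde + 'a'
composed = normalize_text(decomposed)
# 'n' + combining tilde collapses to the single code point 'ñ'
```

Normalizing before training keeps visually identical words from splitting into distinct vocabulary entries depending on how their diacritics were encoded.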

Evaluation

  • OOV Rate: 0.00%
  • Tokenization Efficiency: ≈26.88 tokens per sentence on average
  • Special Character Handling: Successfully handles diacritics and tone markers in Fulfulde Adamaoua.
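As a hedged sketch of how the two figures above could be computed: the OOV rate is the share of [UNK] tokens in the tokenized corpus, and tokenization efficiency is the mean token count per sentence. The tiny corpus below is illustrative; the real evaluation ran over the training corpus.

```python
# Hedged sketch of the evaluation metrics: OOV rate = share of [UNK]
# tokens, efficiency = mean tokens per sentence. Corpus is illustrative.

def evaluate(tokenized_sentences, unk_token="[UNK]"):
    total = sum(len(s) for s in tokenized_sentences)
    unks = sum(s.count(unk_token) for s in tokenized_sentences)
    oov_rate = unks / total if total else 0.0
    avg_tokens = total / len(tokenized_sentences)
    return oov_rate, avg_tokens

corpus = [["jam", "##na"], ["waali", "jam", "##na", "e"]]
oov, avg = evaluate(corpus)
print(f"OOV rate: {oov:.2%}, avg tokens/sentence: {avg:.2f}")
# OOV rate: 0.00%, avg tokens/sentence: 3.00
```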

Environmental Impact

  • Hardware Type: Google Colab GPU
  • Hours Used: 4 hours (training time)
  • Cloud Provider: Google Cloud
  • Carbon Emitted: Estimated using Lacoste et al. (2019) calculator

Citation

If you use this tokenizer in your work, please cite it using the following format:

@misc{fulfulde_tokenizer,
  title = {Fulfulde Adamaoua Tokenizer},
  author = {Ing. Zingui Fred Mike},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/FredMike23/tokenizer-Fulfulde_adamaoua}
}

Contact Information

For more information, contact the developers at: mikezingui@yahoo.com
