FredMike23 committed on
Commit 4a1a12f · verified · 1 Parent(s): ae3ce33

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +56 -0
README.md ADDED
@@ -0,0 +1,56 @@
1
+ # Fulfulde Tokenizer for NLP tasks
2
+
3
+ ## Model Description
4
+ This tokenizer was developed for Fulfulde, a language of the Fula[ni] family spoken in Cameroon. It is based on the WordPiece model and was trained to handle the phonetic and diacritical features of Fulfulde.
5
+
6
+ - **Developed by**: DS4H-ICTU Research Group in Cooperation with the
7
+ - **Language(s)**: Fulfulde Adamaoua (Fula[ni] language from Cameroon)
8
+ - **License**: Apache 2.0 (or specify if different)
9
+ - **Model Type**: Tokenizer (WordPiece)
10
+
11
+ ## Model Sources
12
+ - **Repository**: [Your repository URL]
13
+ - **Paper**: [Link to related paper if available]
14
+ - **Demo**: [Optional: link to demo]
15
+
16
+ ## Uses
17
+ - **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Fulfulde Adamaoua language.
18
+ - **Downstream Use**: Can be used as a foundation for models processing Fulfulde Adamaoua text.
19
+
20
+ ## Bias, Risks, and Limitations
21
+ - **Biases**: The training corpus is small and drawn from Bible text, so the tokenizer may not capture all linguistic nuances of Fulfulde, especially in other domains.
22
+ - **Out-of-Scope Use**: The tokenizer may not perform well for non-Fulfulde languages.
23
+
24
+ ## Training Details
25
+ - **Training Data**: Extracted from YambeFulfulde Adamaoua Bible text corpus (final_dataset.xlsx).
26
+ - **Training Procedure**: The text was preprocessed by normalizing diacritics, tokenized with WordPiece, and post-processed to handle special tokens.
27
+ - **Training Hyperparameters**:
28
+   - Vocabulary Size: 25,000
29
+   - Special Tokens: `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]`
30
+
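The training procedure above names three steps: diacritic normalization, WordPiece tokenization, and special-token handling. The following is a minimal, standard-library sketch of how those steps fit together at encoding time. It is not the code used to build this tokenizer: the choice of Unicode NFC normalization, the `##` continuation prefix, the function names, and the tiny `TOY_VOCAB` are all assumptions made for illustration.

```python
import unicodedata

# Special tokens listed in this README.
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Hypothetical toy vocabulary; the real one has 25,000 entries.
TOY_VOCAB = {"mi", "yah", "##ii", "luumo"}


def normalize(text):
    """Compose base letters and combining marks (NFC is an assumption)."""
    return unicodedata.normalize("NFC", text)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def encode(sentence, vocab):
    """Normalize, split on whitespace, apply WordPiece, add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in normalize(sentence).split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


print(encode("mi yahii luumo", TOY_VOCAB))
```

A word that cannot be fully covered by vocabulary pieces maps to a single `[UNK]`, which is the quantity an OOV rate counts.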
31
+ ## Evaluation
32
+ - **OOV Rate**: 0.00%
33
+ - **Tokenization Efficiency**: 26.88 tokens per sentence on average
34
+ - **Special Character Handling**: Successfully handles diacritics and tone markers in Fulfulde Adamaoua.
35
+
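The README does not say how the OOV rate and the tokens-per-sentence figure were computed. One plausible reading, sketched below under that assumption, is that the OOV rate is the share of output tokens equal to `[UNK]` and the efficiency figure is the mean token count per sentence. The `evaluate` function, the `toy_tokenize` stand-in, and the sample sentences are hypothetical.

```python
def evaluate(sentences, tokenize, unk="[UNK]"):
    """Return OOV rate and mean tokens per sentence for a tokenize callable."""
    total_tokens = 0
    unknown_tokens = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        total_tokens += len(tokens)
        unknown_tokens += sum(1 for token in tokens if token == unk)
    return {
        "oov_rate": unknown_tokens / total_tokens if total_tokens else 0.0,
        "avg_tokens_per_sentence": (
            total_tokens / len(sentences) if sentences else 0.0
        ),
    }


# Hypothetical stand-in tokenizer: whitespace split with a tiny known-word set.
KNOWN_WORDS = {"mi", "yahii"}


def toy_tokenize(sentence):
    return [w if w in KNOWN_WORDS else "[UNK]" for w in sentence.split()]


metrics = evaluate(["mi yahii", "mi yahii luumo"], toy_tokenize)
print(f"OOV rate: {metrics['oov_rate']:.2%}")
print(f"Average tokens per sentence: {metrics['avg_tokens_per_sentence']:.2f}")
```

To apply this to the real tokenizer, pass its own tokenize function in place of `toy_tokenize`. Running it on sentences held out from training gives a more informative OOV figure than running it on the training text.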
36
+ ## Environmental Impact
37
+ - **Hardware Type**: Google Colab GPU
38
+ - **Hours Used**: 4 hours (training time)
39
+ - **Cloud Provider**: Google Cloud
40
+ - **Carbon Emitted**: Estimated using [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator
41
+
42
+ ## Citation
43
+ If you use this tokenizer in your work, please cite it using the following format:
44
+
45
+ ```bibtex
46
+ @misc{fulfulde_tokenizer,
47
+ title = {Fulfulde Adamaoua Tokenizer},
48
+ author = {Ing. Zingui Fred Mike},
49
+ year = {2024},
50
+ publisher = {Hugging Face},
51
+ url = {https://huggingface.co/FredMike23/tokenizer-Fulfulde_adamaoua}
52
+ }
53
+ ```
54
+
55
+ ## Contact Information
56
+ For more information, contact the developers at: mikezingui@yahoo.com