FredMike23 committed on
Commit 13f4358 · verified · 1 Parent(s): 1e714d1

Upload README.md with huggingface_hub

  1. README.md +56 -0
README.md ADDED
@@ -0,0 +1,56 @@
+ # Bafia Tokenizer for NLP Tasks
+
+ ## Model Description
+ This tokenizer was developed for Bafia (ISO 639-3: ksf), a Bantu language spoken in Cameroon. It is based on the WordPiece model and has been fine-tuned to handle the phonetic and diacritical features of Bafia.
+
+ - **Developed by**: DS4H-ICTU Research Group in cooperation with the
+ - **Language(s)**: Bafia (ISO 639-3: ksf), spoken in Cameroon
+ - **License**: Apache 2.0 (or specify if different)
+ - **Model Type**: Tokenizer (WordPiece)
+
+ ## Model Sources
+ - **Repository**: [Your repository URL]
+ - **Paper**: [Link to related paper if available]
+ - **Demo**: [Optional: link to demo]
+
+ ## Uses
+ - **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Bafia language.
+ - **Downstream Use**: Can be used as a foundation for models processing Bafia text.
+
+ ## Bias, Risks, and Limitations
+ - **Biases**: The tokenizer may not fully capture linguistic nuances because the Bafia training corpus is small and drawn from a single source (Bible text).
+ - **Out-of-Scope Use**: The tokenizer is not expected to perform well on languages other than Bafia.
+
+ ## Training Details
+ - **Training Data**: Extracted from a Bafia Bible text corpus (`bafia_DATASET.xlsx`).
+ - **Training Procedure**: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
+ - **Training Hyperparameters**:
+   - Vocabulary Size: 19076
+   - Special Tokens: "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"
+
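The training procedure above names three steps: diacritic normalization, WordPiece tokenization, and special-token post-processing. The following is a minimal sketch of how they could fit together, assuming NFC Unicode normalization and the standard greedy longest-match-first WordPiece lookup; the README specifies neither. `TOY_VOCAB`, `normalize`, `wordpiece`, and `encode` are hypothetical names, and the vocabulary entries are made-up strings, not real Bafia vocabulary.

```python
import unicodedata
from typing import List, Set

# Hypothetical toy vocabulary (made-up strings); the real one has 19076 entries.
TOY_VOCAB: Set[str] = {"[UNK]", "[CLS]", "[SEP]", "b\u00e9", "##lo", "ma", "##n\u00e9"}


def normalize(text: str) -> str:
    """Assumed normalization: compose base letters and combining marks (NFC)."""
    return unicodedata.normalize("NFC", text)


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word greedily, always taking the longest piece found in vocab."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


def encode(text: str, vocab: Set[str]) -> List[str]:
    """Normalize, split on whitespace, apply WordPiece, then wrap in [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in normalize(text).split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# "e" + combining acute (U+0301) is composed to U+00E9 before the vocabulary lookup.
print(encode("be\u0301lo man\u00e9 xyz", TOY_VOCAB))
```

In this sketch, skipping the normalization step would leave the first word in its decomposed form, which matches no vocabulary entry and so falls back to `[UNK]`.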
+ ## Evaluation
+ - **OOV Rate**: 0.00%
+ - **Tokenization Efficiency**: Average of 27.59 tokens per sentence
+ - **Special Character Handling**: Handles diacritics and tone markers in Bafia.
+
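The README does not say how the two figures above were computed. The following is a minimal sketch, assuming the OOV rate is the percentage of `[UNK]` tokens among all produced tokens and the efficiency figure is the mean token count per sentence. `tokenizer_stats` and the sample data are hypothetical.

```python
from typing import List, Tuple


def tokenizer_stats(tokenized: List[List[str]], unk: str = "[UNK]") -> Tuple[float, float]:
    """Return (OOV rate in percent, average tokens per sentence).

    Assumption: OOV rate = share of [UNK] tokens among all tokens.
    """
    total = sum(len(sentence) for sentence in tokenized)
    if total == 0:
        return 0.0, 0.0  # empty input: nothing to measure
    unk_count = sum(token == unk for sentence in tokenized for token in sentence)
    oov_rate = 100.0 * unk_count / total
    avg_tokens = total / len(tokenized)
    return oov_rate, avg_tokens


# Made-up tokenized sentences: 6 tokens in total, one of them [UNK].
sample = [["ma", "##n\u00e9"], ["b\u00e9", "##lo", "[UNK]", "ma"]]
oov_rate, avg_tokens = tokenizer_stats(sample)
print(f"OOV rate: {oov_rate:.2f}%  average tokens per sentence: {avg_tokens:.2f}")
```

Under this definition, an OOV rate of 0.00% means that no token in the evaluated text mapped to `[UNK]`.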
+ ## Environmental Impact
+ - **Hardware Type**: Google Colab GPU
+ - **Hours Used**: 4 hours (training time)
+ - **Cloud Provider**: Google Cloud
+ - **Carbon Emitted**: Estimated using the [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator
+
+ ## Citation
+ If you use this tokenizer in your work, please cite it using the following format:
+
+ ```
+ @misc{bafia_tokenizer,
+ title = {Bafia Tokenizer},
+ author = {Ing. Zingui Fred Mike},
+ year = {2024},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/FredMike23/tokenizer-Bafia}
+ }
+ ```
+
+ ## Contact Information
+ For more information, contact the developers at: philiptamla@gmail.com