pt4c committed
Commit f720cce · verified · 1 Parent(s): 74d11c9

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +56 -58
README.md CHANGED
# Yambeta Tokenizer for NLP tasks

## Model Description
This tokenizer was developed for Yambeta, a Bantu language spoken in Cameroon. It is based on the WordPiece algorithm and was trained to handle the language's distinctive phonetic and diacritical features.

- **Developed by**: DS4H-ICTU Research Group
- **Language(s)**: Yambeta (Bantu language from Cameroon)
- **License**: Apache 2.0
- **Model Type**: Tokenizer (WordPiece)

## Model Sources
- **Repository**: [DS4H-ICTU/yat-bert-tokenizer](https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer)
- **Paper**: [Link to related paper if available]
- **Demo**: [Optional: link to demo]

## Uses
- **Direct Use**: This tokenizer is designed for NLP tasks such as named entity recognition (NER), machine translation, and text generation in Yambeta.
- **Downstream Use**: It can serve as the tokenization layer for models that process Yambeta text.

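To illustrate how a WordPiece tokenizer of this kind segments a word at inference time, here is a minimal sketch of the greedy longest-match-first lookup WordPiece uses; the toy vocabulary is illustrative, not real Yambeta subwords:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            # continuation pieces carry the "##" prefix
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:  # no vocabulary piece covers this position
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# toy vocabulary (illustrative pieces, not actual Yambeta subwords)
vocab = {"yam", "##be", "##ta"}
print(wordpiece_tokenize("yambeta", vocab))  # ['yam', '##be', '##ta']
```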
## Bias, Risks, and Limitations
- **Biases**: The tokenizer may not capture every linguistic nuance, since the available Yambeta corpus is small.
- **Out-of-Scope Use**: The tokenizer is not expected to perform well on languages other than Yambeta.

## Training Details
- **Training Data**: Extracted from a Yambeta Bible text corpus (final_dataset.xlsx).
- **Training Procedure**: Preprocessing involved normalizing diacritics, training a WordPiece model, and post-processing to handle special tokens.
- **Training Hyperparameters**:
  - Vocabulary Size: 25,000
  - Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]

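The training procedure above can be sketched with the Hugging Face `tokenizers` library; the two-sentence corpus below is a stand-in for the actual Bible corpus, and the normalizer/pre-tokenizer choices are assumptions consistent with the description:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model with the special tokens listed above
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()               # normalize diacritic encodings
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation

trainer = trainers.WordPieceTrainer(
    vocab_size=25_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)

corpus = ["stand-in sentence one", "stand-in sentence two"]  # real data: final_dataset.xlsx
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer.encode("stand-in sentence one").tokens)
```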
## Evaluation
- **OOV Rate**: 0.36%
- **Tokenization Efficiency**: 23.25 tokens per sentence on average
- **Special Character Handling**: Correctly handles Yambeta diacritics and tone markers.

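Figures like the OOV rate and average tokens per sentence come from simple counting over a tokenized corpus; a minimal sketch (function and variable names are illustrative):

```python
def tokenizer_stats(tokenized_sentences, unk_token="[UNK]"):
    """OOV rate (share of [UNK] tokens) and average tokens per sentence."""
    total = sum(len(tokens) for tokens in tokenized_sentences)
    unk = sum(tokens.count(unk_token) for tokens in tokenized_sentences)
    return {
        "oov_rate": unk / total if total else 0.0,
        "avg_tokens_per_sentence": total / len(tokenized_sentences) if tokenized_sentences else 0.0,
    }

stats = tokenizer_stats([["a", "b", "[UNK]"], ["c"]])
print(stats)  # {'oov_rate': 0.25, 'avg_tokens_per_sentence': 2.0}
```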
## Environmental Impact
- **Hardware Type**: Google Colab GPU
- **Hours Used**: 4 (training time)
- **Cloud Provider**: Google Cloud
- **Carbon Emitted**: Estimated with the [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator

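Such calculators estimate emissions as hardware power draw × run time × datacenter overhead (PUE) × grid carbon intensity; a sketch with illustrative numbers (the power draw, PUE, and intensity below are assumptions, not measured values for this run):

```python
def estimate_co2_kg(power_kw, hours, intensity_kg_per_kwh, pue=1.1):
    """Energy in kWh, scaled by datacenter overhead, times grid carbon intensity."""
    return power_kw * hours * pue * intensity_kg_per_kwh

# e.g. a ~0.25 kW GPU for 4 h on a ~0.4 kg CO2/kWh grid (illustrative values)
print(round(estimate_co2_kg(0.25, 4, 0.4), 2))  # 0.44
```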
## Citation
If you use this tokenizer in your work, please cite it as follows:

```bibtex
@misc{yambeta_tokenizer,
  title     = {Yambeta Tokenizer},
  author    = {Dr.-Ing. Philippe Tamla},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer}
}
```

## Contact Information
For more information, contact the developers at philiptamla@gmail.com.