Upload README.md with huggingface_hub
README.md
CHANGED
@@ -3,7 +3,7 @@
 ## Model Description
 This tokenizer was developed for Yambeta, a Bantu language from Cameroon. The tokenizer is based on the WordPiece model architecture and has been fine-tuned to handle the unique phonetic and diacritical features of the Yambeta language.
 
-- **Developed by**: DS4H-ICTU Research Group
+- **Developed by**: DS4H-ICTU Research Group in Cooperation with the
 - **Language(s)**: Yambeta (Bantu language from Cameroon)
 - **License**: Apache 2.0 (or specify if different)
 - **Model Type**: Tokenizer (WordPiece)
@@ -24,7 +24,7 @@ This tokenizer was developed for Yambeta, a Bantu language from Cameroon. The to
 ## Training Details
 - **Training Data**: Extracted from Yambeta Bible text corpus (final_dataset.xlsx).
 - **Training Procedure**: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
-- **Training Hyperparameters**:
+- **Training Hyperparameters**:
 - Vocabulary Size: 25,000
 - Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
 
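The training procedure named in the README (diacritic normalization, WordPiece tokenization, special-token post-processing) can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: it assumes that "normalization of diacritics" means Unicode NFC composition (the README does not say), it uses the usual greedy longest-match-first WordPiece lookup with a `##` continuation prefix, and `TOY_VOCAB` is a tiny invented vocabulary rather than real entries from the 25,000-piece vocabulary.

```python
import unicodedata

# Special tokens as listed in the README.
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Hypothetical toy vocabulary for illustration only; not real tokenizer entries.
TOY_VOCAB = set(SPECIAL_TOKENS) | {"k\u00e8", "##la", "ba", "##na"}


def normalize(text):
    """Assumed normalization: NFC, so base letter + combining mark compose where possible."""
    return unicodedata.normalize("NFC", text)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Normalize, split on whitespace, apply WordPiece, then wrap in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in normalize(text).split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


if __name__ == "__main__":
    # "e" followed by a combining grave accent composes to a single code point under NFC.
    print(encode("ke\u0300la bana xyz", TOY_VOCAB))
```

Composing before lookup matters because a decomposed sequence such as `e` plus a combining grave accent would otherwise never match a vocabulary piece stored in precomposed form.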