eligapris committed on
Commit b01aef0 · verified · 1 Parent(s): 94f675f

Update README.md

Files changed (1)
  1. README.md +93 -69
README.md CHANGED
@@ -1,10 +1,16 @@
- # Kirundi Tokenizer and LoRA Model

  ## Model Description

- This repository contains two main components:
- 1. A BPE tokenizer trained specifically for the Kirundi language (ISO code: run)
- 2. A LoRA adapter trained for Kirundi language processing

  ### Tokenizer Details
  - **Type**: BPE (Byte-Pair Encoding)
@@ -12,19 +18,11 @@ This repository contains two main components:
  - **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
  - **Pre-tokenization**: Whitespace-based

- ### LoRA Adapter Details
- - **Base Model**: [To be filled with your chosen base model]
- - **Rank**: 8
- - **Alpha**: 32
- - **Target Modules**: Query and Value attention matrices
- - **Dropout**: 0.05
-
  ## Intended Uses & Limitations

  ### Intended Uses
  - Text processing for Kirundi language
- - Machine translation tasks involving Kirundi
- - Natural language understanding tasks for Kirundi content
  - Foundation for developing Kirundi language applications

  ### Limitations
@@ -34,103 +32,129 @@ This repository contains two main components:
  ## Training Data

- The model components were trained on the Kirundi-English parallel corpus:
  - **Dataset**: eligapris/kirundi-english
  - **Size**: 21.4k sentence pairs
  - **Nature**: Parallel corpus with Kirundi and English translations
  - **Domain**: Mixed domain including religious, general, and conversational text

- ## Training Procedure

- ### Tokenizer Training
- - Trained using Hugging Face's Tokenizers library
- - BPE algorithm with a vocabulary size of 30k
- - Includes special tokens for task-specific usage
- - Trained on the Kirundi portion of the parallel corpus

- ### LoRA Training
- [To be filled with your specific training details]
- - Number of epochs:
- - Batch size:
- - Learning rate:
- - Training hardware:
- - Training time:

- ## Evaluation Results

- [To be filled with your evaluation metrics]
- - Coverage statistics:
- - Out-of-vocabulary rate:
- - Task-specific metrics:

- ## Environmental Impact

- [To be filled with training compute details]
- - Estimated CO2 emissions:
- - Hardware used:
- - Training duration:

- ## Technical Specifications

- ### Model Architecture
- - Tokenizer: BPE-based with custom vocabulary
- - LoRA Configuration:
-   - r=8 (rank)
-   - α=32 (scaling)
-   - Trained on specific attention layers
-   - Dropout rate: 0.05

- ### Software Requirements
  ```python
- dependencies = {
-     "transformers": ">=4.30.0",
-     "tokenizers": ">=0.13.0",
-     "peft": ">=0.4.0"
- }
  ```

- ## How to Use

- ### Loading the Tokenizer
  ```python
- from transformers import PreTrainedTokenizerFast
-
- tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
  ```

- ### Loading the LoRA Model
  ```python
- from peft import PeftModel, PeftConfig
- from transformers import AutoModelForSequenceClassification
-
- config = PeftConfig.from_pretrained("path_to_lora_model")
- model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
- model = PeftModel.from_pretrained(model, "path_to_lora_model")
  ```

- ## Citation

- [To be filled with your preferred citation format]

- ## License

- [Specify your chosen license]

  ## Contact

- [Your contact information or preferred method of contact]

  ---

  ## Updates and Versions

  - v1.0.0 (Initial Release)
- - Base tokenizer and LoRA model
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

  ## Acknowledgments

  - Dataset provided by eligapris
- - Hugging Face's Transformers and Tokenizers libraries
- - PEFT library for LoRA implementation
 
+ ---
+ license: mit
+ datasets:
+ - eligapris/kirundi-english
+ language:
+ - rn
+ library_name: transformers
+ ---
+ # eligapris/rn-tokenizer

  ## Model Description

+ This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO code: run).

  ### Tokenizer Details
  - **Type**: BPE (Byte-Pair Encoding)
  - **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
  - **Pre-tokenization**: Whitespace-based
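
+ A quick way to sanity-check these settings after loading (a minimal sketch; it assumes the tokenizer is loaded from this repository as shown under Installation below):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
+ print(tokenizer.special_tokens_map)  # mapping for [UNK], [CLS], [SEP], [PAD], [MASK]
+ print(tokenizer.vocab_size)          # size of the learned BPE vocabulary
+ ```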

  ## Intended Uses & Limitations

  ### Intended Uses
  - Text processing for Kirundi language
+ - Pre-processing for NLP tasks involving Kirundi
  - Foundation for developing Kirundi language applications

  ### Limitations

  ## Training Data

+ The tokenizer was trained on the Kirundi-English parallel corpus:
  - **Dataset**: eligapris/kirundi-english
  - **Size**: 21.4k sentence pairs
  - **Nature**: Parallel corpus with Kirundi and English translations
  - **Domain**: Mixed domain including religious, general, and conversational text
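
+ If you want to inspect or reuse this corpus, it can be loaded from the Hub (a hedged sketch: it assumes the `datasets` library is installed and uses the dataset's default configuration, so check split and column names before relying on them):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the Kirundi-English parallel corpus this tokenizer was trained on
+ corpus = load_dataset("eligapris/kirundi-english")
+ print(corpus)  # shows available splits and column names
+ ```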

+ ## Installation
+
+ To use this tokenizer in your project, first install the required dependencies:

+ ```bash
+ pip install transformers
+ ```

+ Then load the tokenizer directly from the Hugging Face Hub:

+ ```python
+ from transformers import AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
+ ```

+ Or, if you have downloaded the tokenizer files locally:

+ ```python
+ from transformers import PreTrainedTokenizerFast
+ tokenizer = PreTrainedTokenizerFast(tokenizer_file="kirundi_tokenizer.json")
+ ```

+ ## Usage Examples

+ ### Loading and Using the Tokenizer

+ You can load the tokenizer in two ways:

  ```python
+ # Method 1: Using AutoTokenizer (recommended)
+ from transformers import AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
+
+ # Method 2: Using PreTrainedTokenizerFast with a local file
+ from transformers import PreTrainedTokenizerFast
+ tokenizer = PreTrainedTokenizerFast(tokenizer_file="kirundi_tokenizer.json")
  ```

+ #### Basic Usage Examples
+
+ 1. Tokenize a single sentence:
+ ```python
+ # Basic tokenization
+ text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana."
+ encoded = tokenizer(text)
+ print(f"Input IDs: {encoded['input_ids']}")
+ print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")
+ ```
+
+ 2. Batch tokenization:
+ ```python
+ # Process multiple sentences at once
+ texts = [
+     "ifumbire mvaruganda.",
+     "aba azi gukora kandi afite ubushobozi"
+ ]
+ encoded = tokenizer(texts, padding=True, truncation=True)
+ print("Batch encoding:", encoded)
+ ```

+ 3. Get token IDs with special tokens:
+ ```python
+ # Add special tokens like [CLS] and [SEP]
+ encoded = tokenizer(text, add_special_tokens=True)
+ tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
+ print(f"Tokens with special tokens: {tokens}")
+ ```
+
+ 4. Decode tokenized text:
+ ```python
+ # Convert token IDs back to text
+ ids = encoded['input_ids']
+ decoded_text = tokenizer.decode(ids)
+ print(f"Decoded text: {decoded_text}")
+ ```
+
+ 5. Padding and truncation:
+ ```python
+ # Pad or truncate sequences to a specific length
+ encoded = tokenizer(
+     texts,
+     padding='max_length',
+     max_length=32,
+     truncation=True,
+     return_tensors='pt'  # Return PyTorch tensors
+ )
+ print("Padded sequences:", encoded['input_ids'].shape)
+ ```

+ ## Future Development
+
+ This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques like LoRA (Low-Rank Adaptation).

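+ For illustration, a LoRA fine-tune on top of this tokenizer might be configured as follows. This is a hypothetical sketch, not a shipped adapter: the base model name is a placeholder, the tokenizer is assumed to be loaded as shown above, and the hyperparameters (r=8, alpha=32, dropout 0.05, query/value projections) echo an earlier draft of this card rather than released weights:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Placeholder base model; any causal LM whose embeddings are resized
+ # to this tokenizer's vocabulary would work the same way.
+ base = AutoModelForCausalLM.from_pretrained("base-model-name")
+ base.resize_token_embeddings(len(tokenizer))
+
+ lora_config = LoraConfig(
+     r=8,                                  # rank of the low-rank update matrices
+     lora_alpha=32,                        # scaling factor
+     lora_dropout=0.05,
+     target_modules=["q_proj", "v_proj"],  # assumed names for the query/value projections
+ )
+ model = get_peft_model(base, lora_config)
+ model.print_trainable_parameters()
+ ```
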
+ ## Technical Specifications

+ ### Software Requirements
+ ```python
+ dependencies = {
+     "transformers": ">=4.30.0",
+     "tokenizers": ">=0.13.0"
+ }
+ ```

  ## Contact

+ eligapris

  ---

  ## Updates and Versions

  - v1.0.0 (Initial Release)
+ - Base tokenizer implementation
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

  ## Acknowledgments

  - Dataset provided by eligapris
+ - Hugging Face's Transformers and Tokenizers libraries