---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
tags:
- kirundi
- rn
---
# Kirundi Tokenizer and LoRA Model

## Model Description

This repository contains two main components:
1. A BPE tokenizer trained specifically for the Kirundi language (ISO 639-1: rn, ISO 639-3: run)
2. A LoRA adapter trained for Kirundi language processing

### Tokenizer Details
- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based

### LoRA Adapter Details
- **Base Model**: [To be filled with your chosen base model]
- **Rank**: 8
- **Alpha**: 32
- **Target Modules**: Query and Value attention matrices
- **Dropout**: 0.05
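With these hyperparameters, the adapter configuration might be expressed in PEFT as sketched below. Note that the concrete target module names (`q_proj`, `v_proj`) and the task type are assumptions for illustration; the actual names depend on the chosen base model's architecture.

```python
from peft import LoraConfig, TaskType

# Hypothetical sketch: module names vary by base model architecture.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,            # assumed task; adjust to your use case
    r=8,                                   # LoRA rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # query/value projections (assumed names)
)
```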

## Intended Uses & Limitations

### Intended Uses
- Text processing for Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- Foundation for developing Kirundi language applications

### Limitations
- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text

## Training Data

The model components were trained on the Kirundi-English parallel corpus:
- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed domain including religious, general, and conversational text

## Training Procedure

### Tokenizer Training
- Trained using Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30k
- Includes special tokens for task-specific usage
- Trained on the Kirundi portion of the parallel corpus
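The steps above can be sketched with the Tokenizers library. The two-sentence corpus here is a stand-in for the real Kirundi data, and the exact training settings may differ from those used for this release:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the released tokenizer was trained on the Kirundi side
# of the eligapris/kirundi-english parallel corpus.
corpus = ["Amahoro", "Murakoze cane"]

# BPE model with whitespace pre-tokenization, as described above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("kirundi-bpe-tokenizer.json")
```

With a real corpus, `train_from_iterator` can consume any iterator of strings, so the dataset never needs to be fully materialized in memory.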

### LoRA Training
[To be filled with your specific training details]
- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:

## Evaluation Results

[To be filled with your evaluation metrics]
- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:

## Environmental Impact

[To be filled with training compute details]
- Estimated CO2 emissions:
- Hardware used:
- Training duration:

## Technical Specifications

### Model Architecture
- Tokenizer: BPE-based with custom vocabulary
- LoRA Configuration:
  - r=8 (rank)
  - α=32 (scaling)
  - Trained on specific attention layers
  - Dropout rate: 0.05

### Software Requirements
```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0"
}
```
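Assuming a standard pip environment, the versions above can be installed with:

```shell
pip install "transformers>=4.30.0" "tokenizers>=0.13.0" "peft>=0.4.0"
```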

## How to Use

### Loading the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
```

### Loading the LoRA Model
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

config = PeftConfig.from_pretrained("path_to_lora_model")
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
```


## Contact

Eligapris

---

## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer and LoRA model
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for LoRA implementation